Commit d701b23

Merge pull request #12 from gjbex/development
Fixes
2 parents: 91110a7 + aec988b

39 files changed: +7449 −127 lines

.gitignore

+1

@@ -113,3 +113,4 @@ venv.bak/
 
 # direnv files
 .envrc
+/~$python_for_hpc.pptx

environment.yml

+12 −7

@@ -2,14 +2,19 @@ name: python_for_hpc
 channels:
   - defaults
 dependencies:
-  - numpy
+  - bokeh
+  - click
+  - cython
   - dask
+  - h5py
+  - jupyterlab
+  - mpi4py
   - numba
+  - numexpr
+  - numpy
+  - pandas
+  - python
   - scipy
-  - swig
-  - line_profiler
-  - cython
   - snakeviz
-  - mpi4py
-  - jupyterlab
-  - numexpr
+  - swig
+prefix: /home/gjb/miniconda3/envs/python_for_hpc
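To recreate this environment, `conda env create -f environment.yml` followed by `conda activate python_for_hpc` should be all that is needed; the `prefix:` line records the author's local install location and is effectively ignored by `conda env create` when a `name:` field is present.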

python_for_hpc.pptx

131 KB
Binary file not shown.

python_for_hpc_linux64_conda_specs.txt

+134 −105
Large diffs are not rendered by default.

source-code/README.md

+11

@@ -3,6 +3,7 @@
 This is source code that is either used in the presentation, or was developed
 to create it. There is some material not covered in the presentation as well.
 
+
 ## Requirements
 
 * Python version: at least 3.6
@@ -18,6 +19,10 @@ to create it. There is some material not covered in the presentation as well.
 * jupyter
 * ipywidgets
 
+* For the GPU code:
+    * pycuda
+    * scikit-cuda
+
 
 ## What is it?
 
@@ -34,3 +39,9 @@ to create it. There is some material not covered in the presentation as well.
 1. `profiling`: some illustrations and how-to on profiling a Python
    application.
 1. `pyspark`: illustrations of using PySpark.
+1. `hdf5`: examples of parallel I/O using HDF5.
+1. `numpy-scipy`: some numpy/scipy codes for benchmarking.
+1. `pypy`: code to experiment with the Pypy interpreter.
+1. `file-formats`: influence of file formats on performance.
+1. `gpu`: some examples of using GPUs.
+1. `performance`: general considerations about performance.

source-code/file-formats/.gitignore

+4

@@ -0,0 +1,4 @@
+# generated data files
+data.csv
+data.feather
+data.parquet

source-code/file-formats/README.md

+9

@@ -0,0 +1,9 @@
+# File formats
+
+The choice of file format can greatly influence I/O performance.
+
+
+## What is it?
+
+1. `pandas_io.ipynb`: pandas dataframes can be stored in various formats; here
+   CSV, Parquet, and Feather are compared for I/O bandwidth and size.
source-code/file-formats/pandas_io.ipynb

+320

# Requirements

In [1]:
import numpy as np
import pandas as pd

# Data set

For benchmarking, we create a dataframe with a size of the order of several 100 MB.

In [5]:
nr_rows = 10_000_000
df = pd.DataFrame({
    'A': np.random.normal(size=(nr_rows, )),
    'B': np.random.randint(1, high=5, size=(nr_rows, )),
    'C': np.random.normal(size=(nr_rows, )),
})

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 3 columns):
 #   Column  Dtype
---  ------  -----
 0   A       float64
 1   B       int64
 2   C       float64
dtypes: float64(2), int64(1)
memory usage: 228.9 MB
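A quick sanity check on that figure: 10,000,000 rows × 3 columns × 8 bytes per value is 240 MB, i.e. 240,000,000 / 1024² ≈ 228.9 MiB, matching the reported memory usage.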
# Formats

## CSV

CSV has the advantage that it is human-readable, but it is neither fast nor compact.

In [14]:
%timeit df.to_csv('data.csv')

27.3 s ± 197 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]:
%timeit pd.read_csv('data.csv')

2.71 s ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]:
!ls -hl data.csv

-rw-r--r-- 1 gjb gjb 469M Sep 14 14:44 data.csv
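Not in the notebook, but a closely related option: pandas can compress CSV transparently based on the file suffix. A minimal sketch, assuming the same `df`; the `data.csv.gz` name is illustrative:

    # compression is inferred from the .gz suffix (compression='infer' is
    # the default), trading an even slower write for a smaller file
    df.to_csv('data.csv.gz')
    df_csv = pd.read_csv('data.csv.gz')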
## Parquet

Parquet is a binary column-store format that has significantly better performance than CSV.

In [9]:
%timeit df.to_parquet('data.parquet')

477 ms ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]:
%timeit pd.read_parquet('data.parquet')

162 ms ± 8.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [16]:
!ls -lh data.parquet

-rw-r--r-- 1 gjb gjb 156M Sep 14 14:37 data.parquet

Parquet files are also more compact than their CSV counterparts.
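A hedged aside, not part of the notebook: pandas delegates Parquet I/O to an engine (pyarrow by default when installed, fastparquet otherwise), and the compression codec can be chosen per call. A minimal sketch, assuming pyarrow is available:

    # snappy is the default codec; zstd usually compresses harder
    # at a modest CPU cost
    df.to_parquet('data.parquet', engine='pyarrow', compression='zstd')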
## Feather

In [17]:
%timeit df.to_feather('data.feather')

473 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]:
%timeit pd.read_feather('data.feather')

201 ms ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [19]:
!ls -hl data.feather

-rw-r--r-- 1 gjb gjb 175M Sep 14 14:56 data.feather
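Feather, the on-disk form of the Arrow IPC format, is thus about as fast as Parquet for this data set but somewhat less compact (175M versus 156M); like Parquet with the default engine, `to_feather()` relies on pyarrow.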
(Notebook kernel: Python 3 (ipykernel), Python 3.8.11; nbformat 4.5)
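The same comparison can be run outside Jupyter. Below is a minimal stand-alone sketch under the same assumptions as the notebook (pandas plus pyarrow installed); the `bench()` helper and file names are introduced here for illustration, and each operation is timed once rather than with `%timeit`'s repeated runs:

    import os
    import time

    import numpy as np
    import pandas as pd

    def bench(label, write, read, path):
        '''Time a single write and read of the dataframe and report file size.'''
        start = time.perf_counter()
        write(path)
        write_time = time.perf_counter() - start
        start = time.perf_counter()
        read(path)
        read_time = time.perf_counter() - start
        size = os.path.getsize(path) / 2**20
        print(f'{label}: write {write_time:.2f} s, '
              f'read {read_time:.2f} s, {size:.0f} MiB')

    nr_rows = 10_000_000
    df = pd.DataFrame({
        'A': np.random.normal(size=(nr_rows, )),
        'B': np.random.randint(1, high=5, size=(nr_rows, )),
        'C': np.random.normal(size=(nr_rows, )),
    })

    bench('CSV', df.to_csv, pd.read_csv, 'data.csv')
    bench('Parquet', df.to_parquet, pd.read_parquet, 'data.parquet')
    bench('Feather', df.to_feather, pd.read_feather, 'data.feather')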
