Commit d701b23

Merge pull request #12 from gjbex/development
Fixes
2 parents: 91110a7 + aec988b

39 files changed: +7449 −127 lines

.gitignore

+1

@@ -113,3 +113,4 @@ venv.bak/
 
 # direnv files
 .envrc
+/~$python_for_hpc.pptx

environment.yml

+12 −7

@@ -2,14 +2,19 @@ name: python_for_hpc
 channels:
   - defaults
 dependencies:
-  - numpy
+  - bokeh
+  - click
+  - cython
   - dask
+  - h5py
+  - jupyterlab
+  - mpi4py
   - numba
+  - numexpr
+  - numpy
+  - pandas
+  - python
   - scipy
-  - swig
-  - line_profiler
-  - cython
   - snakeviz
-  - mpi4py
-  - jupyterlab
-  - numexpr
+  - swig
+prefix: /home/gjb/miniconda3/envs/python_for_hpc
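To recreate this environment, `conda env create -f environment.yml` followed by `conda activate python_for_hpc` should be all that is needed; the `prefix:` line records the author's local install location and is effectively ignored by `conda env create` when a `name:` field is present.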

python_for_hpc.pptx

131 KB
Binary file not shown.

python_for_hpc_linux64_conda_specs.txt

+134 −105
Large diffs are not rendered by default.

source-code/README.md

+11

@@ -3,6 +3,7 @@
 This is source code that is either used in the presentation, or was developed
 to create it. There is some material not covered in the presentation as well.
 
+
 ## Requirements
 
 * Python version: at least 3.6
@@ -18,6 +19,10 @@ to create it. There is some material not covered in the presentation as well.
 * jupyter
 * ipywidgets
 
+* For the GPU code:
+    * pycuda
+    * scikit-cuda
+
 
 ## What is it?
 
@@ -34,3 +39,9 @@ to create it. There is some material not covered in the presentation as well.
 1. `profiling`: some illustrations and how-to on profiling a Python
    application.
 1. `pyspark`: illustrations of using PySpark.
+1. `hdf5`: examples of parallel I/O using HDF5.
+1. `numpy-scipy`: some numpy/scipy codes for benchmarking.
+1. `pypy`: code to experiment with the Pypy interpreter.
+1. `file-formats`: influence of file formats on performance.
+1. `gpu`: some examples of using GPUs.
+1. `performance`: general considerations about performance.

source-code/file-formats/.gitignore

+4

@@ -0,0 +1,4 @@
+# generated data files
+data.csv
+data.feather
+data.parquet

source-code/file-formats/README.md

+9

@@ -0,0 +1,9 @@
+# File formats
+
+The choice of file format can greatly influence I/O performance.
+
+
+## What is it?
+
+1. `pandas_io.ipynb`: pandas dataframes can be stored in various formats; here
+   CSV, Parquet, and Feather are compared for I/O bandwidth and size.
source-code/file-formats/pandas_io.ipynb

+320

# Requirements

In [1]:
import numpy as np
import pandas as pd

# Data set

For benchmarking, we create a dataframe with a size of the order of several 100 MB.

In [5]:
nr_rows = 10_000_000
df = pd.DataFrame({
    'A': np.random.normal(size=(nr_rows, )),
    'B': np.random.randint(1, high=5, size=(nr_rows, )),
    'C': np.random.normal(size=(nr_rows, )),
})

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 3 columns):
 #   Column  Dtype
---  ------  -----
 0   A       float64
 1   B       int64
 2   C       float64
dtypes: float64(2), int64(1)
memory usage: 228.9 MB
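A quick sanity check on that figure: 10,000,000 rows × 3 columns × 8 bytes per value is 240 MB, i.e. 240,000,000 / 1024² ≈ 228.9 MiB, matching the reported memory usage.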
# Formats

## CSV

CSV has the advantage that it is human-readable, but it is neither fast nor compact.

In [14]:
%timeit df.to_csv('data.csv')

27.3 s ± 197 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]:
%timeit pd.read_csv('data.csv')

2.71 s ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]:
!ls -hl data.csv

-rw-r--r-- 1 gjb gjb 469M Sep 14 14:44 data.csv
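Not in the notebook, but a closely related option: pandas can compress CSV transparently based on the file suffix. A minimal sketch, assuming the same `df`; the `data.csv.gz` name is illustrative:

    # compression is inferred from the .gz suffix (compression='infer' is
    # the default), trading an even slower write for a smaller file
    df.to_csv('data.csv.gz')
    df_csv = pd.read_csv('data.csv.gz')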
## Parquet

Parquet is a binary column-store format that has significantly better performance than CSV.

In [9]:
%timeit df.to_parquet('data.parquet')

477 ms ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]:
%timeit pd.read_parquet('data.parquet')

162 ms ± 8.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [16]:
!ls -lh data.parquet

-rw-r--r-- 1 gjb gjb 156M Sep 14 14:37 data.parquet

Parquet files are also more compact than their CSV counterparts.
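A hedged aside, not part of the notebook: pandas delegates Parquet I/O to an engine (pyarrow by default when installed, fastparquet otherwise), and the compression codec can be chosen per call. A minimal sketch, assuming pyarrow is available:

    # snappy is the default codec; zstd usually compresses harder
    # at a modest CPU cost
    df.to_parquet('data.parquet', engine='pyarrow', compression='zstd')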
## Feather

In [17]:
%timeit df.to_feather('data.feather')

473 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]:
%timeit pd.read_feather('data.feather')

201 ms ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [19]:
!ls -hl data.feather

-rw-r--r-- 1 gjb gjb 175M Sep 14 14:56 data.feather
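Feather, the on-disk form of the Arrow IPC format, is thus about as fast as Parquet for this data set but somewhat less compact (175M versus 156M); like Parquet with the default engine, `to_feather()` relies on pyarrow.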
(Notebook kernel: Python 3 (ipykernel), Python 3.8.11; nbformat 4.5)
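The same comparison can be run outside Jupyter. Below is a minimal stand-alone sketch under the same assumptions as the notebook (pandas plus pyarrow installed); the `bench()` helper and file names are introduced here for illustration, and each operation is timed once rather than with `%timeit`'s repeated runs:

    import os
    import time

    import numpy as np
    import pandas as pd

    def bench(label, write, read, path):
        '''Time a single write and read of the dataframe and report file size.'''
        start = time.perf_counter()
        write(path)
        write_time = time.perf_counter() - start
        start = time.perf_counter()
        read(path)
        read_time = time.perf_counter() - start
        size = os.path.getsize(path) / 2**20
        print(f'{label}: write {write_time:.2f} s, '
              f'read {read_time:.2f} s, {size:.0f} MiB')

    nr_rows = 10_000_000
    df = pd.DataFrame({
        'A': np.random.normal(size=(nr_rows, )),
        'B': np.random.randint(1, high=5, size=(nr_rows, )),
        'C': np.random.normal(size=(nr_rows, )),
    })

    bench('CSV', df.to_csv, pd.read_csv, 'data.csv')
    bench('Parquet', df.to_parquet, pd.read_parquet, 'data.parquet')
    bench('Feather', df.to_feather, pd.read_feather, 'data.feather')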
