I ran these on the following hardware:
- Intel Xeon E5-2650 v4 @ 2.20 GHz
- 512GB DDR4 memory (not that we would need it)
- NVIDIA Tesla P100 (16GB memory)
Software stack:
- CentOS 7
- GNU compiler toolkit 8.3.0
- Python 3.7.3
- CUDA 10.1
- Most packages pulled from conda-forge (see below for exceptions)
- Backend versions:
bohrium==0.11.0.post19  # built from source
cupy==7.6.0
jax==0.1.72  # built from source
jaxlib==0.1.51  # built from source
llvmlite==0.33.0
numba==0.50.1
numpy==1.19.0
pytorch==1.4.0
tensorflow==2.2.0
theano==1.0.4
An equation consisting of >100 terms with no data dependencies and only elementary math. This benchmark should represent a best-case scenario for vector instructions and GPU performance.
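To make "no data dependencies" concrete, here is a minimal sketch of an elementwise expression in NumPy. The function `toy_equation` and its coefficients are hypothetical stand-ins, not the actual equation of state used by the benchmark; the point is only that every output element depends on the inputs at the same index and on nothing else.

```python
import numpy as np


def toy_equation(salt, temp, pressure):
    """Hypothetical elementwise polynomial, NOT the benchmark's real equation."""
    # Each output element uses only the inputs at the same index, so the
    # expression can be split freely across SIMD lanes or GPU threads.
    return (
        1.0
        + 0.8 * salt
        - 0.2 * temp
        + 0.05 * temp * temp
        + 0.01 * salt * np.sqrt(np.abs(temp))
        + 1e-3 * pressure * (salt - temp)
    )


rng = np.random.default_rng(0)
salt, temp, pressure = rng.random((3, 4096))

# Whole-array evaluation.
vectorized = toy_equation(salt, temp, pressure)

# One element at a time gives the same values, which shows that no element
# needs any other element's result.
elementwise = np.array(
    [toy_equation(s, t, p) for s, t, p in zip(salt, temp, pressure)]
)
```

Because the two evaluation orders agree, a compiler or GPU backend is free to fuse all terms into a single pass over the data, which is presumably where most of the speedup over plain NumPy comes from.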
$ taskset -c 23 python run.py benchmarks/equation_of_state/
benchmarks.equation_of_state
============================
Running on CPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 numba 10,000 0.001 0.000 0.001 0.001 0.001 0.001 0.013 2.977
4,096 theano 10,000 0.001 0.000 0.001 0.001 0.001 0.001 0.013 2.964
4,096 tensorflow 10,000 0.001 0.000 0.001 0.001 0.001 0.001 0.010 2.937
4,096 jax 10,000 0.001 0.000 0.001 0.001 0.001 0.001 0.012 2.584
4,096 numpy 1,000 0.002 0.001 0.002 0.002 0.002 0.002 0.013 1.000
4,096 pytorch 1,000 0.002 0.000 0.002 0.002 0.002 0.002 0.011 0.758
4,096 bohrium 100 0.055 0.001 0.054 0.054 0.055 0.055 0.058 0.031
16,384 tensorflow 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.013 3.784
16,384 jax 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.014 3.414
16,384 theano 1,000 0.002 0.000 0.002 0.002 0.002 0.002 0.011 3.103
16,384 numba 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.016 2.867
16,384 numpy 1,000 0.007 0.000 0.007 0.007 0.007 0.007 0.016 1.000
16,384 pytorch 1,000 0.007 0.000 0.007 0.007 0.007 0.007 0.015 0.999
16,384 bohrium 100 0.057 0.001 0.056 0.056 0.057 0.057 0.060 0.126
65,536 tensorflow 1,000 0.006 0.000 0.006 0.006 0.006 0.006 0.016 5.515
65,536 jax 1,000 0.007 0.000 0.007 0.007 0.007 0.007 0.008 5.042
65,536 theano 1,000 0.009 0.000 0.008 0.009 0.009 0.009 0.021 4.077
65,536 numba 1,000 0.010 0.001 0.010 0.010 0.010 0.010 0.019 3.657
65,536 pytorch 100 0.034 0.005 0.027 0.028 0.036 0.040 0.042 1.037
65,536 numpy 100 0.035 0.006 0.029 0.029 0.037 0.041 0.045 1.000
65,536 bohrium 100 0.065 0.002 0.063 0.064 0.064 0.064 0.072 0.548
262,144 tensorflow 1,000 0.021 0.001 0.020 0.021 0.021 0.021 0.030 8.316
262,144 jax 1,000 0.025 0.001 0.023 0.024 0.025 0.025 0.028 7.036
262,144 theano 100 0.031 0.001 0.030 0.031 0.031 0.031 0.034 5.647
262,144 numba 100 0.035 0.001 0.035 0.035 0.035 0.035 0.038 5.020
262,144 bohrium 100 0.091 0.002 0.090 0.091 0.091 0.091 0.100 1.924
262,144 pytorch 100 0.173 0.009 0.147 0.167 0.171 0.180 0.198 1.015
262,144 numpy 100 0.176 0.006 0.164 0.169 0.179 0.181 0.187 1.000
1,048,576 tensorflow 100 0.099 0.003 0.092 0.097 0.099 0.101 0.108 7.383
1,048,576 jax 100 0.100 0.003 0.094 0.098 0.099 0.101 0.108 7.288
1,048,576 theano 100 0.129 0.003 0.125 0.127 0.127 0.129 0.139 5.673
1,048,576 numba 100 0.145 0.003 0.144 0.144 0.144 0.145 0.160 5.021
1,048,576 bohrium 100 0.207 0.004 0.200 0.205 0.206 0.206 0.220 3.534
1,048,576 numpy 10 0.730 0.005 0.724 0.726 0.730 0.735 0.737 1.000
1,048,576 pytorch 10 0.839 0.008 0.830 0.833 0.837 0.843 0.858 0.870
4,194,304 tensorflow 10 0.389 0.009 0.379 0.386 0.387 0.388 0.413 9.164
4,194,304 jax 10 0.407 0.012 0.388 0.407 0.408 0.409 0.432 8.751
4,194,304 theano 10 0.518 0.015 0.510 0.510 0.510 0.511 0.547 6.878
4,194,304 numba 10 0.581 0.020 0.570 0.570 0.571 0.578 0.627 6.125
4,194,304 bohrium 10 0.650 0.002 0.648 0.649 0.650 0.651 0.654 5.476
4,194,304 numpy 10 3.560 0.019 3.548 3.550 3.554 3.560 3.614 1.000
4,194,304 pytorch 10 4.790 0.024 4.770 4.774 4.778 4.793 4.841 0.743
(time in wall seconds, less is better)
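The Δ column is consistent with being the NumPy mean divided by the backend's mean, so values above 1 indicate a speedup over NumPy. This reading is inferred from the numbers in the table, not from the benchmark's code. A small sketch using the printed means of the 4,194,304-element rows above:

```python
def speedup(numpy_mean, backend_mean):
    """Ratio of the NumPy mean runtime to a backend's mean runtime."""
    return numpy_mean / backend_mean


# Mean runtimes in seconds, copied from the 4,194,304-element CPU rows.
numpy_mean = 3.560
pytorch_mean = 4.790
tensorflow_mean = 0.389

pytorch_delta = round(speedup(numpy_mean, pytorch_mean), 3)
tensorflow_delta = round(speedup(numpy_mean, tensorflow_mean), 3)
print(pytorch_delta, tensorflow_delta)
```

The PyTorch value reproduces the table entry (0.743). The TensorFlow value lands close to, but not exactly on, the printed 9.164, which is what one would expect if Δ is computed from unrounded means while the table shows only three decimals.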
$ taskset -c 23 python run.py benchmarks/equation_of_state/ -s 16777216
benchmarks.equation_of_state
============================
Running on CPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
16,777,216 tensorflow 10 1.496 0.029 1.480 1.483 1.483 1.486 1.577 9.149
16,777,216 jax 10 1.811 0.039 1.769 1.793 1.797 1.820 1.904 7.559
16,777,216 theano 10 2.104 0.040 2.086 2.088 2.090 2.094 2.224 6.506
16,777,216 numba 10 2.117 0.023 2.105 2.106 2.109 2.110 2.183 6.466
16,777,216 bohrium 10 2.449 0.037 2.423 2.425 2.426 2.481 2.511 5.588
16,777,216 numpy 10 13.686 0.025 13.660 13.666 13.673 13.704 13.732 1.000
16,777,216 pytorch 10 18.330 0.035 18.270 18.310 18.331 18.339 18.409 0.747
(time in wall seconds, less is better)
$ for backend in bohrium cupy jax pytorch tensorflow theano; do CUDA_VISIBLE_DEVICES="0" python run.py benchmarks/equation_of_state/ --gpu -b $backend -b numpy; done
benchmarks.equation_of_state
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 numpy 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.015 1.000
4,096 bohrium 100 0.056 0.001 0.055 0.055 0.055 0.055 0.061 0.029
16,384 numpy 1,000 0.007 0.000 0.007 0.007 0.007 0.007 0.012 1.000
16,384 bohrium 100 0.056 0.001 0.055 0.055 0.055 0.055 0.059 0.126
65,536 numpy 100 0.030 0.002 0.029 0.029 0.029 0.029 0.040 1.000
65,536 bohrium 100 0.056 0.001 0.055 0.055 0.055 0.055 0.061 0.531
262,144 bohrium 100 0.056 0.001 0.055 0.055 0.055 0.056 0.059 2.423
262,144 numpy 100 0.135 0.004 0.120 0.133 0.133 0.134 0.161 1.000
1,048,576 bohrium 100 0.056 0.001 0.055 0.056 0.056 0.056 0.061 13.903
1,048,576 numpy 10 0.779 0.009 0.771 0.774 0.774 0.785 0.794 1.000
4,194,304 bohrium 100 0.057 0.001 0.056 0.057 0.057 0.057 0.062 62.559
4,194,304 numpy 10 3.566 0.022 3.552 3.555 3.559 3.562 3.631 1.000
(time in wall seconds, less is better)
benchmarks.equation_of_state
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 numpy 10,000 0.002 0.001 0.002 0.002 0.002 0.002 0.015 1.000
4,096 cupy 1,000 0.008 0.001 0.008 0.008 0.008 0.008 0.020 0.199
16,384 numpy 1,000 0.007 0.001 0.007 0.007 0.007 0.007 0.021 1.000
16,384 cupy 1,000 0.008 0.001 0.008 0.008 0.008 0.008 0.022 0.870
65,536 cupy 1,000 0.009 0.001 0.008 0.008 0.008 0.008 0.022 4.882
65,536 numpy 100 0.042 0.003 0.030 0.040 0.042 0.042 0.049 1.000
262,144 cupy 1,000 0.009 0.001 0.008 0.008 0.008 0.008 0.020 21.548
262,144 numpy 100 0.185 0.002 0.181 0.184 0.184 0.185 0.192 1.000
1,048,576 cupy 1,000 0.016 0.001 0.016 0.016 0.016 0.016 0.030 46.246
1,048,576 numpy 10 0.747 0.001 0.745 0.746 0.747 0.748 0.748 1.000
4,194,304 cupy 100 0.060 0.000 0.060 0.060 0.060 0.060 0.061 58.375
4,194,304 numpy 10 3.527 0.002 3.523 3.525 3.527 3.528 3.530 1.000
(time in wall seconds, less is better)
benchmarks.equation_of_state
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 jax 10,000 0.000 0.000 0.000 0.000 0.000 0.000 0.016 6.402
4,096 numpy 10,000 0.002 0.001 0.002 0.002 0.002 0.002 0.017 1.000
16,384 jax 10,000 0.000 0.001 0.000 0.000 0.000 0.000 0.016 35.237
16,384 numpy 1,000 0.010 0.002 0.007 0.009 0.009 0.012 0.022 1.000
65,536 jax 10,000 0.000 0.001 0.000 0.000 0.000 0.000 0.017 160.817
65,536 numpy 100 0.048 0.003 0.029 0.047 0.048 0.050 0.051 1.000
262,144 jax 10,000 0.000 0.001 0.000 0.000 0.000 0.000 0.016 515.477
262,144 numpy 100 0.188 0.002 0.171 0.187 0.188 0.189 0.192 1.000
1,048,576 jax 10,000 0.002 0.001 0.001 0.001 0.002 0.002 0.016 515.759
1,048,576 numpy 10 0.789 0.005 0.783 0.786 0.787 0.789 0.802 1.000
4,194,304 jax 10,000 0.001 0.001 0.001 0.001 0.001 0.001 0.017 2566.638
4,194,304 numpy 10 3.490 0.016 3.476 3.480 3.485 3.490 3.527 1.000
(time in wall seconds, less is better)
benchmarks.equation_of_state
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 pytorch 100,000 0.000 0.000 0.000 0.000 0.000 0.000 0.016 27.642
4,096 numpy 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.017 1.000
16,384 pytorch 100,000 0.000 0.000 0.000 0.000 0.000 0.000 0.017 113.228
16,384 numpy 1,000 0.007 0.000 0.007 0.007 0.007 0.007 0.019 1.000
65,536 pytorch 100,000 0.000 0.000 0.000 0.000 0.000 0.000 0.016 413.623
65,536 numpy 100 0.036 0.007 0.029 0.029 0.033 0.043 0.045 1.000
262,144 pytorch 100,000 0.000 0.000 0.000 0.000 0.000 0.000 0.015 1028.368
262,144 numpy 100 0.179 0.007 0.168 0.171 0.183 0.184 0.193 1.000
1,048,576 pytorch 10,000 0.001 0.000 0.001 0.001 0.001 0.001 0.001 1252.851
1,048,576 numpy 10 0.722 0.004 0.718 0.719 0.720 0.726 0.728 1.000
4,194,304 pytorch 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.014 1754.906
4,194,304 numpy 10 3.470 0.002 3.467 3.468 3.470 3.472 3.474 1.000
(time in wall seconds, less is better)
benchmarks.equation_of_state
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 tensorflow 10,000 0.001 0.001 0.000 0.000 0.000 0.001 0.018 3.225
4,096 numpy 1,000 0.002 0.000 0.002 0.002 0.002 0.002 0.014 1.000
16,384 tensorflow 10,000 0.001 0.001 0.000 0.001 0.001 0.001 0.016 12.167
16,384 numpy 1,000 0.008 0.001 0.006 0.007 0.007 0.008 0.012 1.000
65,536 tensorflow 10,000 0.001 0.001 0.001 0.001 0.001 0.001 0.016 62.993
65,536 numpy 100 0.046 0.004 0.042 0.044 0.045 0.047 0.078 1.000
262,144 tensorflow 10,000 0.001 0.001 0.001 0.001 0.001 0.001 0.019 144.502
262,144 numpy 100 0.191 0.007 0.180 0.187 0.189 0.193 0.224 1.000
1,048,576 tensorflow 10,000 0.003 0.000 0.003 0.003 0.003 0.003 0.014 248.496
1,048,576 numpy 10 0.807 0.010 0.796 0.801 0.804 0.815 0.823 1.000
4,194,304 tensorflow 10,000 0.009 0.000 0.009 0.009 0.009 0.009 0.023 384.760
4,194,304 numpy 10 3.616 0.079 3.543 3.564 3.590 3.624 3.809 1.000
(time in wall seconds, less is better)
benchmarks.equation_of_state
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 theano 10,000 0.000 0.001 0.000 0.000 0.000 0.000 0.013 6.897
4,096 numpy 1,000 0.002 0.000 0.002 0.002 0.002 0.002 0.012 1.000
16,384 theano 10,000 0.000 0.001 0.000 0.000 0.000 0.000 0.013 23.210
16,384 numpy 1,000 0.007 0.001 0.007 0.007 0.007 0.007 0.021 1.000
65,536 theano 10,000 0.001 0.001 0.001 0.001 0.001 0.001 0.013 65.748
65,536 numpy 100 0.038 0.002 0.030 0.037 0.037 0.039 0.044 1.000
262,144 theano 10,000 0.002 0.000 0.001 0.001 0.001 0.002 0.014 121.565
262,144 numpy 100 0.183 0.005 0.175 0.180 0.182 0.187 0.195 1.000
1,048,576 theano 1,000 0.006 0.001 0.005 0.005 0.005 0.008 0.020 130.103
1,048,576 numpy 10 0.816 0.018 0.785 0.810 0.814 0.831 0.843 1.000
4,194,304 theano 1,000 0.029 0.000 0.024 0.029 0.029 0.030 0.032 123.718
4,194,304 numpy 10 3.633 0.051 3.541 3.593 3.642 3.669 3.705 1.000
(time in wall seconds, less is better)
A more balanced routine with many data dependencies (stencil operations), and tensor shapes of up to 5 dimensions. This is the most expensive part of Veros, so in a way this is the benchmark that interests me the most.
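For readers unfamiliar with the term, a stencil operation computes each output cell from a fixed pattern of neighbouring input cells. The sketch below uses a five-point Laplacian on a 2D array as an illustration; it is a hypothetical example and far simpler than the isoneutral mixing routine itself, which works on arrays of up to 5 dimensions.

```python
import numpy as np


def laplacian_2d(field):
    """Five-point stencil on the interior of a 2D array (illustrative only)."""
    out = np.zeros_like(field)
    # Each interior cell reads its four neighbours through shifted slices.
    out[1:-1, 1:-1] = (
        field[2:, 1:-1]
        + field[:-2, 1:-1]
        + field[1:-1, 2:]
        + field[1:-1, :-2]
        - 4.0 * field[1:-1, 1:-1]
    )
    return out


x = np.arange(6.0)
field = x[:, None] ** 2 + x[None, :] ** 2  # f(i, j) = i^2 + j^2
result = laplacian_2d(field)
```

In plain NumPy each shifted slice and each arithmetic step produces a temporary array, so memory traffic grows with the number of terms. That overhead is one plausible reason the compiled backends gain less here than on the purely elementwise benchmark.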
$ taskset -c 23 python run.py benchmarks/isoneutral_mixing/
benchmarks.isoneutral_mixing
============================
Running on CPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 numba 10,000 0.001 0.000 0.001 0.001 0.001 0.001 0.009 3.333
4,096 jax 10,000 0.002 0.001 0.002 0.002 0.002 0.002 0.036 2.046
4,096 theano 1,000 0.003 0.000 0.002 0.003 0.003 0.003 0.006 1.415
4,096 numpy 1,000 0.004 0.000 0.004 0.004 0.004 0.004 0.011 1.000
4,096 pytorch 1,000 0.007 0.000 0.007 0.007 0.007 0.007 0.011 0.557
4,096 bohrium 100 0.072 0.001 0.071 0.071 0.071 0.071 0.076 0.053
16,384 numba 1,000 0.006 0.000 0.005 0.006 0.006 0.006 0.013 2.549
16,384 jax 1,000 0.006 0.000 0.006 0.006 0.006 0.006 0.010 2.353
16,384 theano 1,000 0.010 0.000 0.010 0.010 0.010 0.010 0.015 1.435
16,384 numpy 1,000 0.014 0.000 0.014 0.014 0.014 0.014 0.020 1.000
16,384 pytorch 1,000 0.015 0.000 0.015 0.015 0.015 0.015 0.020 0.930
16,384 bohrium 100 0.078 0.001 0.077 0.077 0.077 0.078 0.082 0.184
65,536 numba 1,000 0.025 0.001 0.025 0.025 0.025 0.025 0.033 2.238
65,536 jax 1,000 0.026 0.001 0.025 0.026 0.026 0.026 0.037 2.173
65,536 theano 100 0.039 0.001 0.039 0.039 0.039 0.039 0.042 1.427
65,536 pytorch 100 0.044 0.001 0.043 0.043 0.044 0.044 0.046 1.278
65,536 numpy 100 0.056 0.001 0.055 0.056 0.056 0.056 0.062 1.000
65,536 bohrium 100 0.106 0.003 0.103 0.104 0.104 0.108 0.115 0.528
262,144 numba 100 0.098 0.004 0.095 0.095 0.096 0.098 0.114 2.359
262,144 jax 100 0.117 0.002 0.115 0.116 0.116 0.116 0.126 1.972
262,144 theano 100 0.170 0.010 0.151 0.161 0.170 0.177 0.191 1.358
262,144 pytorch 100 0.175 0.006 0.166 0.170 0.175 0.180 0.190 1.314
262,144 bohrium 100 0.210 0.005 0.204 0.206 0.208 0.214 0.221 1.099
262,144 numpy 100 0.230 0.009 0.218 0.221 0.230 0.236 0.256 1.000
1,048,576 numba 10 0.457 0.001 0.456 0.456 0.456 0.457 0.459 2.452
1,048,576 jax 10 0.532 0.001 0.531 0.531 0.532 0.533 0.533 2.106
1,048,576 bohrium 10 0.646 0.013 0.633 0.641 0.643 0.644 0.684 1.733
1,048,576 theano 10 0.760 0.004 0.756 0.758 0.759 0.759 0.770 1.475
1,048,576 pytorch 10 0.958 0.013 0.948 0.949 0.956 0.959 0.994 1.169
1,048,576 numpy 10 1.120 0.004 1.115 1.118 1.119 1.122 1.127 1.000
4,194,304 numba 10 1.886 0.024 1.851 1.867 1.885 1.904 1.931 2.529
4,194,304 jax 10 2.255 0.021 2.242 2.245 2.248 2.254 2.318 2.115
4,194,304 bohrium 10 2.374 0.024 2.349 2.358 2.363 2.382 2.433 2.009
4,194,304 theano 10 3.073 0.028 3.054 3.058 3.065 3.071 3.155 1.552
4,194,304 pytorch 10 4.756 0.188 4.556 4.563 4.752 4.910 5.007 1.003
4,194,304 numpy 10 4.769 0.017 4.752 4.757 4.762 4.779 4.799 1.000
(time in wall seconds, less is better)
$ taskset -c 23 python run.py benchmarks/isoneutral_mixing/ -s 16777216
benchmarks.isoneutral_mixing
============================
Running on CPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
16,777,216 numba 10 7.652 0.032 7.575 7.637 7.664 7.669 7.693 2.935
16,777,216 jax 10 8.880 0.052 8.838 8.847 8.859 8.890 9.022 2.529
16,777,216 bohrium 10 9.559 0.124 9.354 9.519 9.566 9.611 9.791 2.350
16,777,216 theano 10 12.890 0.050 12.801 12.859 12.888 12.921 12.977 1.743
16,777,216 numpy 10 22.462 0.049 22.373 22.424 22.468 22.496 22.548 1.000
16,777,216 pytorch 10 24.891 0.039 24.839 24.866 24.884 24.915 24.973 0.902
(time in wall seconds, less is better)
$ for backend in bohrium cupy jax pytorch theano; do CUDA_VISIBLE_DEVICES="0" python run.py benchmarks/isoneutral_mixing/ --gpu -b $backend -b numpy; done
benchmarks.isoneutral_mixing
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 numpy 1,000 0.004 0.001 0.004 0.004 0.004 0.004 0.012 1.000
4,096 bohrium 100 0.076 0.002 0.074 0.075 0.075 0.075 0.082 0.052
16,384 numpy 1,000 0.014 0.001 0.014 0.014 0.014 0.014 0.047 1.000
16,384 bohrium 100 0.076 0.002 0.074 0.075 0.075 0.076 0.088 0.190
65,536 numpy 100 0.057 0.002 0.055 0.056 0.056 0.056 0.063 1.000
65,536 bohrium 100 0.077 0.002 0.075 0.076 0.077 0.078 0.086 0.733
262,144 bohrium 100 0.080 0.003 0.076 0.077 0.080 0.081 0.089 3.098
262,144 numpy 100 0.248 0.006 0.234 0.243 0.248 0.249 0.267 1.000
1,048,576 bohrium 100 0.092 0.005 0.086 0.086 0.094 0.095 0.103 12.269
1,048,576 numpy 10 1.127 0.013 1.118 1.120 1.122 1.128 1.165 1.000
4,194,304 bohrium 100 0.155 0.005 0.145 0.152 0.156 0.157 0.166 31.080
4,194,304 numpy 10 4.815 0.068 4.755 4.768 4.779 4.841 4.941 1.000
(time in wall seconds, less is better)
benchmarks.isoneutral_mixing
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 numpy 1,000 0.004 0.001 0.004 0.004 0.004 0.004 0.010 1.000
4,096 cupy 1,000 0.011 0.001 0.010 0.011 0.011 0.011 0.015 0.356
16,384 cupy 1,000 0.011 0.001 0.010 0.011 0.011 0.011 0.017 1.320
16,384 numpy 1,000 0.014 0.000 0.014 0.014 0.014 0.014 0.018 1.000
65,536 cupy 1,000 0.011 0.001 0.011 0.011 0.011 0.011 0.017 5.163
65,536 numpy 100 0.056 0.002 0.055 0.055 0.055 0.056 0.065 1.000
262,144 cupy 1,000 0.011 0.001 0.011 0.011 0.011 0.011 0.015 21.722
262,144 numpy 100 0.245 0.005 0.234 0.241 0.245 0.247 0.261 1.000
1,048,576 cupy 1,000 0.022 0.000 0.022 0.022 0.022 0.022 0.026 50.066
1,048,576 numpy 10 1.120 0.006 1.114 1.115 1.117 1.122 1.134 1.000
4,194,304 cupy 100 0.085 0.001 0.085 0.085 0.085 0.086 0.088 55.967
4,194,304 numpy 10 4.778 0.021 4.768 4.769 4.772 4.775 4.840 1.000
(time in wall seconds, less is better)
benchmarks.isoneutral_mixing
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 jax 10,000 0.001 0.000 0.001 0.001 0.001 0.001 0.011 2.724
4,096 numpy 1,000 0.004 0.000 0.004 0.004 0.004 0.004 0.008 1.000
16,384 jax 10,000 0.001 0.000 0.001 0.001 0.001 0.001 0.007 9.922
16,384 numpy 1,000 0.015 0.001 0.014 0.014 0.014 0.015 0.027 1.000
65,536 jax 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.008 27.143
65,536 numpy 100 0.058 0.003 0.055 0.056 0.058 0.059 0.071 1.000
262,144 jax 1,000 0.006 0.000 0.006 0.006 0.006 0.006 0.011 43.892
262,144 numpy 100 0.251 0.003 0.244 0.249 0.250 0.251 0.264 1.000
1,048,576 jax 1,000 0.019 0.000 0.019 0.019 0.019 0.019 0.024 57.624
1,048,576 numpy 10 1.117 0.005 1.111 1.113 1.114 1.122 1.128 1.000
4,194,304 jax 100 0.072 0.000 0.071 0.071 0.072 0.072 0.073 66.022
4,194,304 numpy 10 4.728 0.023 4.692 4.707 4.741 4.745 4.749 1.000
(time in wall seconds, less is better)
benchmarks.isoneutral_mixing
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 numpy 1,000 0.004 0.000 0.004 0.004 0.004 0.004 0.010 1.000
4,096 pytorch 1,000 0.008 0.001 0.008 0.008 0.008 0.008 0.015 0.483
16,384 pytorch 1,000 0.008 0.001 0.008 0.008 0.008 0.008 0.013 1.793
16,384 numpy 1,000 0.014 0.001 0.014 0.014 0.014 0.014 0.019 1.000
65,536 pytorch 1,000 0.008 0.001 0.008 0.008 0.008 0.008 0.014 6.817
65,536 numpy 100 0.056 0.002 0.055 0.055 0.055 0.056 0.065 1.000
262,144 pytorch 1,000 0.008 0.001 0.008 0.008 0.008 0.008 0.012 29.159
262,144 numpy 100 0.246 0.004 0.224 0.242 0.247 0.249 0.259 1.000
1,048,576 pytorch 1,000 0.020 0.000 0.020 0.020 0.020 0.020 0.024 56.054
1,048,576 numpy 10 1.115 0.003 1.113 1.113 1.114 1.116 1.122 1.000
4,194,304 pytorch 100 0.074 0.001 0.074 0.074 0.074 0.074 0.079 64.054
4,194,304 numpy 10 4.732 0.004 4.726 4.730 4.733 4.735 4.739 1.000
(time in wall seconds, less is better)
benchmarks.isoneutral_mixing
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 theano 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.007 2.240
4,096 numpy 1,000 0.004 0.000 0.004 0.004 0.004 0.004 0.008 1.000
16,384 theano 10,000 0.003 0.000 0.002 0.002 0.002 0.003 0.007 5.480
16,384 numpy 1,000 0.014 0.000 0.014 0.014 0.014 0.014 0.019 1.000
65,536 theano 1,000 0.006 0.000 0.006 0.006 0.006 0.007 0.013 9.001
65,536 numpy 100 0.056 0.001 0.055 0.055 0.055 0.056 0.061 1.000
262,144 theano 1,000 0.018 0.002 0.017 0.017 0.017 0.019 0.032 12.259
262,144 numpy 100 0.226 0.011 0.218 0.220 0.222 0.225 0.279 1.000
1,048,576 theano 100 0.103 0.007 0.085 0.099 0.100 0.113 0.115 10.890
1,048,576 numpy 10 1.127 0.021 1.110 1.113 1.120 1.124 1.174 1.000
4,194,304 theano 10 0.386 0.018 0.380 0.380 0.380 0.381 0.439 12.280
4,194,304 numpy 10 4.741 0.020 4.723 4.729 4.733 4.739 4.781 1.000
(time in wall seconds, less is better)
This routine combines some stencil operations with some linear algebra (a tridiagonal matrix solver); the solver cannot be vectorized.
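To show why a tridiagonal solve resists vectorization, here is a minimal sketch of the classic Thomas algorithm in pure Python. It is an assumption that the benchmark uses something of this shape; the source does not show its solver. The function name `solve_tridiagonal` and the test system are made up for illustration.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a single tridiagonal system (no pivoting).

    lower[0] and upper[-1] are ignored.
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    # Forward sweep: step i needs the result of step i - 1.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    # Back substitution: step i needs the result of step i + 1.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


# Small test system whose exact solution is [1, 1, 1, 1].
lower = [0.0, -1.0, -1.0, -1.0]
diag = [2.0, 2.0, 2.0, 2.0]
upper = [-1.0, -1.0, -1.0, 0.0]
rhs = [1.0, 0.0, 0.0, 1.0]
solution = solve_tridiagonal(lower, diag, upper, rhs)
```

Both loops carry a dependency from one iteration to the next, so the work along the solve axis cannot be expressed as a single array operation. Many independent systems can still be solved side by side, but each one is swept through in order.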
$ taskset -c 23 python run.py benchmarks/turbulent_kinetic_energy/
benchmarks.turbulent_kinetic_energy
===================================
Running on CPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 jax 10,000 0.001 0.000 0.001 0.001 0.001 0.001 0.007 2.001
4,096 numba 10,000 0.001 0.000 0.001 0.001 0.001 0.001 0.005 1.852
4,096 numpy 1,000 0.002 0.000 0.002 0.002 0.002 0.002 0.007 1.000
4,096 bohrium 10 0.048 0.001 0.046 0.047 0.047 0.049 0.050 0.048
16,384 jax 10,000 0.003 0.000 0.002 0.003 0.003 0.003 0.013 2.728
16,384 numba 1,000 0.004 0.000 0.004 0.004 0.004 0.004 0.009 1.773
16,384 numpy 1,000 0.007 0.000 0.007 0.007 0.007 0.007 0.011 1.000
16,384 bohrium 100 0.049 0.001 0.048 0.048 0.048 0.049 0.052 0.148
65,536 jax 1,000 0.010 0.000 0.009 0.010 0.010 0.010 0.013 2.669
65,536 numba 1,000 0.013 0.000 0.013 0.013 0.013 0.013 0.018 2.009
65,536 numpy 1,000 0.026 0.001 0.026 0.026 0.026 0.026 0.031 1.000
65,536 bohrium 100 0.056 0.001 0.054 0.055 0.055 0.057 0.059 0.465
262,144 numba 100 0.046 0.002 0.042 0.043 0.047 0.047 0.050 2.585
262,144 jax 100 0.051 0.002 0.044 0.051 0.051 0.052 0.054 2.319
262,144 bohrium 10 0.085 0.003 0.079 0.086 0.086 0.087 0.089 1.385
262,144 numpy 100 0.118 0.009 0.099 0.116 0.123 0.124 0.130 1.000
1,048,576 numba 100 0.178 0.004 0.175 0.176 0.177 0.178 0.195 3.026
1,048,576 bohrium 100 0.198 0.005 0.193 0.194 0.197 0.203 0.213 2.721
1,048,576 jax 100 0.250 0.002 0.247 0.249 0.250 0.251 0.261 2.154
1,048,576 numpy 10 0.539 0.006 0.535 0.537 0.537 0.538 0.557 1.000
4,194,304 bohrium 10 0.643 0.006 0.633 0.639 0.645 0.646 0.652 3.194
4,194,304 numba 10 0.683 0.007 0.675 0.677 0.681 0.690 0.693 3.005
4,194,304 jax 10 1.155 0.009 1.145 1.148 1.151 1.162 1.172 1.778
4,194,304 numpy 10 2.053 0.018 2.032 2.041 2.046 2.062 2.095 1.000
(time in wall seconds, less is better)
$ taskset -c 23 python run.py benchmarks/turbulent_kinetic_energy/ -s 16777216
benchmarks.turbulent_kinetic_energy
===================================
Running on CPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
16,777,216 bohrium 10 2.386 0.011 2.370 2.381 2.384 2.386 2.410 4.315
16,777,216 numba 10 2.629 0.032 2.598 2.607 2.615 2.641 2.710 3.917
16,777,216 jax 10 4.397 0.016 4.379 4.388 4.391 4.399 4.436 2.342
16,777,216 numpy 10 10.297 0.092 10.217 10.234 10.274 10.280 10.476 1.000
(time in wall seconds, less is better)
$ for backend in bohrium jax; do CUDA_VISIBLE_DEVICES="0" python run.py benchmarks/turbulent_kinetic_energy/ --gpu -b $backend -b numpy; done
benchmarks.turbulent_kinetic_energy
===================================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 numpy 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.012 1.000
4,096 bohrium 100 0.048 0.002 0.047 0.048 0.048 0.048 0.061 0.043
16,384 numpy 1,000 0.007 0.000 0.007 0.007 0.007 0.007 0.009 1.000
16,384 bohrium 100 0.048 0.002 0.047 0.048 0.048 0.049 0.061 0.145
65,536 numpy 100 0.026 0.001 0.025 0.026 0.026 0.026 0.032 1.000
65,536 bohrium 100 0.049 0.002 0.048 0.049 0.049 0.050 0.061 0.529
262,144 bohrium 100 0.053 0.006 0.049 0.049 0.052 0.055 0.082 1.998
262,144 numpy 100 0.106 0.005 0.099 0.101 0.103 0.111 0.128 1.000
1,048,576 bohrium 10 0.064 0.003 0.054 0.065 0.065 0.066 0.067 8.519
1,048,576 numpy 10 0.548 0.014 0.532 0.536 0.546 0.555 0.578 1.000
4,194,304 bohrium 100 0.091 0.010 0.082 0.083 0.091 0.094 0.137 23.098
4,194,304 numpy 10 2.099 0.094 2.029 2.038 2.080 2.099 2.363 1.000
(time in wall seconds, less is better)
benchmarks.turbulent_kinetic_energy
===================================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 jax 1,000 0.002 0.000 0.001 0.002 0.002 0.002 0.008 1.303
4,096 numpy 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.010 1.000
16,384 jax 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.010 4.141
16,384 numpy 1,000 0.007 0.000 0.007 0.007 0.007 0.008 0.013 1.000
65,536 jax 1,000 0.002 0.000 0.002 0.002 0.002 0.002 0.008 12.552
65,536 numpy 100 0.029 0.003 0.025 0.026 0.029 0.032 0.035 1.000
262,144 jax 1,000 0.004 0.001 0.004 0.004 0.004 0.004 0.009 31.513
262,144 numpy 100 0.126 0.008 0.106 0.120 0.126 0.129 0.158 1.000
1,048,576 jax 1,000 0.012 0.000 0.012 0.012 0.012 0.012 0.018 44.869
1,048,576 numpy 10 0.550 0.008 0.543 0.544 0.547 0.554 0.570 1.000
4,194,304 jax 100 0.047 0.000 0.047 0.047 0.047 0.047 0.049 44.246
4,194,304 numpy 10 2.090 0.034 2.061 2.064 2.075 2.099 2.156 1.000
(time in wall seconds, less is better)