Commit
Built site for gh-pages
Quarto GHA Workflow Runner committed Apr 30, 2024
1 parent e66b038 commit fcac7bd
Showing 44 changed files with 4,978 additions and 1,134 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
0e655726
1abc1ad1
Binary file added _tex/figures/cost_vs_cpus.png
Binary file modified _tex/figures/csdm_efficency.png
Binary file modified _tex/figures/csdm_speedup.png
Binary file added _tex/figures/efficency_vs_cpus.png
Binary file modified _tex/figures/fwdtim_efficency.png
Binary file modified _tex/figures/fwdtim_speedup.png
128 changes: 75 additions & 53 deletions _tex/index.tex
@@ -189,8 +189,30 @@ \section{Abstract}\label{abstract}

\section{Introduction}\label{introduction}

Dipy is a popular open-source Python library used for the analysis of
diffusion imaging data. It provides tools for preprocessing,
reconstruction, and analysis of MRI data. Here we focused on three
reconstruction models included in Dipy (XXX need to finish testing of
sfm model): constrained spherical deconvolution, the free water
diffusion tensor model, and the sparse fascicle model. These
reconstruction models, along with several others not tested here, are
good candidates for parallel computing, because they are independent at
the voxel level. While in theory parallelizing these workloads should be
a fairly simple task, Python's GIL (global interpreter lock) can make it
more difficult in practice. To work around the GIL we used the Ray
library, a system for parallelizing Python code
(https://arxiv.org/abs/1712.05889). In preliminary testing we evaluated
three libraries for this task, Joblib, Dask, and Ray; Ray quickly proved
to be the most performant, user-friendly, and reliable option of the
three. Ray's approach to serialization, the process of converting Python
objects into a format that can be stored and transmitted between
processes (XXX improve definition of serialization), also proved to be
the least error-prone for our use case.

(XXX this was written as a word dump; some of it might need to be moved
to the discussion or methods.)
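
As a minimal sketch of the pattern (not the exact code used here; the
generic \texttt{model} object and the chunk count are illustrative
assumptions), a voxel-wise model fit can be dispatched to Ray roughly as
follows:

\begin{verbatim}
import numpy as np
import ray

ray.init()

@ray.remote
def fit_chunk(model, data_chunk):
    # Each chunk of voxels is fit independently of every other chunk.
    return model.fit(data_chunk)

def parallel_fit(model, data, n_chunks):
    # Flatten the spatial dimensions so the 4D volume becomes a
    # (n_voxels, n_directions) array that can be split evenly.
    voxels = data.reshape(-1, data.shape[-1])
    chunks = np.array_split(voxels, n_chunks)
    # ray.put stores the model once in the shared object store instead
    # of serializing it separately for every task.
    model_ref = ray.put(model)
    futures = [fit_chunk.remote(model_ref, chunk) for chunk in chunks]
    # ray.get blocks until every chunk has been fit.
    return ray.get(futures)
\end{verbatim}

Because the chunks share no state, Ray can schedule them across all
available CPUs in separate worker processes, sidestepping the GIL.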

\section{Methods}\label{methods}

@@ -200,9 +222,9 @@ \section{Methods}\label{methods}
encapsulate the test and allow for easy reproducibility of the tests.
The testing program computes each model 5 times for each set of unique
parameters. We then iterate across chunk sizes exponentially, from 1-15,
where the number of chunks is $2^{x}$ for exponent $x$; for example,
$x = 4$ splits the data into 16 chunks (XXX explain better). We ran the
tests with the following arguments on docker instances with CPU counts
of 8, 16, 32, 48, and 72:

\begin{verbatim}
--models csdm fwdtim --min_chunks 1 --max_chunks 15 --num_runs 5
@@ -211,67 +233,49 @@
\section{Results}\label{results}

Parallelization with \texttt{ray} provided considerable speedups over
serial execution for both the constrained spherical deconvolution and
free water models. We saw a much greater speedup for the free water
model, which is possibly explained by the fact that it is much more
computationally expensive per voxel. This would mean that the overhead
from parallelizing the model would have a smaller effect on the runtime.
Interestingly, 48 and 72 core instances performed slightly worse than
the 32 core instances on the csdm model, which may indicate that there is
some increased overhead for each core, separate from the overhead for
each task sent to ray.
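
(Here ``speedup'' and ``efficiency'' are assumed to carry their standard
meanings: with $T_{1}$ the serial runtime and $T_{p}$ the runtime on $p$
CPUs,
\[
S(p) = \frac{T_{1}}{T_{p}}, \qquad E(p) = \frac{S(p)}{p},
\]
so that $E(p) = 1$ corresponds to perfect linear scaling.)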

\includegraphics[width=0.8\textwidth,height=0.8\textheight]{figures/csdm_speedup.png}
\includegraphics[width=0.8\textwidth,height=0.8\textheight]{figures/fwdtim_speedup.png}

Efficiency decreases as a function of number of CPUs, but is still
rather high in many configurations. Efficiency is also considerably
higher for the free water tensor model, which is consistent with our
expectations given that it is more computationally expensive per voxel
and therefore ray overhead would have less effect. The high efficiency
of 8 core machines suggests that the most cost-effective configuration
for processing may be relatively cheap low-core machines.
\includegraphics[width=0.8\textwidth,height=0.8\textheight]{figures/csdm_efficency.png}
\includegraphics[width=0.8\textwidth,height=0.8\textheight]{figures/fwdtim_efficency.png}

We can also look at peak efficiency per core (efficiency at the optimal
number of chunks for the given parameters), relative to the number of
cores for both models. What's interesting is that we see a very similar
relationship between both models, with the fwdti model being higher by
almost the same amount for all core counts. This suggests that models
such as fwdti that are more computationally expensive per voxel will see
better speedups, because the overhead of parallelization is lower
relative to the total cost. Interestingly, increasing the core count
does not further increase the benefit of parallelization relative to
overhead, which suggests that ray overhead may scale roughly linearly
with the number of cores.

\includegraphics{figures/efficency_vs_cpus.png}
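
The peak efficiency shown above can be computed from the benchmark
timings along the following lines (a sketch with hypothetical variable
names, not the actual analysis code):

\begin{verbatim}
def peak_efficiency(chunk_runtimes, serial_runtime, n_cpus):
    # chunk_runtimes: {n_chunks: mean_parallel_runtime_in_seconds} for
    # one (model, n_cpus) configuration.
    efficiencies = [
        (serial_runtime / runtime) / n_cpus  # speedup divided by cores
        for runtime in chunk_runtimes.values()
    ]
    # Peak efficiency is the efficiency at the best-performing chunk
    # count for this configuration.
    return max(efficiencies)
\end{verbatim}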

Ray tends to spill a large amount of data to disk and does not clean up
afterward. This can quickly become problematic when running multiple
consecutive models. Within just an hour or two of running, Ray could
easily spill over 500 GB to disk. We have implemented a quick fix for
this within our model as follows:

\begin{Shaded}
\begin{Highlighting}[]
@@ -299,6 +303,24 @@ \section{Results}\label{results}
\end{Highlighting}
\end{Shaded}
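
As an illustrative sketch only, and not necessarily the fix implemented
in our code, cleanup of this kind can shut Ray down between models and
remove the spill directory (this assumes Ray's default
\texttt{/tmp/ray} temporary location, which may differ per machine):

\begin{verbatim}
import shutil
import ray

def run_with_spill_cleanup(fit_fn, *args, spill_dir="/tmp/ray"):
    # fit_fn is any callable that uses Ray internally.
    try:
        return fit_fn(*args)
    finally:
        # Tear down the Ray session so its files are released...
        ray.shutdown()
        # ...then delete whatever Ray spilled to disk during the run.
        shutil.rmtree(spill_dir, ignore_errors=True)
\end{verbatim}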

There seems to be an inverse relationship between the computational
cost per voxel and how much parallelization overhead limits the speedup:
the cheaper the model is per voxel, the sooner the overhead catches up.
This is why the CSD speedup is maximal at 32 cores.

We have also made a rough approximation of the total cost of
computation relative to the number of CPUs. Because all tests were run
on a ``c5.18xlarge'' machine, with the docker container simply limited
in its access to cores, this approximation estimates the cost of using
smaller machines by assuming that the only factor differentiating the
performance of AWS c5 machines is the number of CPUs; this may not hold
for several reasons, such as total memory available, memory bandwidth,
and single-core performance. With this approximation, we see that cost
increases as a function of CPUs. This suggests that using the smallest
machine that still computes in a reasonable amount of time is likely the
best option.

\includegraphics[width=0.8\textwidth,height=0.8\textheight]{figures/cost_vs_cpus.png}
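
As a rough sketch of this calculation (the hourly rate below is a
placeholder rather than a quoted AWS price, and the runtimes are
hypothetical), the estimate assumes cost scales linearly with vCPU
count:

\begin{verbatim}
def estimated_cost(runtime_hours, n_cpus, price_per_vcpu_hour=0.0425):
    # Placeholder rate encoding the assumption that c5 cost scales
    # linearly with the number of vCPUs.
    return runtime_hours * n_cpus * price_per_vcpu_hour

# Hypothetical runtimes, for illustration only:
print(estimated_cost(0.5, 32))  # ~ $0.68
print(estimated_cost(0.3, 72))  # ~ $0.92
\end{verbatim}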

\section{Discussion}\label{discussion}

\subsection{Acknowledgments}\label{acknowledgments}
Binary file added figures/cost_vs_cpus.png
Binary file modified figures/csdm_efficency.png
Binary file modified figures/csdm_speedup.png
Binary file added figures/efficency_vs_cpus.png
Binary file modified figures/fwdtim_efficency.png
Binary file modified figures/fwdtim_speedup.png
496 changes: 0 additions & 496 deletions figures/graphs.ipynb

This file was deleted.

