
 Parallel Qs

Building and using the parallel version of QS requires some familiarity both with the details of your computing environment and with using QS via a script (and not in its automatic mode). I suspect that non-adventurous first-time users would prefer to skip this section and use the uniprocessor version of the program (in its automatic mode).

Version 1.3 of QS contains an implementation of a fine-grained parallelisation of the program based on MPI. Because this is a new feature and because you will have to build QS from source anyway, I have kept everything related to parallelisation in this section. The rest of the documentation should apply identically to the parallel version, with the exception of the command line (or script) needed to run a parallel program on your cluster. This section also contains results from some tests that I performed on two Beowulf clusters. To cut a long story short (and to save you from possibly unnecessary effort) :
  1. You need a high-speed interconnect (Gigabit Ethernet or better). You may get an acceptable scale-up even from a cluster based on fast (100 Mbps) Ethernet, but only for problems whose target structure belongs to a high-symmetry spacegroup.
  2. Even with a Gigabit Ethernet interconnect, the scale-up will be far from ideal for low-symmetry problems.
  3. For most problems the optimum number of processors will be around 4 to 8 per minimisation (so you should not expect to submit a QS job to your brand-new 256-node cluster with something like mpirun -np 256).

The bright side of these findings is :

Building it : I have tried to keep the parallelisation transparent with respect to the uniprocessor version of the program, which means that in the absence of a suitable define (passed to the compiler) there should be no difference from the uniprocessor version. Compiling and linking the parallel version of QS depends on your cluster set-up, but ideally (and if you have a properly set-up LAM or MPICH implementation) you could simply say something like :

mpicc -DMPI -o Qs_MPI Qs.c -lsrfftw -lsfftw -lm

(which assumes that mpicc already knows which compiler and optimisation flags are best for your cluster). The line shown above also implies that you already have the FFTW libraries ready to go (note that you do not need a parallel version of the FFTW libraries). Once you have an executable, you can quickly test it using the files in the example/ directory of the distribution with something in the spirit of :

mpirun -np 4 Qs_MPI example.in

If all goes as planned, just copy the parallel version of the program somewhere in the users' path with a suitable name and you're done (installation-wise; the rest of the steps are problem-specific).
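
For the curious, here is a minimal sketch of what the compile-time guard implied by the -DMPI flag typically looks like. This is not the actual contents of Qs.c (whose internals will of course differ); it only illustrates the idea that without -DMPI the MPI calls vanish and the very same source file builds as a plain uniprocessor program :

#include <stdio.h>

#ifdef MPI
#include <mpi.h>
#endif

int main(int argc, char **argv)
{
    int rank = 0, nprocs = 1;

#ifdef MPI
    MPI_Init(&argc, &argv);                 /* start the MPI run-time          */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process is this ?         */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* how many processes in the run ? */
#endif

    /* ... the real work would go here, with each process taking
       its own share of the calculation ...                        */
    printf("Process %d of %d reporting.\n", rank, nprocs);

#ifdef MPI
    MPI_Finalize();                         /* shut the MPI run-time down      */
#endif
    return 0;
}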

Determination of 'best' number of processors : The best number of processors for any given problem depends mainly on two parameters : the number of unique reflections and the number of crystallographic symmetry operators (of the given space group). In general, the program will perform better as the number of unique reflections decreases and the symmetry increases. To get an idea of what to expect for your problem, I wasted some CPU time on a dual-Athlon Beowulf cluster with a Gigabit Ethernet interconnect to run a few tests with problems of different sizes. In all cases, I used short runs (200,000 steps each) and I assumed that the machines were otherwise idle. The results are shown below (wallclock times in seconds) :

   P422, 4692 refl.     P422, 9383 refl.     P422, 18698 refl.     P21, 9383 refl.
  
  Proc Wall Scale-up   Proc Wall Scale-up   Proc Wall Scale-up   Proc Wall Scale-up
   1   4909  1.0000     1   8421  1.0000     1  15396  1.0000     1   2386  1.0000
   2   2427  1.9636     2   4733  1.7792     2   8888  1.7322     2   1510  1.5801
   3   1850  2.6535     3   3464  2.4310     3   6311  2.4395     3   1218  1.9589
   4   1474  3.3303     4   2723  3.0925     4   4967  3.0996     4   1039  2.2964
   5   1236  3.9716     5   2315  3.6375     5   4297  3.5829     5    922  2.5878
   6   1069  4.5921     6   2021  4.1667     6   3696  4.1655     6    844  2.8270
   7    959  5.1188     7   1792  4.6992     7   3466  4.4420     7   2330  1.0240
   8    867  5.6620     8   1644  5.1222     8   3418  4.5043
   9    792  6.1982     9   1516  5.5547     9   3625  4.2471
  10    743  6.6069    10   1466  5.7442    10   3656  4.2111       
  12    661  7.4266    12   1553  5.4224
  14    598  8.2090
  16    558  8.7974
  18    526  9.3326
  24    469 10.4669 
  40    399 12.3032
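
In case the arithmetic behind the 'Scale-up' columns is not obvious : the scale-up is simply the single-processor wallclock time divided by the N-processor wallclock time, and the corresponding parallel efficiency is the scale-up divided by N. The following stand-alone snippet (not part of the QS distribution, just a worked example) reproduces the 4-processor entry of the 9383-reflection P422 test case :

#include <stdio.h>

int main(void)
{
    double t1 = 8421.0;      /* wallclock seconds on 1 processor (P422, 9383 refl.) */
    double tn = 2723.0;      /* wallclock seconds on 4 processors                   */
    int    n  = 4;

    double scaleup    = t1 / tn;               /* 3.0925, as in the table */
    double efficiency = scaleup / (double) n;  /* about 0.77, i.e. 77 %   */

    printf("Scale-up   : %.4f\n", scaleup);
    printf("Efficiency : %.0f %%\n", 100.0 * efficiency);
    return 0;
}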

To determine the best number of processors for your problem, proceed as follows :

Production runs : This is easy :


NMG, January 2005