Next: Testing the executable
Up: Installation
Previous: Building from source
Contents
Building and using the parallel version of QS requires some familiarity
both with some details of your computing environment and with using
QS via a script (and not in its automatic mode). I suspect that
non-adventurous first-time users would prefer to skip this section
and use the uniprocessor version of the program (in its automatic mode).
Version 1.3 of QS contains an implementation of a fine-grained
parallelisation of the program based on MPI. Because this is a new feature
and because you will have to build QS from source anyway, I have kept everything related to
parallelisation in this section. The rest of the documentation should apply
identically to the parallel version with the exception of the command line
(or script) needed to run a parallel program on your cluster. This section
also contains results from some tests that I performed on two Beowulf clusters.
To cut a long story short (and to save you from possibly unnecessary efforts) :
- You need a high-speed interconnect (Gigabit ethernet or better). You
may get an acceptable scale-up even from a cluster based on a
fast (100 Mbps) ethernet, but only for problems for which the target
structure belongs to a high symmetry spacegroup.
- Even with a Gigabit ethernet interconnect the scale-up will be far
from ideal for low symmetry problems.
- For most problems the optimum number of processors will be around 4 to 8
per minimisation (and so, you should not expect to submit a QS job to your
brand-new 256-node cluster with something like mpirun -np 256).
The bright side of these findings is :
- Finding 1 has no bright side.
- The bright side of finding 2 is that it is exactly these high-symmetry
problems that make the uniprocessor version of QS unbearably slow.
- Finding 3 is not as bad as it sounds : even with 4 processors per minimisation
(and assuming that you want to perform 5 minimisations) you can easily
keep 20 processors busy for a week or two.
Building it : I have tried to keep the
parallelisation transparent with respect to the uniprocessor version of the
program, which means that in the absence of a suitable define (passed to
the compiler) there should be no difference from the uniprocessor
version. Compiling and linking the parallel version of QS depends on your
cluster set-up, but ideally (and if you have a properly set-up LAM or MPICH
implementation) you could simply say something like :
mpicc -DMPI -o Qs_MPI Qs.c -lsrfftw -lsfftw -lm
(which assumes that mpicc already knows which compiler and optimisation flags are
best for your cluster). The line shown above also implies that you already
have the FFTW libraries ready to go (note that you do not need a
parallel version of the FFTW libraries). Once you have an executable you can
quickly test it using the files in the example/ directory of the
distribution with something in the spirit of :
mpirun -np 4 Qs_MPI example.in
If all goes as planned, just copy the parallel version of the program
somewhere in the users' path with a suitable name and you're done
(installation-wise, the rest of the steps are problem-specific).
Determination of 'best' number of processors : The best number of
processors for any given problem depends mainly on two parameters : the
number of unique reflections and the number of crystallographic symmetry
operators (for the given space group). In general, the program will perform
better as the number of unique reflections decreases and the symmetry
increases. To get an idea of what to expect for your problem, I wasted some
CPU time on a dual Athlon Beowulf with a Gigabit Ethernet to run a few tests
with problems of different sizes. In all cases, I used short runs (200,000
steps each) and I assumed that the machines were otherwise idle. The results
are shown below (wallclock times in seconds) :
P422, 4692 refl.
Proc   Wall   Scale-up
  1    4909    1.0000
  2    2427    1.9636
  3    1850    2.6535
  4    1474    3.3303
  5    1236    3.9716
  6    1069    4.5921
  7     959    5.1188
  8     867    5.6620
  9     792    6.1982
 10     743    6.6069
 12     661    7.4266
 14     598    8.2090
 16     558    8.7974
 18     526    9.3326
 24     469   10.4669
 40     399   12.3032

P422, 9383 refl.
Proc   Wall   Scale-up
  1    8421    1.0000
  2    4733    1.7792
  3    3464    2.4310
  4    2723    3.0925
  5    2315    3.6375
  6    2021    4.1667
  7    1792    4.6992
  8    1644    5.1222
  9    1516    5.5547
 10    1466    5.7442
 12    1553    5.4224

P422, 18698 refl.
Proc   Wall   Scale-up
  1   15396    1.0000
  2    8888    1.7322
  3    6311    2.4395
  4    4967    3.0996
  5    4297    3.5829
  6    3696    4.1655
  7    3466    4.4420
  8    3418    4.5043
  9    3625    4.2471
 10    3656    4.2111

P21, 9383 refl.
Proc   Wall   Scale-up
  1    2386    1.0000
  2    1510    1.5801
  3    1218    1.9589
  4    1039    2.2964
  5     922    2.5878
  6     844    2.8270
  7    2330    1.0240
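The scale-up column is simply the one-processor wallclock time divided by the
wallclock time on p processors; dividing that ratio by p gives the parallel
efficiency (not shown in the tables), which makes the diminishing returns
easier to compare. A minimal sketch, using the 2-processor entry of the
P422, 9383 refl. table :

```shell
# Scale-up and efficiency from two wallclock measurements
# (numbers taken from the P422, 9383 refl. table above).
awk 'BEGIN { t1 = 8421; tp = 4733; p = 2;
             printf "scale-up %.4f, efficiency %.2f\n", t1 / tp, t1 / (tp * p) }'
```

which prints "scale-up 1.7792, efficiency 0.89".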
To determine the best number of processors for your problem proceed as
follows :
- Run the program on a stand-alone machine using the automatic mode and
stop it after it starts the actual minimisation (as described in the
next sections of this document). The program will create a file named Qs_auto.in
in its current directory.
- Edit the file Qs_auto.in and change the lines saying
CYCLES 5
STEPS 10000000
to
CYCLES 1
STEPS 200000
(the actual number of steps you will find in Qs_auto.in depends on the number of
molecules per asymmetric unit and may be different from the one shown above).
- Copy the files Qs_auto.in, data.hkl and model.pdb (or model1.pdb,
model2.pdb, etc) to suitably named directories (like 1proc/, 2proc/, etc).
The idea is that you will be successively submitting otherwise identical QS jobs to
an increasing number of processors.
- After the jobs finish, find the wallclock time recorded for each of the runs. This
is written out by the program after each minimisation finishes (do a tail -50 <log file>
and you should see it). Error messages of the type "FFTW failed to read the wisdom
file ..." can safely be ignored.
- Decide how many processors per minimisation you will use.
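The steps above can be sketched as a small shell script. Everything below is
an illustration, not part of QS : the here-document stands in for the
Qs_auto.in that the program writes (only the lines we edit are shown), the
processor counts are arbitrary, and in real life you would also copy data.hkl
and model.pdb into each directory before running.

```shell
#!/bin/sh
# Stand-in for the Qs_auto.in written by QS (only the lines we edit).
cat > Qs_auto.in <<'EOF'
CYCLES 5
STEPS 10000000
SEED 147579
EOF

for p in 1 2 4 8
do
    dir=${p}proc
    mkdir -p "$dir"
    cp Qs_auto.in "$dir"/                 # plus data.hkl and model.pdb
    # Shorten the run as described in the steps above.
    sed -e 's/^CYCLES .*/CYCLES 1/' \
        -e 's/^STEPS .*/STEPS 200000/' "$dir/Qs_auto.in" > "$dir/tmp"
    mv "$dir/tmp" "$dir/Qs_auto.in"
    echo "to run : (cd $dir ; mpirun -np $p Qs_MPI Qs_auto.in > log)"
done
# Afterwards, something like  tail -50 1proc/log  will show the wallclock time.
```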
Production runs : This is easy :
- Prepare directories like minim1/, minim2/, ... minim5/ and
copy the Qs_auto.in, data.hkl and model.pdb (or model1.pdb,
model2.pdb, etc) files in them.
- Edit the Qs_auto.in file, change the number of steps to its original value (say,
10,000,000), keep the number of cycles at 1, find the line saying
'SEED 147579' and change it to a different (integer) number. The values for SEED
should be different for each of your minimisations, otherwise you will be
performing a large number of identical minimisations.
- Submit the jobs using the already determined 'best number of processors'.
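The production set-up can be sketched the same way. Again purely
illustrative : the here-document stands in for your Qs_auto.in, the seeds are
just base-plus-offset (any five distinct integers will do), and in reality
data.hkl and the model file(s) must also be copied into each directory.

```shell
#!/bin/sh
# Stand-in for Qs_auto.in (only the lines we edit).
cat > Qs_auto.in <<'EOF'
CYCLES 1
STEPS 200000
SEED 147579
EOF

i=0
for dir in minim1 minim2 minim3 minim4 minim5
do
    i=$((i + 1))
    mkdir -p "$dir"
    cp Qs_auto.in "$dir"/                 # plus data.hkl and model.pdb
    # Full-length run, and a different seed for every minimisation.
    sed -e 's/^STEPS .*/STEPS 10000000/' \
        -e "s/^SEED .*/SEED $((147579 + i))/" "$dir/Qs_auto.in" > "$dir/tmp"
    mv "$dir/tmp" "$dir/Qs_auto.in"
done
```

Each minim*/ directory now holds an input file with the full number of steps
and its own seed, ready to be submitted with the already determined number of
processors.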
NMG, January 2005