Next: Testing the executable
Up: Installation
Previous: Building from source
Contents
Building and using the parallel version of QS requires some familiarity
both with some details of your computing environment and with using
QS via a script (and not in its automatic mode). I suspect that
non-adventurous first-time users would prefer to skip this section
and use the uniprocessor version of the program (in its automatic mode).
Version 1.3 of QS contains an implementation of a fine-grained
parallelisation of the program based on MPI. Because this is a new feature
and because you will have to build QS from source anyway, I have kept everything related to
parallelisation in this section. The rest of the documentation should apply
identically to the parallel version with the exception of the command line
(or script) needed to run a parallel program on your cluster. This section
also contains results from some tests that I performed on two Beowulf clusters.
To cut a long story short (and to save you from possibly unnecessary efforts) :
- You need a high-speed interconnect (Gigabit ethernet or better). You
may get an acceptable scale-up even from a cluster based on a
fast (100 Mbps) ethernet, but only for problems for which the target
structure belongs to a high symmetry spacegroup.
- Even with a Gigabit ethernet interconnect the scale-up will be far
from ideal for low symmetry problems.
- For most problems the optimum number of processors will be around 4 to 8
per minimisation (and so, you should not expect to submit a QS job to your
brand-new 256-node cluster with something like mpirun -np 256).
The bright side of these findings is :
- Finding 1 has no bright side.
- The bright side of finding 2 is that it is exactly these high-symmetry
problems that make the uniprocessor version of QS unbearably slow.
- Finding 3 is not as bad as it sounds : even with 4 processors per minimisation
(and assuming that you want to perform 5 minimisations) you can easily
keep 20 processors busy for a week or two.
Building it : I have tried to keep the
parallelisation transparent with respect to the uniprocessor version of the
program, which means that in the absence of a suitable define (passed to
the compiler) there should be no difference from the uniprocessor
version. Compiling and linking the parallel version of QS depends on your
cluster set-up, but ideally (and if you have a properly set-up LAM or MPICH
implementation) you could simply say something like :
mpicc -DMPI -o Qs_MPI Qs.c -lsrfftw -lsfftw -lm
(which assumes that mpicc already knows which compiler and optimisation flags are
best for your cluster). The line shown above also implies that you already
have the FFTW libraries ready to go (note that you do not need a
parallel version of the FFTW libraries). Once you have an executable you can
quickly test it using the files in the example/ directory of the
distribution with something in the spirit of :
mpirun -np 4 Qs_MPI example.in
If all goes as planned, just copy the parallel version of the program
somewhere in the users' path with a suitable name and you're done
(installation-wise, the rest of the steps are problem-specific).
Determination of 'best' number of processors : The best number of
processors for any given problem depends mainly on two parameters : the
number of unique reflections and the number of crystallographic symmetry
operators (for the given space group). In general, the program will perform
better as the number of unique reflections decreases and the symmetry
increases. To get an idea of what to expect for your problem, I wasted some
CPU time on a dual Athlon Beowulf with a Gigabit Ethernet to run a few tests
with problems of different sizes. In all cases, I used short runs (200,000
steps each) and I assumed that the machines were otherwise idle. The results
are shown below (wallclock times in seconds) :
P422, 4692 refl.
Proc   Wall   Scale-up
  1    4909    1.0000
  2    2427    1.9636
  3    1850    2.6535
  4    1474    3.3303
  5    1236    3.9716
  6    1069    4.5921
  7     959    5.1188
  8     867    5.6620
  9     792    6.1982
 10     743    6.6069
 12     661    7.4266
 14     598    8.2090
 16     558    8.7974
 18     526    9.3326
 24     469   10.4669
 40     399   12.3032

P422, 9383 refl.
Proc   Wall   Scale-up
  1    8421    1.0000
  2    4733    1.7792
  3    3464    2.4310
  4    2723    3.0925
  5    2315    3.6375
  6    2021    4.1667
  7    1792    4.6992
  8    1644    5.1222
  9    1516    5.5547
 10    1466    5.7442
 12    1553    5.4224

P422, 18698 refl.
Proc   Wall   Scale-up
  1   15396    1.0000
  2    8888    1.7322
  3    6311    2.4395
  4    4967    3.0996
  5    4297    3.5829
  6    3696    4.1655
  7    3466    4.4420
  8    3418    4.5043
  9    3625    4.2471
 10    3656    4.2111

P21, 9383 refl.
Proc   Wall   Scale-up
  1    2386    1.0000
  2    1510    1.5801
  3    1218    1.9589
  4    1039    2.2964
  5     922    2.5878
  6     844    2.8270
  7    2330    1.0240
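The scale-up column is simply the one-processor wallclock time divided by the
wallclock time on p processors; dividing that ratio by p gives the parallel
efficiency (not shown in the tables), which makes the diminishing returns
easier to compare. A minimal sketch, using the 2-processor entry of the
P422, 9383 refl. table :

```shell
# Scale-up and efficiency from two wallclock measurements
# (numbers taken from the P422, 9383 refl. table above).
awk 'BEGIN { t1 = 8421; tp = 4733; p = 2;
             printf "scale-up %.4f, efficiency %.2f\n", t1 / tp, t1 / (tp * p) }'
```

which prints "scale-up 1.7792, efficiency 0.89".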
To determine the best number of processors for your problem proceed as
follows :
- Run the program on a stand-alone machine using the automatic mode and
stop it after it starts the actual minimisation (as described in the
next sections of this document). The program will create a file named Qs_auto.in
in its current directory.
- Edit the file Qs_auto.in and change the lines saying
CYCLES 5
STEPS 10000000
to
CYCLES 1
STEPS 200000
(the actual number of steps you will find in Qs_auto.in depends on the number of
molecules per asymmetric unit and may be different from the one shown above).
- Copy the files Qs_auto.in, data.hkl and model.pdb (or model1.pdb,
model2.pdb, etc) to suitably named directories (like 1proc/, 2proc/, etc).
The idea is that you will be successively submitting otherwise identical QS jobs to
an increasing number of processors.
- After the jobs finish, find the wallclock time recorded for each of the runs. This
is written out by the program after each minimisation finishes (do a tail -50 <log file>
and you should see it). Error messages of the type "FFTW failed to read the wisdom
file ..." can safely be ignored.
- Decide how many processors per minimisation you will use.
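The steps above can be sketched as a small shell script. Everything below is
an illustration, not part of QS : the here-document stands in for the
Qs_auto.in that the program writes (only the lines we edit are shown), the
processor counts are arbitrary, and in real life you would also copy data.hkl
and model.pdb into each directory before running.

```shell
#!/bin/sh
# Stand-in for the Qs_auto.in written by QS (only the lines we edit).
cat > Qs_auto.in <<'EOF'
CYCLES 5
STEPS 10000000
SEED 147579
EOF

for p in 1 2 4 8
do
    dir=${p}proc
    mkdir -p "$dir"
    cp Qs_auto.in "$dir"/                 # plus data.hkl and model.pdb
    # Shorten the run as described in the steps above.
    sed -e 's/^CYCLES .*/CYCLES 1/' \
        -e 's/^STEPS .*/STEPS 200000/' "$dir/Qs_auto.in" > "$dir/tmp"
    mv "$dir/tmp" "$dir/Qs_auto.in"
    echo "to run : (cd $dir ; mpirun -np $p Qs_MPI Qs_auto.in > log)"
done
# Afterwards, something like  tail -50 1proc/log  will show the wallclock time.
```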
Production runs : This is easy :
- Prepare directories like minim1/, minim2/, ... minim5/ and
copy the Qs_auto.in, data.hkl and model.pdb (or model1.pdb,
model2.pdb, etc) files in them.
- Edit the Qs_auto.in file, change the number of steps to its original value (say,
10,000,000), keep the number of cycles at 1, find the line saying
'SEED 147579' and change it to a different (integer) number. The values for SEED
should be different for each of your minimisations, otherwise you will be
performing a large number of identical minimisations.
- Submit the jobs using the already determined 'best number of processors'.
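The production set-up can be sketched the same way. Again purely
illustrative : the here-document stands in for your Qs_auto.in, the seeds are
just base-plus-offset (any five distinct integers will do), and in reality
data.hkl and the model file(s) must also be copied into each directory.

```shell
#!/bin/sh
# Stand-in for Qs_auto.in (only the lines we edit).
cat > Qs_auto.in <<'EOF'
CYCLES 1
STEPS 200000
SEED 147579
EOF

i=0
for dir in minim1 minim2 minim3 minim4 minim5
do
    i=$((i + 1))
    mkdir -p "$dir"
    cp Qs_auto.in "$dir"/                 # plus data.hkl and model.pdb
    # Full-length run, and a different seed for every minimisation.
    sed -e 's/^STEPS .*/STEPS 10000000/' \
        -e "s/^SEED .*/SEED $((147579 + i))/" "$dir/Qs_auto.in" > "$dir/tmp"
    mv "$dir/tmp" "$dir/Qs_auto.in"
done
```

Each minim*/ directory now holds an input file with the full number of steps
and its own seed, ready to be submitted with the already determined number of
processors.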
NMG, January 2005