
What if automation fails?

I wouldn't be surprised.

Seriously now. If the fully automatic run fails to give a promising solution and you still have CPU time to waste, there is one thing that I would definitely recommend you try. Because what I recommend is to use the R-factor as a target function, and because the R-factor is a politically incorrect word to use these days in crystallography, this section contains some hand-waving. But the doer's bit first:

Create a clean directory and copy there the files data.hkl, model.pdb (or model1.pdb, etc.), and the file Qs_auto.in (which was produced by QS in the fully automated run). Go to this new directory, edit the file Qs_auto.in, change the line saying TARGET          CORR-1 so that it reads TARGET          R-FACTOR, and multiply the starting temperature (the number shown next to STARTING_TEMP) by 0.75, as in the sketch below.
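For example, the two relevant lines would change as shown here (the starting temperature of 40.0 is purely illustrative; use 0.75 times whatever value your own Qs_auto.in contains):

TARGET          CORR-1
STARTING_TEMP   40.0

becomes

TARGET          R-FACTOR
STARTING_TEMP   30.0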

Save the modified Qs_auto.in file, and try to run QS interactively by typing Qs Qs_auto.in (you should not use the -reso or -auto flags in this case). If the program stops with a message about memory requirements, check that you do have enough memory, and then give Qs -force Qs_auto.in to run it again. Once it reaches the stage of actually starting the first minimisation, stop it with CTRL-C and submit a proper batch job with:

batch
/usr/local/bin/Qs -force Qs_auto.in > LOG
<CTRL-D>

Here comes the hand-waving bit.

The correlation-based targets have repeatedly been shown to perform much better than the R-factor, and are considered to be the next best thing after a maximum-likelihood target. Indeed, the linear correlation coefficient is a beautiful statistic to have and to share, but under certain circumstances it may be cursed with deep false minima that are not present with the good old R-factor. How could that be? It is not uncommon to find macromolecular crystals which give data sets with an uneven distribution of intensity among their various reflection classes (some types of reflections are systematically strong, others weak). Now, any structure that can reproduce this strong-weak pattern will end up having a beautiful value of the linear correlation coefficient. Unfortunately, the great majority of the possible arrangements that can reproduce these intensity patterns are wrong solutions.

Let me give an example: let's say that your crystals contain two molecules per asymmetric unit which are related by a simple translation of 1/2,0,0 along the crystallographic axes. Then, all reflections with h even will be strong, and all reflections with h odd will be weak (and there will be a strong pseudo-origin peak in the native Patterson function at 1/2,0,0). As it happens, all trial structures for which these two molecules have the same orientation (but any orientation) and the correct relative position in the cell (i.e. are related by the 1/2,0,0 translation) will reproduce this strong-weak pattern, and will give you a very nice value of the target function. QS will enter this false minimum and may happily spend the rest of its time in there. The chance that the two molecules will slowly, and in concert, turn around to find their correct orientation is negligible. So, there you have it: you are in a deep, nice-looking minimum, changing either the orientation or the position of either of the two molecules gives a much, much worse value of the target function (because you lose the weak-strong pattern), and the scene is set for QS to waste your time and effort.
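To see where the strong-weak pattern comes from, here is a one-line sketch of the arithmetic (assuming, for simplicity, two identical copies related by the pure translation 1/2,0,0): the structure factor of the pair is

   F(hkl)  =  F_mol(hkl) [ 1 + exp(2 pi i h/2) ]  =  F_mol(hkl) [ 1 + (-1)^h ]

so every reflection with h odd cancels exactly and every reflection with h even is doubled, whatever the common orientation (or absolute position) of the two copies happens to be.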

Enter the R-factor: the R-factor is a precision indicator. It couldn't care less whether all the strong reflections are predicted strong and the weak ones weak. What matters to the R-factor is how precise our individual predictions are. An example will convince you. Suppose that for six reflections we have Fos = {5, 15, 10, 1600, 1000, 2000} and a trial structure that gives Fcs = {10, 2, 1, 3000, 700, 900} (notice that the two data sets are practically on the same scale). The linear correlation coefficient between the two sets is 0.708, but the R-factor stands at a hefty 62.4% (somewhat worse than what we consider "random" for non-centrosymmetric space groups).
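If you want to check this kind of comparison yourself, here is a minimal sketch in plain Python (nothing QS-specific; depending on the exact scaling convention the figures you obtain may differ slightly from those quoted above, but the qualitative picture, a respectable correlation coefficient next to a hopeless R-factor, stays the same):

# The six reflections quoted in the text.
fo = [5, 15, 10, 1600, 1000, 2000]    # "observed" amplitudes
fc = [10, 2, 1, 3000, 700, 900]       # amplitudes from the trial structure

n  = len(fo)
mo = sum(fo) / float(n)
mc = sum(fc) / float(n)

# Linear correlation coefficient between the two sets.
cov   = sum((o - mo) * (c - mc) for o, c in zip(fo, fc))
var_o = sum((o - mo) ** 2 for o in fo)
var_c = sum((c - mc) ** 2 for c in fc)
cc = cov / (var_o * var_c) ** 0.5

# Conventional R-factor, R = sum|Fo - Fc| / sum(Fo), with no rescaling.
r = sum(abs(o - c) for o, c in zip(fo, fc)) / float(sum(fo))

print("Correlation coefficient : %.2f" % cc)
print("R-factor                : %.1f%%" % (100.0 * r))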

One last thing worth mentioning is why nobody else has discussed this before (to my knowledge). I believe the answer is that previous authors never had the chance to see this behaviour: in traditional molecular replacement you never "see" (in your calculations) all the molecules before you are actually finished with molecular replacement. Worse, in traditional molecular replacement calculations it is highly unlikely to have all your molecules placed correctly but oriented incorrectly. Traditional methods, due to their divide-and-conquer approach, miss precisely those combinations of parameters that can even make the R-factor look better than the correlation coefficient. Quoting Aaron Levenstein, "Statistics are like bikinis: what they reveal is suggestive, but what they conceal is vital".


NMG, January 2005