To quote Dario's offline response:
[quote]Here is the output file with the timings on. As you can see, blip evaluations are completely negligible; all the time is going in EVAL_DBAR (2/3 of it) and GET_ACCUM_ENERGIES (the remaining 1/3). Jastrow is also negligible.[/quote]
So there's part of your answer. Your calculation is spending most of its time in the eval_dbar routine, that is, it's evaluating the cofactor matrices directly from the Slater matrix. Now - because this is very slow and scales badly with system size, CASINO is actually only supposed to do this at the start of the calculation and very occasionally thereafter (every 100,000 moves being the default). After all other moves, it uses the much cheaper and better-scaling (though theoretically unstable) Sherman-Morrison-Woodbury formula.
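For anyone following along: the trick here is generic linear algebra rather than anything CASINO-specific. For a rank-1 update A' = A + u v^T of a matrix whose inverse you already have,

[code]
\det(A') = \det(A)\,\bigl(1 + v^{T} A^{-1} u\bigr),
\qquad
(A')^{-1} = A^{-1} - \frac{A^{-1} u\, v^{T} A^{-1}}{1 + v^{T} A^{-1} u}.
[/code]

Moving a single electron changes a single column of the Slater matrix, which is exactly such a rank-1 update, so each move costs O(N^2) to update the inverse - and only O(N) to get the determinant ratio for the acceptance probability - instead of the O(N^3) of a fresh LU decomposition. The price is that rounding errors accumulate over repeated updates, which is why the cofactor matrices get recomputed from scratch every so often.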
Now you're not even doing 100,000 moves, so why does doing an LU decomposition to calculate a determinant once take so long in your case? Well, according to the output file you sent me, one reason could be that you're telling it to do this every 10 moves rather than every 100,000, by setting the obscure dbarrc keyword in input.
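To make that concrete, the offending fragment of the input file presumably looks something like this (the keyword names are real CASINO ones; the values here are made up for illustration):

[code]
neu    : 64      # number of up-spin electrons (illustrative value)
ned    : 64      # number of down-spin electrons (illustrative value)
dbarrc : 10      # recompute cofactor matrices from scratch every 10 moves
[/code]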
Why are you doing this, given that no current CASINO example does such a thing and, as far as I know, none has for over a decade? Answer: because despite my telling you at least 300 times over the years to use modern input files appropriate to the version of the code you are using (and to use the nice error-detecting runqmc script, etc. etc.), you insist on recycling the same input file you've been using since about 1999, which just gets its number of electrons changed at the top.

For very early versions of CASINO - which I presume is where you got this from - dbarrc had a different meaning, and the value you're using was more appropriate back then. This is one of the principal reasons why last week I felt compelled to annoy Neil - which I hate doing - by deprecating his favourite keywords (see http://www.vallico.net/casino-forum/vie ... p?f=4&t=60), all in an effort to make you switch to using modern input files! It's really not difficult - go on, you know you can do it.
Anyway, so if CASINO is (unnecessarily) spending 2/3 of its time doing LU decompositions, why is using complex numbers taking so much longer? Well, I don't want to do a detailed analysis, but we can start by saying that in single precision (you're using sp_blips=T) complex multiplication can be up to seven times slower than real multiplication, that the complex stuff seems to require some extra matrix copying in CASINO itself, and that more generally zgetrf (the complex LAPACK factorisation routine) runs slower than dgetrf (the real one), which could be for multiple reasons.
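If you want to see the real-vs-complex LAPACK gap for yourself without involving CASINO at all, something like this quick scipy sketch will do it (the matrix size and repeat count are arbitrary choices of mine; swap in sgetrf/cgetrf for the single-precision case):

[code]
# Rough timing of real vs complex LU factorisation via scipy's LAPACK wrappers.
import time
import numpy as np
from scipy.linalg import lapack

N, reps = 500, 50
a_real = np.random.rand(N, N)
a_cplx = a_real + 1j * np.random.rand(N, N)

def bench(getrf, a):
    t0 = time.perf_counter()
    for _ in range(reps):
        lu, piv, info = getrf(a)  # LU factorisation; det(A) is the product of
        assert info == 0          # lu's diagonal entries, up to pivot signs
    return time.perf_counter() - t0

t_d = bench(lapack.dgetrf, a_real)  # real
t_z = bench(lapack.zgetrf, a_cplx)  # complex
print(f"dgetrf: {t_d:.3f} s   zgetrf: {t_z:.3f} s   ratio: {t_z/t_d:.2f}")
[/code]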
So rather than go into all that - just delete the dbarrc line from your input file and rerun your tests. How does that affect the timings?
The get_accum_energies stuff (summing the energies over the cores for averaging) is the DMC rate-limiting step in the case where the cores don't have enough work to do in moving the configs. In a full production run (which this isn't) you might consider using considerably more configs per node, and you'll probably find get_accum_energies becomes a much smaller fraction of the total time. Still, I wouldn't expect it to take 30% of the time. I wonder if there's some other reason for that? Hmm... Well, we can continue when we have the new timings.
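For what it's worth, here's the shape of what a get_accum_energies-type step amounts to - a schematic mpi4py sketch, not CASINO's actual code, with nconfig_per_core an illustrative number I made up. The point is that the cost of the collective depends on the core count and not on how many configs each core moves, so giving each core more work shrinks its share of the total:

[code]
# Schematic per-step energy accumulation across MPI ranks (not CASINO code).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
nconfig_per_core = 10  # illustrative; more configs per core = more local work

# Stand-in for the local energies of this rank's configs after a DMC move.
local_energies = np.random.rand(nconfig_per_core)

# Sum (energy, config-count) pairs over all cores in one collective call.
local_acc = np.array([local_energies.sum(), float(nconfig_per_core)])
global_acc = np.empty(2)
comm.Allreduce(local_acc, global_acc, op=MPI.SUM)

if comm.rank == 0:
    print("step mean energy:", global_acc[0] / global_acc[1])
[/code]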
M.