To quote Dario's offline response:
[quote]Here is the output file with the timings on. As you can see, blip evaluations are completely negligible; all the time is going in EVAL_DBAR (2/3 of it) and GET_ACCUM_ENERGIES (the remaining 1/3). Jastrow is also negligible.[/quote]
So there's part of your answer. Your calculation is spending most of its time in the eval_dbar routine, that is, it's evaluating the cofactor matrices directly from the Slater matrix. Now - because this is very slow and scales badly with system size, CASINO is actually only supposed to do this at the start of the calculation and very occasionally thereafter (every 100,000 moves being the default). After all other moves, it uses the much cheaper and better-scaling (though theoretically unstable) Sherman-Morrison-Woodbury formula.
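For anyone following along: the trick here is generic linear algebra rather than anything CASINO-specific. For a rank-1 update A' = A + u v^T of a matrix whose inverse you already have,

[code]
\det(A') = \det(A)\,\bigl(1 + v^{T} A^{-1} u\bigr),
\qquad
(A')^{-1} = A^{-1} - \frac{A^{-1} u\, v^{T} A^{-1}}{1 + v^{T} A^{-1} u}.
[/code]

Moving a single electron changes a single column of the Slater matrix, which is exactly such a rank-1 update, so each move costs O(N^2) to update the inverse - and only O(N) to get the determinant ratio for the acceptance probability - instead of the O(N^3) of a fresh LU decomposition. The price is that rounding errors accumulate over repeated updates, which is why the cofactor matrices get recomputed from scratch every so often.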
Now you're not even doing 100,000 moves, so why does doing an LU decomposition to calculate a determinant once take so long in your case? Well, according to the output file you sent me, one reason could be that you're telling it to do this every 10 moves rather than every 100,000, by setting the obscure dbarrc keyword in input.
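To make that concrete, the offending fragment of the input file presumably looks something like this (the keyword names are real CASINO ones; the values here are made up for illustration):

[code]
neu    : 64      # number of up-spin electrons (illustrative value)
ned    : 64      # number of down-spin electrons (illustrative value)
dbarrc : 10      # recompute cofactor matrices from scratch every 10 moves
[/code]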
Why are you doing this, given that no current CASINO example does such a thing and, as far as I know, none has for over a decade? Answer: because despite my telling you at least 300 times over the years to use modern input files appropriate to the version of the code you are using (and to use the nice error-detecting runqmc script, etc. etc.), you insist on recycling the same input file you've been using since about 1999, which just gets its number of electrons changed at the top.

For very early versions of CASINO - which I presume is where you got this from - dbarrc had a different meaning, and the value you're using was more appropriate back then. This is one of the principal reasons why last week I felt compelled to annoy Neil - which I hate doing - by deprecating his favourite keywords (see http://www.vallico.net/casino-forum/vie ... p?f=4&t=60), all in an effort to make you switch to using modern input files! It's really not difficult - go on, you know you can do it.
Anyway, so if CASINO is (unnecessarily) spending 2/3 of its time doing LU decompositions, why is using complex numbers taking so much longer? Well, I don't want to do a detailed analysis, but we can start by saying that in single precision (you're using sp_blips=T) complex multiplication can be up to seven times slower than real multiplication, that the complex stuff seems to require some extra matrix copying in CASINO itself, and that more generally zgetrf (the complex LAPACK factorisation routine) runs slower than dgetrf (the real one), which could be for multiple reasons.
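If you want to see the real-vs-complex LAPACK gap for yourself without involving CASINO at all, something like this quick scipy sketch will do it (the matrix size and repeat count are arbitrary choices of mine; swap in sgetrf/cgetrf for the single-precision case):

[code]
# Rough timing of real vs complex LU factorisation via scipy's LAPACK wrappers.
import time
import numpy as np
from scipy.linalg import lapack

N, reps = 500, 50
a_real = np.random.rand(N, N)
a_cplx = a_real + 1j * np.random.rand(N, N)

def bench(getrf, a):
    t0 = time.perf_counter()
    for _ in range(reps):
        lu, piv, info = getrf(a)  # LU factorisation; det(A) is the product of
        assert info == 0          # lu's diagonal entries, up to pivot signs
    return time.perf_counter() - t0

t_d = bench(lapack.dgetrf, a_real)  # real
t_z = bench(lapack.zgetrf, a_cplx)  # complex
print(f"dgetrf: {t_d:.3f} s   zgetrf: {t_z:.3f} s   ratio: {t_z/t_d:.2f}")
[/code]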
So rather than go into all that - just delete the dbarrc line from your input file and rerun your tests. How does that affect the timings?
The get_accum_energies stuff (summing the energies over the cores for averaging) is the DMC rate-limiting step in the case where the cores don't have enough work to do in moving the configs. In a full production run (which this isn't) you might consider using considerably more configs per node, and you'll probably find get_accum_energies becomes a much smaller fraction of the total time. Still, I wouldn't expect it to take 30% of the time. I wonder if there's some other reason for that? Hmm... Well, we can continue when we have the new timings.
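For what it's worth, here's the shape of what a get_accum_energies-type step amounts to - a schematic mpi4py sketch, not CASINO's actual code, with nconfig_per_core an illustrative number I made up. The point is that the cost of the collective depends on the core count and not on how many configs each core moves, so giving each core more work shrinks its share of the total:

[code]
# Schematic per-step energy accumulation across MPI ranks (not CASINO code).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
nconfig_per_core = 10  # illustrative; more configs per core = more local work

# Stand-in for the local energies of this rank's configs after a DMC move.
local_energies = np.random.rand(nconfig_per_core)

# Sum (energy, config-count) pairs over all cores in one collective call.
local_acc = np.array([local_energies.sum(), float(nconfig_per_core)])
global_acc = np.empty(2)
comm.Allreduce(local_acc, global_acc, op=MPI.SUM)

if comm.rank == 0:
    print("step mean energy:", global_acc[0] / global_acc[1])
[/code]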
M.