Dear all, and dear CASINO developers in particular,
I have nasty and big calculations running. The blip file is 50G. I have 64G RAM per node. My calculations started off fine (at least, I cannot see anything weird), but then all of them died with memory issues. The weird thing is that they all got through a couple of blocks (the entire equilibration and about 4 to 8 stats blocks with 80 steps per block) before they died. In spite of the fact that all jobs run the same calculation (difference is only in the twist), they all died at different moments in execution time. The calculation at the gamma point ran fine, but that of course does not need complex wfns.
So, basically I have three questions:
a.) how can this (dying after several blocks have been executed) happen? (This is a curiosity-driven question and is actually of minor immediate urgency.)
b.) should my runs be fine so far? i.e. can I restart them using the config.out from the last block, possibly using nodes which have more memory, or using fewer blocks if that would help? Or do I have to fear that there were some severe memory spillages in the earlier blocks too?
c.) if restarting is fine, does it matter that these nodes aren't haswell nodes while the others were (It will affect runtime of course, but would it harm the calculations in some way when restarting?).
Thanks for any help
Katharina
Calc. dying with memory issues after exec. of several blocks
Re: Calc. dying with memory issues after exec. of several bl
Dear Katharina,
Sorry to hear about the memory problems.
a) CASINO allocates configurations when they are born and deallocates them when they die, so the memory requirements fluctuate in time. The details of memory management are handled by the compiler and we have to hope that it is sensible. Perhaps this is behind the issue?
b) If CASINO has completed writing out a config.out file (you can check using format_configs) then it should be safe to restart from it.
c) Restarting on a different machine is fine so long as the endianness is the same. If the endianness is different then CASINO should protest when it tries to read the bwfn.data.bin file. If endianness is a problem then you can use format_configs to produce an unambiguous, formatted version of config.in, which you can then unformat on the new machine by running format_configs again.
Best wishes,
Neil.
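As a rough illustration of the endianness point in (c), the sketch below shows why raw binary data written on one machine can be misread on another, while a formatted (text) representation survives the trip. It is not CASINO code and does not reproduce the actual layout of config.in or bwfn.data.bin; the helper names `roundtrip_binary` and `roundtrip_text` are made up for this example.

```python
import struct


def roundtrip_binary(value, write_order, read_order):
    """Pack a double with one byte order and unpack it with another.

    write_order / read_order are struct byte-order prefixes:
    "<" for little-endian, ">" for big-endian.
    """
    raw = struct.pack(write_order + "d", value)
    return struct.unpack(read_order + "d", raw)[0]


def roundtrip_text(value):
    """Write the number as text and parse it back (byte order plays no role)."""
    return float(repr(value))


energy = 1.5  # arbitrary example value, not taken from any real calculation

same_machine = roundtrip_binary(energy, "<", "<")   # written and read the same way
other_machine = roundtrip_binary(energy, ">", "<")  # written big-endian, read little-endian
via_text = roundtrip_text(energy)

print("same byte order :", same_machine)
print("mixed byte order:", other_machine)
print("formatted text  :", via_text)
```

Only the mixed byte-order case returns a different number, which is the situation the formatted intermediate file is meant to avoid.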
Re: Calc. dying with memory issues after exec. of several bl
Hi Neil!
Thank you for your fast reply, so I will restart the calcs. But your answer to a.) does not sound reasonable to me: Typically (and this is also what I see for the gamma-point calculation) the number of configs is highest during the first stage of equilibration, then goes down and then oscillates, but basically never reaches its initial high. So if I get through equilibration, why do I not get through the stats?
Thank you and all the best,
Katharina
Re: Calc. dying with memory issues after exec. of several bl
Dear Katharina,
Thanks very much for reporting the issue - I agree there is a memory problem. It seems to have been introduced in patch 1d6d14ae, in which the call to dmc_annihilate_configs at the end of dmc_main was bypassed for the equilibration stage of a DMC calculation. This means that the data for the configuration population at the end of equilibration becomes detached and sits unused in memory during the subsequent statistics accumulation. If you start again with runtype=dmc_stats, you should have a bit more memory available. To fix the problem, replace

    ! Clear configs.
    if(iaccum.or.(isdmc.and..not.iaccum))then
      call dmc_annihilate_configs
    endif

with

    ! Clear configs.
    call dmc_annihilate_configs

in dmc.f90.
At least this bug doesn't affect any results and, since nobody has noticed until just now, obviously hasn't caused too much inconvenience.
Thanks again,
Neil.
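For readers who want a picture of the mechanism described in this post, here is a toy model in Python. It is only a sketch under stated assumptions: `Population`, `Pool`, and `peak_bytes_during_stats` are hypothetical names, the sizes are arbitrary, and nothing here mirrors CASINO's real data structures. It shows only that if the equilibration population is not cleared, its memory is still held while the statistics-accumulation population is alive.

```python
class Population:
    """Hypothetical stand-in for a set of DMC configurations."""

    def __init__(self, nbytes):
        self.nbytes = nbytes


class Pool:
    """Tracks which populations are still holding memory."""

    def __init__(self):
        self.live = []

    def create(self, n_configs, bytes_per_config):
        pop = Population(n_configs * bytes_per_config)
        self.live.append(pop)
        return pop

    def annihilate(self, pop):
        # Plays the role of clearing the configs: the memory is released.
        self.live.remove(pop)

    def bytes_in_use(self):
        return sum(pop.nbytes for pop in self.live)


def peak_bytes_during_stats(clear_after_equil, n_configs=100, bytes_per_config=1000):
    """Return the memory held while the stats population exists (toy numbers)."""
    pool = Pool()
    equil = pool.create(n_configs, bytes_per_config)
    if clear_after_equil:
        pool.annihilate(equil)  # behaviour with the unconditional clear
    # Without the clear, the equilibration population is still counted below.
    stats = pool.create(n_configs, bytes_per_config)
    held = pool.bytes_in_use()
    pool.annihilate(stats)
    return held


print("without clearing after equilibration:", peak_bytes_during_stats(False))
print("with clearing after equilibration   :", peak_bytes_during_stats(True))
```

In this toy setup the uncleared case holds twice as much as the cleared one during accumulation; in a real run the overhead would depend on the actual population sizes, which the thread does not give.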
Re: Calc. dying with memory issues after exec. of several bl
Well spotted! I've added Neil's fix to the main distribution available on the website.
M.
Re: Calc. dying with memory issues after exec. of several bl
Hi Neil!
Ah, now this makes sense! I guess it is very bad luck to really run into this situation! Thanks for spotting the problem!
All the best,
Katharina