OK, the runqmc script is smart enough to know that it needs the value of NCORE because whoever wrote your arch file included some multi-line bash scripting to define the maximum wall time (MAX_WALLTIME) as follows:
Code:
#-! *MAX_WALLTIME:
#-! case "&USER.QUEUE&" in
#-!   interactive) echo 30m ;;
#-!   debug) echo 30m ;;
#-!   premium) echo 12h ;;
#-!   regular)
#-!     if ((&NCORE&<12288)) ; then
#-!       echo 24h
#-!     elif ((&NCORE&<49152)) ; then
#-!       echo 24h
#-!     elif ((&NCORE&<98304)) ; then
#-!       echo 24h
#-!     else
#-!       echo 12h
#-!     fi ;;
#-!   low) echo 12h ;;
#-! esac
You may think that you are defining NCORE specifically with 'runqmc -p 96', but -p actually gives the number of MPI processes, NPROC (as you can see from the verbose output), and this is not necessarily the same as the number of reserved physical cores. Parallel machines are so much more complicated these days!
You may also think that you have given it enough information to work out NCORE, but strictly speaking you haven't. Given the number of MPI processes NPROC (a theoretical characteristic of the requested job) and CORES_PER_NODE_CLUSTER (a physical characteristic of the machine), it would also need the number of MPI processes per node, which you specify through the --ppn flag. That said, there should probably be a default that assumes processes per node = cores per node, and I'm not entirely sure why it isn't doing something like that here. The next time we have a rainy day I might plough through the hideously complicated scripts that implement this and try to figure out exactly what it's doing and why.
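To make the bookkeeping concrete, here is a minimal sketch of how the three quantities relate, assuming that whole nodes are reserved and that every core on a reserved node counts towards NCORE. The function name reserved_cores is made up for illustration and this is not taken from the runqmc source; it only shows why NPROC and the cores per node are not enough without a processes-per-node value.

```shell
# Hypothetical helper, NOT part of runqmc: estimate the number of reserved
# cores from NPROC, processes per node (ppn) and cores per node.
# Assumption: the scheduler reserves whole nodes.
reserved_cores() {
  local nproc=$1 ppn=$2 cores_per_node=$3
  # Round the node count up, since a partly used node is still reserved.
  local nnode=$(( (nproc + ppn - 1) / ppn ))
  echo $(( nnode * cores_per_node ))
}

reserved_cores 96 24 24   # 4 nodes, fully packed   -> prints 96
reserved_cores 96 12 24   # 8 nodes, half populated -> prints 192
```

The same 'runqmc -p 96' request therefore corresponds to 96 or 192 reserved cores depending on the processes per node, which is why -p alone does not pin down NCORE.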
In the meantime, you can probably fix your problem by rewriting the MAX_WALLTIME definition in terms of variables that the arch file is aware of, so e.g. change the NCORE-dependent bit of it to the following (the node thresholds are just the core thresholds divided by 24 cores per node):
Code:
#-! if ((&NNODE&<512)) ; then
#-!   echo 24h
#-! elif ((&NNODE&<2048)) ; then
#-!   echo 24h
#-! elif ((&NNODE&<4096)) ; then
#-!   echo 24h
#-! else
#-!   echo 12h
#-! fi ;;
noting that the if block can be simplified, since all but one of the branches end up defining MAX_WALLTIME as '24h'. As previously discussed, you also need to change CORES_PER_NODE_CLUSTER from 12 to the correct value of 24.
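The simplification mentioned above could look something like the sketch below. It is written as an ordinary shell function so the logic can be checked outside runqmc. The function name max_walltime and its two arguments are stand-ins for the &USER.QUEUE& and &NNODE& substitutions, not anything defined by the arch file format. The wall times and the 4096-node threshold are the ones from the definition quoted above.

```shell
# Stand-alone version of the simplified MAX_WALLTIME logic.
# $1 stands in for &USER.QUEUE&, $2 for &NNODE& (hypothetical names).
max_walltime() {
  local queue=$1 nnode=$2
  case "$queue" in
    interactive|debug) echo 30m ;;
    premium|low) echo 12h ;;
    regular)
      # The three 24h branches collapse into a single test.
      if [ "$nnode" -lt 4096 ] ; then
        echo 24h
      else
        echo 12h
      fi ;;
  esac
}

max_walltime regular 4      # prints 24h
max_walltime regular 4096   # prints 12h
```

In the arch file itself the regular branch would then need only one test on &NNODE& against 4096, with 24h below it and 12h otherwise.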
Let me know if that works.
M.