Out of Memory Error during CPKS Calculation for Large System (400+ atoms)
-
Hi,
I'm trying to run a CPKS calculation on a system with 400+ atoms, but the job keeps getting aborted due to out of memory (OOM) issues. I'm using 200 GB as the maximum memory limit per job (which is also the maximum allowed on our cluster), but the calculation still fails with OOM errors.
Is there anything I can do to reduce the memory load during the CPKS step (e.g., approximations, splitting into smaller jobs, etc.)?
I tried using LOWMEM, but it did not help.
INPUT.dat
-
Hi,
The CPKS implementation in PCRYSTAL is parallelized according to a replicated-data strategy, which means that each process has a copy of the main arrays. Assume that each process (usually coinciding with each CPU core) allocates X GB of memory. If you run the job over n CPU cores in the same node, the total amount of required memory on the node will be nX.
So, one way to reduce the memory requirement of a replicated-data calculation is to lower the number of used CPU cores per node.
As an example, if you have a node with 128 CPU cores, try running the calculation on just 64 or 32 cores, making sure no other jobs run on the node at the same time. This effectively increases the memory available to each process.
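As a back-of-the-envelope check of the nX scaling described above, here is a minimal sketch; the per-process footprint X is a hypothetical value, and the real figure depends on your system, basis set, and settings:

```shell
# Replicated-data scaling: each MPI process holds a full copy of the main
# arrays, so the node-level requirement is (number of ranks) x (footprint X).
PER_RANK_GB=6                            # hypothetical per-process footprint X
echo $(( 128 * PER_RANK_GB ))            # all 128 cores: 768 GB, far over a 200 GB limit
echo $(( 64  * PER_RANK_GB ))            # 64 ranks: 384 GB, still too much
echo $(( 32  * PER_RANK_GB ))            # 32 ranks: 192 GB, fits under 200 GB
```

With these assumed numbers, only the 32-rank run would fit on a 200 GB node, which is why dropping the core count per node can rescue an otherwise identical calculation.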
Hope this helps.
-
Hi Gryffindor,
The only way to improve memory management without compromising the calculation parameters, as Alessandro correctly pointed out, would be to run with a lower number of MPI processes.
If you don’t want to "waste" CPU cores in the process, the OpenMP version of the code should be compatible with CPHF/CPKS. You just need to export OMP_NUM_THREADS according to the number of MPI processes used, ensuring that:
CPU cores = MPI processes × OMP_NUM_THREADS
By doing so, you should be able to fully exploit all resources while using memory more effectively. Some references can be found in the tutorial and in the CRYSTAL23 paper.
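The relation above can be wired directly into a job script. A minimal sketch, assuming a 128-core node; the executable name and launch line are illustrative assumptions, so check how the hybrid MPI/OpenMP binary is named and invoked in your site's CRYSTAL installation:

```shell
# Split the node's cores between MPI ranks (replicated memory) and
# OpenMP threads (shared memory), so that ranks x threads = total cores.
TOTAL_CORES=128                               # cores on the node
MPI_RANKS=32                                  # fewer ranks -> less replicated memory
OMP_NUM_THREADS=$(( TOTAL_CORES / MPI_RANKS ))
export OMP_NUM_THREADS                        # here: 4 threads per MPI process
echo "$MPI_RANKS ranks x $OMP_NUM_THREADS threads = $TOTAL_CORES cores"
# mpirun -np "$MPI_RANKS" PcrystalOMP < INPUT.dat   # launch line is an assumption
```

Compared with simply leaving 96 cores idle, this keeps all 128 cores busy while only 32 processes hold copies of the main arrays.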
Hope this helps!
-
Hi Alessandro and Giacomo,
Thank you for the clear and helpful explanations!
I tried running the calculation with 32 cores, and it indeed helped with memory management. I’ll continue experimenting to optimize performance. Really appreciate your guidance and the references!