PCrystal job stuck when run across several nodes
-
Colleagues, this is likely not a PCrystal problem but a problem with my HPC cluster. When I run PCrystal across several nodes, it always stops at the same point (shown below) and just sits there. The job is clearly running according to squeue, but no progress is made for many hours and the output file has not been updated since it reached that point. If I kill the job and run it on a single node, it runs with no problem. When I SSH into the node, I see all the cores nicely occupied by PCrystal.
Where could the problem be?
MAX NUMBER OF SCF CYCLES 200 CONVERGENCE ON DELTAP 10**-16
WEIGHT OF F(I) IN F(I+1) 30% CONVERGENCE ON ENERGY 10**-10
SHRINK. FACT.(MONKH.) 4 4 4 NUMBER OF K POINTS IN THE IBZ 30
SHRINKING FACTOR(GILAT NET) 4 NUMBER OF K POINTS(GILAT NET) 30
*** K POINTS COORDINATES (OBLIQUE COORDINATES IN UNITS OF IS = 4)
1-R( 0 0 0) 2-C( 1 0 0) 3-R( 2 0 0) 4-C( 0 1 0)
5-C( 1 1 0) 6-C( 2 1 0) 7-R( 0 2 0) 8-C( 1 2 0)
9-R( 2 2 0) 10-C( 0 0 1) 11-C( 1 0 1) 12-C( 2 0 1)
13-C( 3 0 1) 14-C( 0 1 1) 15-C( 1 1 1) 16-C( 2 1 1)
17-C( 3 1 1) 18-C( 0 2 1) 19-C( 1 2 1) 20-C( 2 2 1)
21-C( 3 2 1) 22-R( 0 0 2) 23-C( 1 0 2) 24-R( 2 0 2)
25-C( 0 1 2) 26-C( 1 1 2) 27-C( 2 1 2) 28-R( 0 2 2)
29-C( 1 2 2) 30-R( 2 2 2)
DIRECT LATTICE VECTORS COMPON. (A.U.) RECIP. LATTICE VECTORS COMPON. (A.U.)
X Y Z X Y Z
17.8044525 0.0000000 -0.1043147 0.3535604 -0.0000000 0.1127826
0.0000000 24.2698591 -0.0000000 -0.0000000 0.2588884 -0.0000000
-3.5260288 0.0000000 11.0536983 0.0033366 0.0000000 0.5694882
DISK SPACE FOR EIGENVECTORS (FTN 10) 44204368 REALS
SYMMETRY ADAPTION OF THE BLOCH FUNCTIONS ENABLED
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=96
#SBATCH -t 100:00:00
#SBATCH -o vasp.out
#SBATCH -e vasp.err
#SBATCH -p ceres
#SBATCH --export=ALL
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH -J /90daydata/urea_kinetics/MgNH4SO4/camB3LYP_pob2TZVP/Raman
module unload intel
module load crystal
mpirun -np 384 Pcrystal < INPUT
-
Hi job314,
A good reference is the "how to run" tutorial.
In your calculation you're using 30 k points. Depending on the system symmetry, these generate a certain number of symmetry-adapted Bloch functions (SABFs), which are then distributed across processes. This is the main parallelization bottleneck in Pcrystal: ideally, you want to use a number of MPI processes equal to or lower than the number of SABFs.
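For instance, you can read that count back from a previous output with a quick one-liner (a minimal sketch; output.out stands for your actual output file, and the IBZ k-point count is only a rough lower bound on the number of SABFs):
# Pull the IBZ k-point count from the Pcrystal output line
# "SHRINK. FACT.(MONKH.) ... NUMBER OF K POINTS IN THE IBZ ..."
NKIBZ=$(grep 'NUMBER OF K POINTS IN THE IBZ' output.out | awk '{print $NF}')
echo "IBZ k points: ${NKIBZ}"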
However, the behavior you describe seems unusual, even in cases where the number of MPI processes exceeds the number of SABFs. It might be an inter-node communication issue.
Have you tried running other MPI programs across multiple nodes to see if the problem persists?
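For a quick launcher sanity check (a minimal sketch; it only verifies that mpirun can start processes on every allocated node, it does not exercise MPI communication itself), you could try:
# Launch one trivial process per rank and count how many land on each node;
# every allocated node should appear with the expected rank count.
mpirun -np 384 hostname | sort | uniq -c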
Also, check how MPI processes are bound to physical cores. You can inspect this by adding a printing flag to your mpirun command (this is also explained in the tutorial above).
Let me know if you manage to get the bindings.
-
Hi Giacomo, I am well aware of how to run PCrystal; I've been doing it for many years. On this particular HPC cluster, however, I encounter the problem I described above. The same thing happens with VASP, and I am looking for solutions.
-
I tried running on 4 nodes and it got stuck; I changed to 2 nodes and it is working. This unpredictability is what bothers me. I will try testing the bindings and report here.
-
It only returns this when I test the bindings:
/var/spool/slurmd/job15381998/slurm_script: line 17: MPI_processes: No such file or directory
/var/spool/slurmd/job15381998/slurm_script: line 18: MPI_processes: No such file or directory
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --tasks-per-node=96
#SBATCH -t 140:00:00
#SBATCH -o vasp.out
#SBATCH -e vasp.err
#SBATCH -p ceres
#SBATCH --export=ALL
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH -J /90daydata/urea_kinetics/struvite/camB3LYP_pobTZVP
module unload intel
module load crystal
mpirun --report-bindings -np <MPI_processes> /project/urea_kinetics/CRYSTAL/1.0.1/bin/Pcrystal
mpirun -print-rank-map -np <MPI_processes> /project/urea_kinetics/CRYSTAL/1.0.1/bin/Pcrystal
mpirun -np 192 Pcrystal < INPUT
-
Hi Job,
It seems there might be a bit of confusion regarding how to report MPI bindings; maybe we will update the tutorial to make it clearer. The errors you got come from the placeholder: the shell read the literal <MPI_processes> as an input redirection from a (nonexistent) file named MPI_processes rather than as a process count. To display binding information correctly, replace the placeholder with your actual number of processes and include the appropriate flag in your usual mpirun command. For example:
- If you're using OpenMPI:
mpirun --report-bindings -np 192 Pcrystal
- If you're using Intel MPI:
mpirun -print-rank-map -np 192 Pcrystal
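In both cases, keep the rest of your usual command line (e.g. the < INPUT redirection) unchanged; the printing flag is the only addition.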
To check which MPI implementation you're using, you can inspect the loaded modules in your environment, or simply run:
mpirun --version
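Typically (the exact wording varies with the version), OpenMPI reports a banner like "mpirun (Open MPI) 4.1.1", while Intel MPI reports something like "Intel(R) MPI Library for Linux* OS, Version 2021.9".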
If you're using a different MPI implementation, feel free to let me know. I'd be happy to help you find the right way to print the bindings.
-
OK, here we go. It just gets stuck, always at the same position in the output.
(ceres20-compute-46:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95)
(ceres24-compute-18:96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191)
export TMPDIR=/local/bgfs/jonas.baltrusaitis/15383115
export TMOUT=5400
export SINGULARITY_TMPDIR=/local/bgfs/jonas.baltrusaitis/15383115
MAX NUMBER OF SCF CYCLES 200 CONVERGENCE ON DELTAP 10**-20
WEIGHT OF F(I) IN F(I+1) 30% CONVERGENCE ON ENERGY 10**-10
SHRINK. FACT.(MONKH.) 6 6 6 NUMBER OF K POINTS IN THE IBZ 64
SHRINKING FACTOR(GILAT NET) 6 NUMBER OF K POINTS(GILAT NET) 64
*** K POINTS COORDINATES (OBLIQUE COORDINATES IN UNITS OF IS = 6)
1-R( 0 0 0) 2-C( 1 0 0) 3-C( 2 0 0) 4-R( 3 0 0)
5-C( 0 1 0) 6-C( 1 1 0) 7-C( 2 1 0) 8-C( 3 1 0)
9-C( 0 2 0) 10-C( 1 2 0) 11-C( 2 2 0) 12-C( 3 2 0)
13-R( 0 3 0) 14-C( 1 3 0) 15-C( 2 3 0) 16-R( 3 3 0)
17-C( 0 0 1) 18-C( 1 0 1) 19-C( 2 0 1) 20-C( 3 0 1)
21-C( 0 1 1) 22-C( 1 1 1) 23-C( 2 1 1) 24-C( 3 1 1)
25-C( 0 2 1) 26-C( 1 2 1) 27-C( 2 2 1) 28-C( 3 2 1)
29-C( 0 3 1) 30-C( 1 3 1) 31-C( 2 3 1) 32-C( 3 3 1)
33-C( 0 0 2) 34-C( 1 0 2) 35-C( 2 0 2) 36-C( 3 0 2)
37-C( 0 1 2) 38-C( 1 1 2) 39-C( 2 1 2) 40-C( 3 1 2)
41-C( 0 2 2) 42-C( 1 2 2) 43-C( 2 2 2) 44-C( 3 2 2)
45-C( 0 3 2) 46-C( 1 3 2) 47-C( 2 3 2) 48-C( 3 3 2)
49-R( 0 0 3) 50-C( 1 0 3) 51-C( 2 0 3) 52-R( 3 0 3)
53-C( 0 1 3) 54-C( 1 1 3) 55-C( 2 1 3) 56-C( 3 1 3)
57-C( 0 2 3) 58-C( 1 2 3) 59-C( 2 2 3) 60-C( 3 2 3)
61-R( 0 3 3) 62-C( 1 3 3) 63-C( 2 3 3) 64-R( 3 3 3)
DIRECT LATTICE VECTORS COMPON. (A.U.) RECIP. LATTICE VECTORS COMPON. (A.U.)
X Y Z X Y Z
13.1430453 0.0000000 0.0000000 0.4780616 0.0000000 0.0000000
0.0000000 11.6066979 0.0000000 0.0000000 0.5413413 0.0000000
0.0000000 0.0000000 21.1989478 0.0000000 0.0000000 0.2963914
DISK SPACE FOR EIGENVECTORS (FTN 10) 53868000 REALS
SYMMETRY ADAPTION OF THE BLOCH FUNCTIONS ENABLED
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT gordsh1 TELAPSE 186.18 TCPU 45.44