PCrystal job stuck when run between several nodes

Running CRYSTAL in Parallel
  job314
  #1

    Colleagues, this is likely not a PCrystal problem but a problem with my HPC. When I run PCrystal across several nodes, it always stops at the same point shown below and just sits there. The job is clearly still running according to squeue, but no progress is made for many hours and the output has not been updated since it reached the point below. If I kill it and run it on a single node, it runs with no problem. When I SSH into the node, I see all the cores nicely occupied with PCrystal.

    Where could the problem be?


    MAX NUMBER OF SCF CYCLES 200 CONVERGENCE ON DELTAP 10**-16
    WEIGHT OF F(I) IN F(I+1) 30% CONVERGENCE ON ENERGY 10**-10
    SHRINK. FACT.(MONKH.) 4 4 4 NUMBER OF K POINTS IN THE IBZ 30
    SHRINKING FACTOR(GILAT NET) 4 NUMBER OF K POINTS(GILAT NET) 30


    *** K POINTS COORDINATES (OBLIQUE COORDINATES IN UNITS OF IS = 4)
    1-R( 0 0 0) 2-C( 1 0 0) 3-R( 2 0 0) 4-C( 0 1 0)
    5-C( 1 1 0) 6-C( 2 1 0) 7-R( 0 2 0) 8-C( 1 2 0)
    9-R( 2 2 0) 10-C( 0 0 1) 11-C( 1 0 1) 12-C( 2 0 1)
    13-C( 3 0 1) 14-C( 0 1 1) 15-C( 1 1 1) 16-C( 2 1 1)
    17-C( 3 1 1) 18-C( 0 2 1) 19-C( 1 2 1) 20-C( 2 2 1)
    21-C( 3 2 1) 22-R( 0 0 2) 23-C( 1 0 2) 24-R( 2 0 2)
    25-C( 0 1 2) 26-C( 1 1 2) 27-C( 2 1 2) 28-R( 0 2 2)
    29-C( 1 2 2) 30-R( 2 2 2)

    DIRECT LATTICE VECTORS COMPON. (A.U.) RECIP. LATTICE VECTORS COMPON. (A.U.)
    X Y Z X Y Z
    17.8044525 0.0000000 -0.1043147 0.3535604 -0.0000000 0.1127826
    0.0000000 24.2698591 -0.0000000 -0.0000000 0.2588884 -0.0000000
    -3.5260288 0.0000000 11.0536983 0.0033366 0.0000000 0.5694882

    DISK SPACE FOR EIGENVECTORS (FTN 10) 44204368 REALS

    SYMMETRY ADAPTION OF THE BLOCH FUNCTIONS ENABLED

    #!/bin/bash
    #SBATCH --nodes=4
    #SBATCH --tasks-per-node=96
    #SBATCH -t 100:00:00
    #SBATCH -o vasp.out
    #SBATCH -e vasp.err
    #SBATCH -p ceres
    #SBATCH --export=ALL
    #SBATCH --mail-type=ALL
    #SBATCH [email protected]
    #SBATCH -J /90daydata/urea_kinetics/MgNH4SO4/camB3LYP_pob2TZVP/Raman

    module unload intel
    module load crystal
    mpirun -np 384 Pcrystal < INPUT

    GiacomoAmbrogio (Developer)
    #2

      Hi job314,

      A good reference is the "how to run" tutorial.

      In your calculation you are using 30 k-points. Depending on the system symmetry, this generates a certain number of symmetry-adapted Bloch functions (SABFs), which are then distributed across processes. This is the main parallelization bottleneck in Pcrystal: ideally, you want the number of MPI processes to be equal to or lower than the number of SABFs.
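
      As a purely hypothetical sizing example (the real SABF count must be read from your own output; 120 is an invented number): if a run produced 120 SABFs, a matching submission would request at most 120 MPI processes, for instance:

      #SBATCH --nodes=2
      #SBATCH --tasks-per-node=60
      # 2 nodes x 60 tasks = 120 MPI ranks, i.e. no more than the assumed 120 SABFs
      mpirun -np 120 Pcrystal < INPUT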

      However, the behavior you describe seems unusual, even in cases where the number of MPI processes exceeds the number of SABFs. It might be an inter-node communication issue.

      Have you tried running other MPI programs across multiple nodes to see if the problem persists?

      Also, check how the MPI processes are bound to physical cores. You can check this by adding a reporting flag to your mpirun command (this is also explained in the tutorial above); see the sketch below.
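
      A minimal sketch, assuming OpenMPI is the implementation behind your crystal module (the flag differs for other MPI implementations):

      # Print how each rank is bound to cores before Pcrystal starts
      mpirun --report-bindings -np 384 Pcrystal < INPUT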

      Let me know if you manage to get the bindings.

      Giacomo Ambrogio, PhD Student
      Department of Chemistry - University of Torino
      V. Giuria 5, 10125 Torino (Italy)

      job314
      #3

        Hi Giacomo, I am well aware of how to run PCrystal; I have been doing it for many years. On this particular HPC, however, I run into the problem I described above. The same thing also happens with VASP, and I am looking for solutions.

        job314
        #4

          I tried running on 4 nodes and it got stuck; I changed to 2 nodes and it is working. This unpredictability is what bothers me. I will try testing the bindings and report back here.

          job314
          #5

            It only returns this when I test the bindings:

            /var/spool/slurmd/job15381998/slurm_script: line 17: MPI_processes: No such file or directory
            /var/spool/slurmd/job15381998/slurm_script: line 18: MPI_processes: No such file or directory

            #!/bin/bash
            #SBATCH --nodes=2
            #SBATCH --tasks-per-node=96
            #SBATCH -t 140:00:00
            #SBATCH -o vasp.out
            #SBATCH -e vasp.err
            #SBATCH -p ceres
            #SBATCH --export=ALL
            #SBATCH --mail-type=ALL
            #SBATCH [email protected]
            #SBATCH -J /90daydata/urea_kinetics/struvite/camB3LYP_pobTZVP

            module unload intel
            module load crystal

            mpirun --report-bindings -np <MPI_processes> /project/urea_kinetics/CRYSTAL/1.0.1/bin/Pcrystal
            mpirun -print-rank-map -np <MPI_processes> /project/urea_kinetics/CRYSTAL/1.0.1/bin/Pcrystal

            mpirun -np 192 Pcrystal < INPUT

            GiacomoAmbrogio (Developer)
            #6

              Hi Job,
              It seems there was a bit of confusion about how to report the MPI bindings; we may update the tutorial to make this clearer. In the tutorial, <MPI_processes> is just a placeholder for the actual number of MPI processes: if it is left in the command literally, the shell treats the angle brackets as file redirections, which is why your job reports "MPI_processes: No such file or directory".

              To display binding information correctly, you’ll need to include the appropriate flag in your usual mpirun command. For example:

              • If you're using OpenMPI:
              mpirun --report-bindings -np 192 Pcrystal
              
              • If you're using Intel MPI:
              mpirun -print-rank-map -np 192 Pcrystal
              

              To check which MPI implementation you're using, you can inspect the loaded modules in your environment, or simply run:

              mpirun --version
              

              If you're using a different MPI implementation, feel free to let me know. I'd be happy to help you find the right way to print the bindings.
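
              If the job is started through Slurm's own launcher instead of mpirun (just an assumption about your setup), srun can also report the binding it applies:

              # --cpu-bind=verbose makes each task print the CPU mask it was bound to
              srun --cpu-bind=verbose -n 192 Pcrystal < INPUT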

              Giacomo Ambrogio, PhD Student
              Department of Chemistry - University of Torino
              V. Giuria 5, 10125 Torino (Italy)

              job314
              #7

                OK, here we go. It is just stuck, always at the same position in the output:

                (ceres20-compute-46:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95)
                (ceres24-compute-18:96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191)

                export TMPDIR=/local/bgfs/jonas.baltrusaitis/15383115
                export TMOUT=5400
                export SINGULARITY_TMPDIR=/local/bgfs/jonas.baltrusaitis/15383115


                MAX NUMBER OF SCF CYCLES 200 CONVERGENCE ON DELTAP 10**-20
                WEIGHT OF F(I) IN F(I+1) 30% CONVERGENCE ON ENERGY 10**-10
                SHRINK. FACT.(MONKH.) 6 6 6 NUMBER OF K POINTS IN THE IBZ 64
                SHRINKING FACTOR(GILAT NET) 6 NUMBER OF K POINTS(GILAT NET) 64


                *** K POINTS COORDINATES (OBLIQUE COORDINATES IN UNITS OF IS = 6)
                1-R( 0 0 0) 2-C( 1 0 0) 3-C( 2 0 0) 4-R( 3 0 0)
                5-C( 0 1 0) 6-C( 1 1 0) 7-C( 2 1 0) 8-C( 3 1 0)
                9-C( 0 2 0) 10-C( 1 2 0) 11-C( 2 2 0) 12-C( 3 2 0)
                13-R( 0 3 0) 14-C( 1 3 0) 15-C( 2 3 0) 16-R( 3 3 0)
                17-C( 0 0 1) 18-C( 1 0 1) 19-C( 2 0 1) 20-C( 3 0 1)
                21-C( 0 1 1) 22-C( 1 1 1) 23-C( 2 1 1) 24-C( 3 1 1)
                25-C( 0 2 1) 26-C( 1 2 1) 27-C( 2 2 1) 28-C( 3 2 1)
                29-C( 0 3 1) 30-C( 1 3 1) 31-C( 2 3 1) 32-C( 3 3 1)
                33-C( 0 0 2) 34-C( 1 0 2) 35-C( 2 0 2) 36-C( 3 0 2)
                37-C( 0 1 2) 38-C( 1 1 2) 39-C( 2 1 2) 40-C( 3 1 2)
                41-C( 0 2 2) 42-C( 1 2 2) 43-C( 2 2 2) 44-C( 3 2 2)
                45-C( 0 3 2) 46-C( 1 3 2) 47-C( 2 3 2) 48-C( 3 3 2)
                49-R( 0 0 3) 50-C( 1 0 3) 51-C( 2 0 3) 52-R( 3 0 3)
                53-C( 0 1 3) 54-C( 1 1 3) 55-C( 2 1 3) 56-C( 3 1 3)
                57-C( 0 2 3) 58-C( 1 2 3) 59-C( 2 2 3) 60-C( 3 2 3)
                61-R( 0 3 3) 62-C( 1 3 3) 63-C( 2 3 3) 64-R( 3 3 3)

                DIRECT LATTICE VECTORS COMPON. (A.U.) RECIP. LATTICE VECTORS COMPON. (A.U.)
                X Y Z X Y Z
                13.1430453 0.0000000 0.0000000 0.4780616 0.0000000 0.0000000
                0.0000000 11.6066979 0.0000000 0.0000000 0.5413413 0.0000000
                0.0000000 0.0000000 21.1989478 0.0000000 0.0000000 0.2963914

                DISK SPACE FOR EIGENVECTORS (FTN 10) 53868000 REALS

                SYMMETRY ADAPTION OF THE BLOCH FUNCTIONS ENABLED
                TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT gordsh1 TELAPSE 186.18 TCPU 45.44

