Skip to content
  • Home
  • Recent
Collapse
Brand Logo
CRYSTAL23
Latest v1.0.1
Tutorials Try the Demo Get a License
Tutorials Try the Demo Get a License Instagram
  1. Home
  2. Technical Aspects
  3. Running CRYSTAL in Parallel
  4. OpenMP problem

OpenMP problem

Scheduled Pinned Locked Moved Running CRYSTAL in Parallel
9 Posts 3 Posters 137 Views
  • Oldest to Newest
  • Newest to Oldest
  • Most Votes
Log in to reply
This topic has been deleted. Only users with topic management privileges can see it.
  • fabio.dellasalaundefined Offline
    fabio.dellasalaundefined Offline
    fabio.dellasala
    wrote last edited by
    #1

    Hi, i'm running crystal OpenMP for some Hartree Fock periodic calculations.
    When I run crystal OpenMP from a script (several times , sweeping over the basis set), I get
    a strange behaviour.

    when I run directly I get (e.g.)
    ....
    CYC 5 ETOT(AU) -1.959844090722E+02 DETOT 1.01E-07 tst 1.86E-07 PX 1.17E-03
    CYC 6 ETOT(AU) -1.959844090642E+02 DETOT 7.93E-09 tst 1.14E-08 PX 2.23E-04
    CYC 7 ETOT(AU) -1.959844090650E+02 DETOT -7.24E-10 tst 1.19E-09 PX 2.23E-04

    when I run from the script I get
    CYC 5 ETOT(AU) -1.959844090722E+02 DETOT 1.01E-07 tst 1.86E-07 PX 1.17E-03
    CYC 6 ETOT(AU) -1.959844090642E+02 DETOT 7.93E-09 tst 1.14E-08 PX 2.23E-04
    CYC 7 ETOT(AU) -1.959844078015E+02 DETOT 1.26E-06 tst 1.19E-09 PX 2.23E-04

    thus everyting is the same BUT the last cycle in which the energy increases !!!

    In particular, it is the BIELET ZONE that differs

    < ::: BIELET ZONE E-E 9.2053136238046E+02
    < ::: TOTAL E-E 5.2408459522505E+01

    ::: BIELET ZONE E-E 9.2053136364389E+02
    ::: TOTAL E-E 5.2408460785929E+01

    I SUSPECT A PROBLEM IN INITIALIZATION OF SOME OPENMP VARIABILE
    BECAUSE THE VALUE OF THE FINAL TOTAL ENERGY ALSO DEPENDS ON HOW MANY TIMES
    I RUN THE SCRIPT !!!

    or which can be the explanation ?

    input follows.

    ----------input---------------
    Copper bulk
    CRYSTAL
    0 0 0
    225
    3.4
    1
    229 0. 0. 0.
    LATVEC
    40000
    END
    229 12
    INPUT
    19. 0 2 2 2 0 0
    30.220000 355.770158 0
    13.190000 70.865357 0
    33.130000 233.891976 0
    13.220000 53.947299 0
    38.420000 -31.272165 0
    13.260000 -2.741104 0
    0 0 3 2. 1.
    2.76963200000E+01 2.31132000000E-01
    1.35053500000E+01 -6.56811000000E-01
    8.81535500000E+00 -5.45875000000E-01
    0 0 1 1. 1. *
    6.000E+00 1.0
    0 0 1 0. 1. *
    2.000E+00 1.0
    0 0 1 0. 1. *
    4.000E-01 1.0
    0 0 1 0. 1. *
    2.000E-01 1.0
    0 2 2 6. 1.
    9.35043270000E+01 2.28290000000E-02
    1.62854640000E+01 -1.00951300000E+00
    0 2 1 0. 1. *
    7.000E+00 1.0
    0 2 1 0. 1. *
    2.000E+00 1.0
    0 2 1 0. 1. *
    8.000E-01 1.0
    0 3 4 10. 1.
    4.12250060000E+01 4.46940000000E-02
    1.23432500000E+01 2.12106000000E-01
    4.20192000000E+00 4.53423000000E-01
    1.37982500000E+00 5.33465000000E-01
    0 3 1 0. 1. *
    8.000E-01 1.0
    0 3 1 0. 1. *
    4.000E-01 1.0
    99 0
    END
    TOLDEE
    7
    MAXCYCLE
    25
    SHRINK
    24 48
    DIIS
    SLOSHING
    PRTDIIS
    HISTDIIS
    15
    TOLINTEG
    9 9 9 15 45
    ILASIZE
    20000
    FMIXING
    45
    FIXINDEX
    END
    GEOM
    CRYSTAL
    0 0 0
    225
    3.60
    1
    229 0. 0. 0.
    LATVEC
    40000
    END

    1 Reply Last reply
    0
    • GiacomoAmbrogioundefined Offline
      GiacomoAmbrogioundefined Offline
      GiacomoAmbrogio Developer
      wrote last edited by
      #2

      Hi Fabio,
      Could you please also send both outputs (from the OMP and non-OMP calculations)?
      This is a strange behaviour. In the meantime, I will run some tests.

      Giacomo Ambrogio, PhD Student
      Department of Chemistry - University of Torino
      V. Giuria 5, 10125 Torino (Italy)

      1 Reply Last reply
      0
      • fabio.dellasalaundefined Offline
        fabio.dellasalaundefined Offline
        fabio.dellasala
        wrote last edited by
        #3

        you can run the following script (on a single server or notebook)
        with the input attached.
        You will see that serial run are the same (as it must be)
        whereas OMP runs differs (and it cannot be ! )
        I can send you my outputs, which is your email ?

        #!/bin/bash

        cryser="/homex/atom/ATOMSOFT/CRYSTAL/NEWMPI2/bin/Linux-ifort_i64_OMP/dev/crystal"
        echo "----------RUN SERIAL----------"
        if [ ! -e inputhfser.out.1 ]; then
        $cryser < inputhf.d12 | tee inputhfser.out.1
        fi

        if [ ! -e inputhfser.out.2 ]; then
        $cryser < inputhf.d12 | tee inputhfser.out.2
        fi

        diff inputhfser.out.1 inputhfser.out.2 | grep -v TTT
        if [ ! -e inputhfserNC.out.1 ]; then
        $cryser < inputhfNC.d12 | tee inputhfserNC.out.1
        fi

        if [ ! -e inputhfser.out.3 ]; then
        $cryser < inputhf.d12 | tee inputhfser.out.3
        fi

        diff inputhfser.out.3 inputhfser.out.1 | grep -v TTT

        cry="/homex/atom/ATOMSOFT/CRYSTAL/OMPH/bin/Linux-ifort_i64_OMP/dev/crystalOMP"
        export OMP_NUM_THREADS=16

        $cry < inputhfnodiis.d12 | tee inputhf.out.1
        $cry < inputhfnodiis.d12 | tee inputhf.out.2
        exit
        echo "------------RUN DIRECT------------"
        if [ ! -e inputhf.out.1 ]; then
        $cry < inputhf.d12 | tee inputhf.out.1
        fi

        if [ ! -e inputhf.out.2 ]; then
        $cry < inputhf.d12 | tee inputhf.out.2
        fi

        echo "THIS MUST BE THE SAME TO ALL DIGITS, BUT IT IS NOT !"
        grep DETOT inputhf.out.1 >de1.dat
        grep DETOT inputhf.out.2 >de2.dat
        diff de1.dat de2.dat

        if [ ! -e inputhfNC.out.1 ]; then
        $cry < inputhfNC.d12 | tee inputhfNC.out.1
        fi

        if [ ! -e inputhf.out.3 ]; then
        $cry < inputhf.d12 | tee inputhf.out.3
        fi

        echo "THIS MUST BE THE SAME TO ALL DIGITS, BUT IT IS NOT !"
        grep DETOT inputhf.out.3 >de3.dat

        diff de1.dat de3.dat

        echo "=====================run with script===================="

        if [ ! -e inputhf.out.O1 ]; then
        runcry23OMP 16 inputhf
        mv inputhf.out inputhf.out.O1
        fi

        if [ ! -e inputhf.out.O2 ]; then
        runcry23OMP 16 inputhf
        mv inputhf.out inputhf.out.O2
        fi

        grep DETOT inputhf.out.O1 >deO1.dat
        grep DETOT inputhf.out.O2 >deO2.dat

        echo "THIS MUST BE THE SAME TO ALL DIGITS, BUT IT IS NOT !"
        diff deO1.dat deO2.dat

        if [ ! -e inputhfNC.out.O1 ]; then
        runcry23OMP 16 inputhfNC
        mv inputhfNC.out inputhfNC.out.O1
        fi

        if [ ! -e inputhf.out.O3 ]; then
        runcry23OMP 16 inputhf
        mv inputhf.out inputhf.out.O3
        fi

        grep DETOT inputhf.out.O3 >deO3.dat
        echo "THIS MUST BE THE SAME TO ALL DIGITS, BUT IT IS NOT !"
        diff deO1.dat deO3.dat

        with
        inputhf.d12
        lithium bcc
        CRYSTAL
        0 0 0
        229
        3.25
        1
        3 0. 0. 0.
        LATVEC
        20000
        END
        3 8
        0 0 6 2. 1.
        6.26926280100E+03 2.05409688260E-04
        9.40316124310E+02 1.59165540890E-03
        2.14221075280E+02 8.28698297070E-03
        6.07598401840E+01 3.38563742490E-02
        1.99151520320E+01 1.11032258760E-01
        7.31715097970E+00 2.74493833290E-01
        0 0 2 1. 1.
        2.97246742160E+00 2.37924564110E-01
        1.26398523140E+00 3.07654119240E-01
        0 0 1 0. 1. *
        5E-01 1.0
        0 0 1 0. 1. *
        2E-01 1.0
        0 0 1 0. 1. *
        7E-02 1.0
        0 2 1 0. 1. *
        5E+00 1.0
        0 2 1 0. 1. *
        5E-01 1.0
        0 2 1 0. 1. *
        2E-01 1.0
        99 0
        END
        TOLDEE
        7
        MAXCYCLE
        25
        SHRINK
        24 48
        SLOSHING
        PRTDIIS
        HISTDIIS
        15
        TOLINTEG
        9 9 9 15 45
        FMIXING
        45
        FIXINDEX
        END
        GEOM
        CRYSTAL
        0 0 0
        229
        3.443000000
        1
        3 0. 0. 0.
        LATVEC
        20000
        END

        and
        inputhfNC.d12

        lithium bcc
        CRYSTAL
        0 0 0
        229
        3.25
        1
        3 0. 0. 0.
        LATVEC
        20000
        END
        3 8
        0 0 6 2. 1.
        6.26926280100E+03 2.05409688260E-04
        9.40316124310E+02 1.59165540890E-03
        2.14221075280E+02 8.28698297070E-03
        6.07598401840E+01 3.38563742490E-02
        1.99151520320E+01 1.11032258760E-01
        7.31715097970E+00 2.74493833290E-01
        0 0 2 1. 1.
        2.97246742160E+00 2.37924564110E-01
        1.26398523140E+00 3.07654119240E-01
        0 0 1 0. 1. *
        9E-01 1.0
        0 0 1 0. 1. *
        2E-01 1.0
        0 0 1 0. 1. *
        5E-02 1.0
        0 2 1 0. 1. *
        1E+01 1.0
        0 2 1 0. 1. *
        7E-01 1.0
        0 2 1 0. 1. *
        2E-01 1.0
        99 0
        END
        TOLDEE
        7
        MAXCYCLE
        25
        SHRINK
        24 48
        SLOSHING
        PRTDIIS
        HISTDIIS
        15
        TOLINTEG
        9 9 9 15 45
        FMIXING
        45
        FIXINDEX
        END
        GEOM
        CRYSTAL
        0 0 0
        229
        3.443000000
        1
        3 0. 0. 0.
        LATVEC
        20000
        END

        1 Reply Last reply
        0
        • GiacomoAmbrogioundefined Offline
          GiacomoAmbrogioundefined Offline
          GiacomoAmbrogio Developer
          wrote last edited by GiacomoAmbrogio
          #4

          Hi Fabio,
          Regarding your initial question, I ran the public OpenMP version of the code (using 4 threads) multiple times with your input, and I consistently obtain the same results across runs. The SCF procedure converges smoothly in 6 cycles, as shown below:

          $ grep DETOT input_4th.out
           CYC   0 ETOT(AU) -1.959786174741E+02 DETOT -1.96E+02 tst  0.00E+00 PX  1.00E+00
           CYC   1 ETOT(AU) -1.959830545289E+02 DETOT -4.44E-03 tst  0.00E+00 PX  1.00E+00
           CYC   2 ETOT(AU) -1.959842257372E+02 DETOT -1.17E-03 tst  5.19E-04 PX  2.32E-02
           CYC   3 ETOT(AU) -1.959843829641E+02 DETOT -1.57E-04 tst  6.44E-04 PX  5.14E-02
           CYC   4 ETOT(AU) -1.959844090345E+02 DETOT -2.61E-05 tst  4.59E-06 PX  4.09E-03
           CYC   5 ETOT(AU) -1.959844090249E+02 DETOT  9.60E-09 tst  1.92E-07 PX  1.01E-03
           CYC   6 ETOT(AU) -1.959844090704E+02 DETOT -4.56E-08 tst  8.72E-09 PX  1.01E-03
          $ grep DETOT input_4th_2.out
           CYC   0 ETOT(AU) -1.959786174741E+02 DETOT -1.96E+02 tst  0.00E+00 PX  1.00E+00
           CYC   1 ETOT(AU) -1.959830545289E+02 DETOT -4.44E-03 tst  0.00E+00 PX  1.00E+00
           CYC   2 ETOT(AU) -1.959842257372E+02 DETOT -1.17E-03 tst  5.19E-04 PX  2.32E-02
           CYC   3 ETOT(AU) -1.959843829641E+02 DETOT -1.57E-04 tst  6.44E-04 PX  5.14E-02
           CYC   4 ETOT(AU) -1.959844090345E+02 DETOT -2.61E-05 tst  4.59E-06 PX  4.09E-03
           CYC   5 ETOT(AU) -1.959844090249E+02 DETOT  9.60E-09 tst  1.92E-07 PX  1.01E-03
           CYC   6 ETOT(AU) -1.959844090704E+02 DETOT -4.56E-08 tst  8.72E-09 PX  1.01E-03
          $ grep DETOT input_4th_3.out
           CYC   0 ETOT(AU) -1.959786174741E+02 DETOT -1.96E+02 tst  0.00E+00 PX  1.00E+00
           CYC   1 ETOT(AU) -1.959830545289E+02 DETOT -4.44E-03 tst  0.00E+00 PX  1.00E+00
           CYC   2 ETOT(AU) -1.959842257372E+02 DETOT -1.17E-03 tst  5.19E-04 PX  2.32E-02
           CYC   3 ETOT(AU) -1.959843829641E+02 DETOT -1.57E-04 tst  6.44E-04 PX  5.14E-02
           CYC   4 ETOT(AU) -1.959844090345E+02 DETOT -2.61E-05 tst  4.59E-06 PX  4.09E-03
           CYC   5 ETOT(AU) -1.959844090249E+02 DETOT  9.60E-09 tst  1.92E-07 PX  1.01E-03
           CYC   6 ETOT(AU) -1.959844090704E+02 DETOT -4.56E-08 tst  8.72E-09 PX  1.01E-03
          $ grep DETOT input_4th_4.out
           CYC   0 ETOT(AU) -1.959786174741E+02 DETOT -1.96E+02 tst  0.00E+00 PX  1.00E+00
           CYC   1 ETOT(AU) -1.959830545289E+02 DETOT -4.44E-03 tst  0.00E+00 PX  1.00E+00
           CYC   2 ETOT(AU) -1.959842257372E+02 DETOT -1.17E-03 tst  5.19E-04 PX  2.32E-02
           CYC   3 ETOT(AU) -1.959843829641E+02 DETOT -1.57E-04 tst  6.44E-04 PX  5.14E-02
           CYC   4 ETOT(AU) -1.959844090345E+02 DETOT -2.61E-05 tst  4.59E-06 PX  4.09E-03
           CYC   5 ETOT(AU) -1.959844090249E+02 DETOT  9.60E-09 tst  1.92E-07 PX  1.01E-03
           CYC   6 ETOT(AU) -1.959844090704E+02 DETOT -4.56E-08 tst  8.72E-09 PX  1.01E-03
          

          I suspect that the behavior you're observing might be related to differences in the OpenMP and/or MKL library versions used in your environment. In my experience, such differences can influence numerical results to some extent (especially in delicate systems like metals treated with HF, where convergence can be sensitive to small perturbations or numerical noise).

          Giacomo Ambrogio, PhD Student
          Department of Chemistry - University of Torino
          V. Giuria 5, 10125 Torino (Italy)

          1 Reply Last reply
          2
          • fabio.dellasalaundefined Offline
            fabio.dellasalaundefined Offline
            fabio.dellasala
            wrote last edited by
            #5

            Hi, can you run the following (shortened and corrected) script ?
            I'm using the latest version of libraries: the serial and MPI versions work perfectly, but the OpenMP
            results are not reproducible.
            As the error is likely to be uninitialized variables, it depends on the memory, thus in some cases it can work, but in general it fails, like in this script.

            #!/bin/bash

            cry="$CRY23_EXEDIR/dev/crystalOMP"

            export OMP_NUM_THREADS=16

            echo "------------RUN DIRECT------------"
            if [ ! -e inputhf.out.1 ]; then
            $cry < inputhf.d12 | tee inputhf.out.1
            fi

            if [ ! -e inputhf.out.2 ]; then
            $cry < inputhf.d12 | tee inputhf.out.2
            fi

            echo "THIS MUST BE THE SAME TO ALL DIGITS, BUT IT IS NOT !"
            grep DETOT inputhf.out.1 >de1.dat
            grep DETOT inputhf.out.2 >de2.dat
            diff de1.dat de2.dat

            if [ ! -e inputhfNC.out.1 ]; then
            $cry < inputhfNC.d12 | tee inputhfNC.out.1
            fi

            if [ ! -e inputhf.out.3 ]; then
            $cry < inputhf.d12 | tee inputhf.out.3
            fi

            echo "THIS MUST BE THE SAME TO ALL DIGITS, BUT IT IS NOT !"
            grep DETOT inputhf.out.3 >de3.dat

            diff de1.dat de3.dat

            1 Reply Last reply
            0
            • Aleksundefined Offline
              Aleksundefined Offline
              Aleks Expert
              wrote last edited by
              #6

              Hi Fabio,
              Do you obtain the same behaviour with any other system? Perhaps you can try running one of the well known test cases (from the test subset or tutorials) such as the classic MgO or urea, alternatively KMnF3 for a more complex and/or magnetic system. This way one can rule out the system itself and move to compilation issues, such as those that Giacomo mentioned.
              Cheers,
              Aleks

              Aleksandar Živković, Scientific Assistant
              Department of Earth and Environmental Sciences, LMU Munich, Germany
              Crystal enthusiast

              1 Reply Last reply
              0
              • GiacomoAmbrogioundefined Offline
                GiacomoAmbrogioundefined Offline
                GiacomoAmbrogio Developer
                wrote last edited by GiacomoAmbrogio
                #7

                Hi Fabio,

                After an in-depth investigation, I can confirm that the behavior you're observing is definitely not due to uninitialized variables. The most likely explanation is the use of non-deterministic summation in reduction operations, which is a known behavior when running floating-point operations with OpenMP parallelization.

                In OpenMP, when large arrays or matrices are reduced (e.g., summing elements of the density or Fock matrices), the order of operations can change depending on how the threads are scheduled at runtime. Since floating-point addition is not associative, this can lead to slightly different results depending on the number of threads, their distribution, or the memory layout at runtime, even if the input is exactly the same.

                To confirm this, I ran your script using the public OpenMP build, with both 16 threads and a single thread. Results are reported below.

                • In the parallel run, there are slight differences in the final energy, but the magnitude of the difference is negligible, well within the SCF convergence tolerance.
                • In the single-thread run (i.e., serial execution with OpenMP), the results are fully reproducible, with no differences at all.

                As expected, the more threads you use, the larger these differences may become, although they typically remain very small.

                Increasing the number of threads likely increases the occurence for these small numerical differences, although they typically remain insignificant in practice. For this reason, we recommend limiting the number of OpenMP threads to 4, and primarily use MPI for more extended parallelism, which generally offers a better balance between performance and numerical stability.

                Additionally, it's good practice in our workflow to run each job in a clean, isolated SCRATCH folder, to avoid any unintended interference from leftover files or cached data. Your current script doesn’t do this, so we recommend updating it to clean or define a separate working directory for each run.

                 Energies at each SCF Cycle     -     Copper bulk
                 -----------------------------------------------------------------------------
                                         OpenMP 16 th
                 -----------------------------------------------------------------------------
                                 run 1                 run 2                 diff
                 CYC   0  -1.959786174741E+02   -1.959786174741E+02    0.0
                 CYC   1  -1.959830545289E+02   -1.959830545289E+02    0.0
                 CYC   2  -1.959842257372E+02   -1.959842257372E+02    0.0
                 CYC   3  -1.959843727892E+02   -1.959843727892E+02    0.0
                 CYC   4  -1.959844091796E+02   -1.959844091796E+02    0.0
                 CYC   5  -1.959844090374E+02   -1.959844090374E+02    0.0
                 CYC   6  -1.959844090603E+02   -1.959844077969E+02   -1.2634000086109154e-06
                 CYC   7  -1.959844090705E+02   -1.959844090714E+02    9.000018508231733e-10
                 CYC   8                        -1.959844090772E+02    6.700020094285719e-09
                 CYC   9                        -1.959844090706E+02    1.000159954855917e-10
                
                 -----------------------------------------------------------------------------
                                         Open MP 1 th
                 -----------------------------------------------------------------------------
                                 run 1                 run 2                 diff
                 CYC   0  -1.959786174741E+02   -1.959786174741E+02    0.0
                 CYC   1  -1.959830545289E+02   -1.959830545289E+02    0.0
                 CYC   2  -1.959842257372E+02   -1.959842257372E+02    0.0
                 CYC   3  -1.959843727892E+02   -1.959843727892E+02    0.0
                 CYC   4  -1.959844091796E+02   -1.959844091796E+02    0.0
                 CYC   5  -1.959844090374E+02   -1.959844090374E+02    0.0
                 CYC   6  -1.959844090603E+02   -1.959844090603E+02    0.0
                 CYC   7  -1.959844090705E+02   -1.959844090705E+02    0.0
                
                
                

                Giacomo Ambrogio, PhD Student
                Department of Chemistry - University of Torino
                V. Giuria 5, 10125 Torino (Italy)

                1 Reply Last reply
                2
                • fabio.dellasalaundefined Offline
                  fabio.dellasalaundefined Offline
                  fabio.dellasala
                  wrote last edited by
                  #8

                  Hi, many thx for your tests.

                  However:

                  i) I always used
                  KMP_DETERMINISTIC_REDUCTION="true"
                  and the problem still remains...
                  maybe crystalOMP is not complied with the required precision on floating operation.

                  ii) reduction is also used in MPI , but in MPI results are the same to ALL DIGITS.

                  iii) in many other cases I found error up to 10^-5 i.e. 100-1000 larger than any numerical threshold
                  on convergence, which cannot be due to roundoff error in reduction.

                  bye
                  fabio

                  1 Reply Last reply
                  0
                  • GiacomoAmbrogioundefined Offline
                    GiacomoAmbrogioundefined Offline
                    GiacomoAmbrogio Developer
                    wrote last edited by
                    #9

                    Hi Fabio,
                    A few clarifications:

                    i) I'm aware that setting KMP_DETERMINISTIC_REDUCTION=true can help, but in practice, it doesn't guarantee reproducibility on its own. Even building crystalOMP with stricter floating-point flags like -fp-model precise (or equivalent) doesn't always lead to fully consistent results, at least not in my experience with CRYSTAL.

                    ii) Yes, MPI reductions are typically deterministic, as most implementations use pairwise summation or other stable schemes. OpenMP, on the other hand, can still introduce variability due to threading and compiler optimizations.

                    iii) If you're seeing discrepancies around 1e-5, it's plausible that small numerical differences from OpenMP reductions are enough to drive the SCF, especially in extremely delicate cases like metallic systems treated at the HF level, toward slightly different solutions.

                    At this point, I’ve investigated what I could from our end. If maximum reproducibility is essential, I strongly recommend sticking with MPI-only parallelism.

                    Have a great day,
                    Giacomo

                    Giacomo Ambrogio, PhD Student
                    Department of Chemistry - University of Torino
                    V. Giuria 5, 10125 Torino (Italy)

                    1 Reply Last reply
                    2

                    Powered by Crystal Solutions
                    • Login

                    • Don't have an account? Register

                    • Login or register to search.
                    • First post
                      Last post
                    0
                    • Home
                    • Recent