OpenMP problem

fabio.dellasala

Hi, i'm running crystal OpenMP for some Hartree Fock periodic calculations.
When I run crystal OpenMP from a script (several times , sweeping over the basis set), I get
a strange behaviour.

when I run directly I get (e.g.)
....
CYC 5 ETOT(AU) -1.959844090722E+02 DETOT 1.01E-07 tst 1.86E-07 PX 1.17E-03
CYC 6 ETOT(AU) -1.959844090642E+02 DETOT 7.93E-09 tst 1.14E-08 PX 2.23E-04
CYC 7 ETOT(AU) -1.959844090650E+02 DETOT -7.24E-10 tst 1.19E-09 PX 2.23E-04

when I run from the script I get
CYC 5 ETOT(AU) -1.959844090722E+02 DETOT 1.01E-07 tst 1.86E-07 PX 1.17E-03
CYC 6 ETOT(AU) -1.959844090642E+02 DETOT 7.93E-09 tst 1.14E-08 PX 2.23E-04
CYC 7 ETOT(AU) -1.959844078015E+02 DETOT 1.26E-06 tst 1.19E-09 PX 2.23E-04

thus everyting is the same BUT the last cycle in which the energy increases !!!

In particular, it is the BIELET ZONE that differs

< ::: BIELET ZONE E-E 9.2053136238046E+02
< ::: TOTAL E-E 5.2408459522505E+01

::: BIELET ZONE E-E 9.2053136364389E+02
::: TOTAL E-E 5.2408460785929E+01

I SUSPECT A PROBLEM IN INITIALIZATION OF SOME OPENMP VARIABILE
BECAUSE THE VALUE OF THE FINAL TOTAL ENERGY ALSO DEPENDS ON HOW MANY TIMES
I RUN THE SCRIPT !!!

or which can be the explanation ?

input follows.

----------input---------------
Copper bulk
CRYSTAL
0 0 0
225
3.4
1
229 0. 0. 0.
LATVEC
40000
END
229 12
INPUT
19. 0 2 2 2 0 0
30.220000 355.770158 0
13.190000 70.865357 0
33.130000 233.891976 0
13.220000 53.947299 0
38.420000 -31.272165 0
13.260000 -2.741104 0
0 0 3 2. 1.
2.76963200000E+01 2.31132000000E-01
1.35053500000E+01 -6.56811000000E-01
8.81535500000E+00 -5.45875000000E-01
0 0 1 1. 1. *
6.000E+00 1.0
0 0 1 0. 1. *
2.000E+00 1.0
0 0 1 0. 1. *
4.000E-01 1.0
0 0 1 0. 1. *
2.000E-01 1.0
0 2 2 6. 1.
9.35043270000E+01 2.28290000000E-02
1.62854640000E+01 -1.00951300000E+00
0 2 1 0. 1. *
7.000E+00 1.0
0 2 1 0. 1. *
2.000E+00 1.0
0 2 1 0. 1. *
8.000E-01 1.0
0 3 4 10. 1.
4.12250060000E+01 4.46940000000E-02
1.23432500000E+01 2.12106000000E-01
4.20192000000E+00 4.53423000000E-01
1.37982500000E+00 5.33465000000E-01
0 3 1 0. 1. *
8.000E-01 1.0
0 3 1 0. 1. *
4.000E-01 1.0
99 0
END
TOLDEE
7
MAXCYCLE
25
SHRINK
24 48
DIIS
SLOSHING
PRTDIIS
HISTDIIS
15
TOLINTEG
9 9 9 15 45
ILASIZE
20000
FMIXING
45
FIXINDEX
END
GEOM
CRYSTAL
0 0 0
225
3.60
1
229 0. 0. 0.
LATVEC
40000
END

GiacomoAmbrogio

Hi Fabio,
Could you please also send both outputs (from the OMP and non-OMP calculations)?
This is a strange behaviour. In the meantime, I will run some tests.

fabio.dellasala

you can run the following script (on a single server or notebook)
with the input attached.
You will see that serial run are the same (as it must be)
whereas OMP runs differs (and it cannot be ! )
I can send you my outputs, which is your email ?

#!/bin/bash

cryser="/homex/atom/ATOMSOFT/CRYSTAL/NEWMPI2/bin/Linux-ifort_i64_OMP/dev/crystal"
echo "----------RUN SERIAL----------"
if [ ! -e inputhfser.out.1 ]; then
$cryser < inputhf.d12 | tee inputhfser.out.1
fi

if [ ! -e inputhfser.out.2 ]; then
$cryser < inputhf.d12 | tee inputhfser.out.2
fi

diff inputhfser.out.1 inputhfser.out.2 | grep -v TTT
if [ ! -e inputhfserNC.out.1 ]; then
$cryser < inputhfNC.d12 | tee inputhfserNC.out.1
fi

if [ ! -e inputhfser.out.3 ]; then
$cryser < inputhf.d12 | tee inputhfser.out.3
fi

diff inputhfser.out.3 inputhfser.out.1 | grep -v TTT

cry="/homex/atom/ATOMSOFT/CRYSTAL/OMPH/bin/Linux-ifort_i64_OMP/dev/crystalOMP"
export OMP_NUM_THREADS=16

$cry < inputhfnodiis.d12 | tee inputhf.out.1
$cry < inputhfnodiis.d12 | tee inputhf.out.2
exit
echo "------------RUN DIRECT------------"
if [ ! -e inputhf.out.1 ]; then
$cry < inputhf.d12 | tee inputhf.out.1
fi

if [ ! -e inputhf.out.2 ]; then
$cry < inputhf.d12 | tee inputhf.out.2
fi

echo "THIS MUST BE THE SAME TO ALL DIGITS, BUT IT IS NOT !"
grep DETOT inputhf.out.1 >de1.dat
grep DETOT inputhf.out.2 >de2.dat
diff de1.dat de2.dat

if [ ! -e inputhfNC.out.1 ]; then
$cry < inputhfNC.d12 | tee inputhfNC.out.1
fi

if [ ! -e inputhf.out.3 ]; then
$cry < inputhf.d12 | tee inputhf.out.3
fi

echo "THIS MUST BE THE SAME TO ALL DIGITS, BUT IT IS NOT !"
grep DETOT inputhf.out.3 >de3.dat

diff de1.dat de3.dat

echo "=====================run with script===================="

if [ ! -e inputhf.out.O1 ]; then
runcry23OMP 16 inputhf
mv inputhf.out inputhf.out.O1
fi

if [ ! -e inputhf.out.O2 ]; then
runcry23OMP 16 inputhf
mv inputhf.out inputhf.out.O2
fi

grep DETOT inputhf.out.O1 >deO1.dat
grep DETOT inputhf.out.O2 >deO2.dat

echo "THIS MUST BE THE SAME TO ALL DIGITS, BUT IT IS NOT !"
diff deO1.dat deO2.dat

if [ ! -e inputhfNC.out.O1 ]; then
runcry23OMP 16 inputhfNC
mv inputhfNC.out inputhfNC.out.O1
fi

if [ ! -e inputhf.out.O3 ]; then
runcry23OMP 16 inputhf
mv inputhf.out inputhf.out.O3
fi

grep DETOT inputhf.out.O3 >deO3.dat
echo "THIS MUST BE THE SAME TO ALL DIGITS, BUT IT IS NOT !"
diff deO1.dat deO3.dat

with
inputhf.d12
lithium bcc
CRYSTAL
0 0 0
229
3.25
1
3 0. 0. 0.
LATVEC
20000
END
3 8
0 0 6 2. 1.
6.26926280100E+03 2.05409688260E-04
9.40316124310E+02 1.59165540890E-03
2.14221075280E+02 8.28698297070E-03
6.07598401840E+01 3.38563742490E-02
1.99151520320E+01 1.11032258760E-01
7.31715097970E+00 2.74493833290E-01
0 0 2 1. 1.
2.97246742160E+00 2.37924564110E-01
1.26398523140E+00 3.07654119240E-01
0 0 1 0. 1. *
5E-01 1.0
0 0 1 0. 1. *
2E-01 1.0
0 0 1 0. 1. *
7E-02 1.0
0 2 1 0. 1. *
5E+00 1.0
0 2 1 0. 1. *
5E-01 1.0
0 2 1 0. 1. *
2E-01 1.0
99 0
END
TOLDEE
7
MAXCYCLE
25
SHRINK
24 48
SLOSHING
PRTDIIS
HISTDIIS
15
TOLINTEG
9 9 9 15 45
FMIXING
45
FIXINDEX
END
GEOM
CRYSTAL
0 0 0
229
3.443000000
1
3 0. 0. 0.
LATVEC
20000
END

and
inputhfNC.d12

lithium bcc
CRYSTAL
0 0 0
229
3.25
1
3 0. 0. 0.
LATVEC
20000
END
3 8
0 0 6 2. 1.
6.26926280100E+03 2.05409688260E-04
9.40316124310E+02 1.59165540890E-03
2.14221075280E+02 8.28698297070E-03
6.07598401840E+01 3.38563742490E-02
1.99151520320E+01 1.11032258760E-01
7.31715097970E+00 2.74493833290E-01
0 0 2 1. 1.
2.97246742160E+00 2.37924564110E-01
1.26398523140E+00 3.07654119240E-01
0 0 1 0. 1. *
9E-01 1.0
0 0 1 0. 1. *
2E-01 1.0
0 0 1 0. 1. *
5E-02 1.0
0 2 1 0. 1. *
1E+01 1.0
0 2 1 0. 1. *
7E-01 1.0
0 2 1 0. 1. *
2E-01 1.0
99 0
END
TOLDEE
7
MAXCYCLE
25
SHRINK
24 48
SLOSHING
PRTDIIS
HISTDIIS
15
TOLINTEG
9 9 9 15 45
FMIXING
45
FIXINDEX
END
GEOM
CRYSTAL
0 0 0
229
3.443000000
1
3 0. 0. 0.
LATVEC
20000
END

GiacomoAmbrogio

Hi Fabio,
Regarding your initial question, I ran the public OpenMP version of the code (using 4 threads) multiple times with your input, and I consistently obtain the same results across runs. The SCF procedure converges smoothly in 6 cycles, as shown below:

$ grep DETOT input_4th.out
 CYC   0 ETOT(AU) -1.959786174741E+02 DETOT -1.96E+02 tst  0.00E+00 PX  1.00E+00
 CYC   1 ETOT(AU) -1.959830545289E+02 DETOT -4.44E-03 tst  0.00E+00 PX  1.00E+00
 CYC   2 ETOT(AU) -1.959842257372E+02 DETOT -1.17E-03 tst  5.19E-04 PX  2.32E-02
 CYC   3 ETOT(AU) -1.959843829641E+02 DETOT -1.57E-04 tst  6.44E-04 PX  5.14E-02
 CYC   4 ETOT(AU) -1.959844090345E+02 DETOT -2.61E-05 tst  4.59E-06 PX  4.09E-03
 CYC   5 ETOT(AU) -1.959844090249E+02 DETOT  9.60E-09 tst  1.92E-07 PX  1.01E-03
 CYC   6 ETOT(AU) -1.959844090704E+02 DETOT -4.56E-08 tst  8.72E-09 PX  1.01E-03
$ grep DETOT input_4th_2.out
 CYC   0 ETOT(AU) -1.959786174741E+02 DETOT -1.96E+02 tst  0.00E+00 PX  1.00E+00
 CYC   1 ETOT(AU) -1.959830545289E+02 DETOT -4.44E-03 tst  0.00E+00 PX  1.00E+00
 CYC   2 ETOT(AU) -1.959842257372E+02 DETOT -1.17E-03 tst  5.19E-04 PX  2.32E-02
 CYC   3 ETOT(AU) -1.959843829641E+02 DETOT -1.57E-04 tst  6.44E-04 PX  5.14E-02
 CYC   4 ETOT(AU) -1.959844090345E+02 DETOT -2.61E-05 tst  4.59E-06 PX  4.09E-03
 CYC   5 ETOT(AU) -1.959844090249E+02 DETOT  9.60E-09 tst  1.92E-07 PX  1.01E-03
 CYC   6 ETOT(AU) -1.959844090704E+02 DETOT -4.56E-08 tst  8.72E-09 PX  1.01E-03
$ grep DETOT input_4th_3.out
 CYC   0 ETOT(AU) -1.959786174741E+02 DETOT -1.96E+02 tst  0.00E+00 PX  1.00E+00
 CYC   1 ETOT(AU) -1.959830545289E+02 DETOT -4.44E-03 tst  0.00E+00 PX  1.00E+00
 CYC   2 ETOT(AU) -1.959842257372E+02 DETOT -1.17E-03 tst  5.19E-04 PX  2.32E-02
 CYC   3 ETOT(AU) -1.959843829641E+02 DETOT -1.57E-04 tst  6.44E-04 PX  5.14E-02
 CYC   4 ETOT(AU) -1.959844090345E+02 DETOT -2.61E-05 tst  4.59E-06 PX  4.09E-03
 CYC   5 ETOT(AU) -1.959844090249E+02 DETOT  9.60E-09 tst  1.92E-07 PX  1.01E-03
 CYC   6 ETOT(AU) -1.959844090704E+02 DETOT -4.56E-08 tst  8.72E-09 PX  1.01E-03
$ grep DETOT input_4th_4.out
 CYC   0 ETOT(AU) -1.959786174741E+02 DETOT -1.96E+02 tst  0.00E+00 PX  1.00E+00
 CYC   1 ETOT(AU) -1.959830545289E+02 DETOT -4.44E-03 tst  0.00E+00 PX  1.00E+00
 CYC   2 ETOT(AU) -1.959842257372E+02 DETOT -1.17E-03 tst  5.19E-04 PX  2.32E-02
 CYC   3 ETOT(AU) -1.959843829641E+02 DETOT -1.57E-04 tst  6.44E-04 PX  5.14E-02
 CYC   4 ETOT(AU) -1.959844090345E+02 DETOT -2.61E-05 tst  4.59E-06 PX  4.09E-03
 CYC   5 ETOT(AU) -1.959844090249E+02 DETOT  9.60E-09 tst  1.92E-07 PX  1.01E-03
 CYC   6 ETOT(AU) -1.959844090704E+02 DETOT -4.56E-08 tst  8.72E-09 PX  1.01E-03

I suspect that the behavior you're observing might be related to differences in the OpenMP and/or MKL library versions used in your environment. In my experience, such differences can influence numerical results to some extent (especially in delicate systems like metals treated with HF, where convergence can be sensitive to small perturbations or numerical noise).

fabio.dellasala

Hi, can you run the following (shortened and corrected) script ?
I'm using the latest version of libraries: the serial and MPI versions work perfectly, but the OpenMP
results are not reproducible.
As the error is likely to be uninitialized variables, it depends on the memory, thus in some cases it can work, but in general it fails, like in this script.

#!/bin/bash

cry="$CRY23_EXEDIR/dev/crystalOMP"

export OMP_NUM_THREADS=16

echo "------------RUN DIRECT------------"
if [ ! -e inputhf.out.1 ]; then
$cry < inputhf.d12 | tee inputhf.out.1
fi

if [ ! -e inputhf.out.2 ]; then
$cry < inputhf.d12 | tee inputhf.out.2
fi

echo "THIS MUST BE THE SAME TO ALL DIGITS, BUT IT IS NOT !"
grep DETOT inputhf.out.1 >de1.dat
grep DETOT inputhf.out.2 >de2.dat
diff de1.dat de2.dat

if [ ! -e inputhfNC.out.1 ]; then
$cry < inputhfNC.d12 | tee inputhfNC.out.1
fi

if [ ! -e inputhf.out.3 ]; then
$cry < inputhf.d12 | tee inputhf.out.3
fi

echo "THIS MUST BE THE SAME TO ALL DIGITS, BUT IT IS NOT !"
grep DETOT inputhf.out.3 >de3.dat

diff de1.dat de3.dat

Aleks

Hi Fabio,
Do you obtain the same behaviour with any other system? Perhaps you can try running one of the well known test cases (from the test subset or tutorials) such as the classic MgO or urea, alternatively KMnF3 for a more complex and/or magnetic system. This way one can rule out the system itself and move to compilation issues, such as those that Giacomo mentioned.
Cheers,
Aleks

GiacomoAmbrogio

Hi Fabio,

After an in-depth investigation, I can confirm that the behavior you're observing is definitely not due to uninitialized variables. The most likely explanation is the use of non-deterministic summation in reduction operations, which is a known behavior when running floating-point operations with OpenMP parallelization.

In OpenMP, when large arrays or matrices are reduced (e.g., summing elements of the density or Fock matrices), the order of operations can change depending on how the threads are scheduled at runtime. Since floating-point addition is not associative, this can lead to slightly different results depending on the number of threads, their distribution, or the memory layout at runtime, even if the input is exactly the same.

To confirm this, I ran your script using the public OpenMP build, with both 16 threads and a single thread. Results are reported below.

In the parallel run, there are slight differences in the final energy, but the magnitude of the difference is negligible, well within the SCF convergence tolerance.
In the single-thread run (i.e., serial execution with OpenMP), the results are fully reproducible, with no differences at all.

As expected, the more threads you use, the larger these differences may become, although they typically remain very small.

Increasing the number of threads likely increases the occurence for these small numerical differences, although they typically remain insignificant in practice. For this reason, we recommend limiting the number of OpenMP threads to 4, and primarily use MPI for more extended parallelism, which generally offers a better balance between performance and numerical stability.

Additionally, it's good practice in our workflow to run each job in a clean, isolated SCRATCH folder, to avoid any unintended interference from leftover files or cached data. Your current script doesn’t do this, so we recommend updating it to clean or define a separate working directory for each run.

 Energies at each SCF Cycle     -     Copper bulk
 -----------------------------------------------------------------------------
                         OpenMP 16 th
 -----------------------------------------------------------------------------
                 run 1                 run 2                 diff
 CYC   0  -1.959786174741E+02   -1.959786174741E+02    0.0
 CYC   1  -1.959830545289E+02   -1.959830545289E+02    0.0
 CYC   2  -1.959842257372E+02   -1.959842257372E+02    0.0
 CYC   3  -1.959843727892E+02   -1.959843727892E+02    0.0
 CYC   4  -1.959844091796E+02   -1.959844091796E+02    0.0
 CYC   5  -1.959844090374E+02   -1.959844090374E+02    0.0
 CYC   6  -1.959844090603E+02   -1.959844077969E+02   -1.2634000086109154e-06
 CYC   7  -1.959844090705E+02   -1.959844090714E+02    9.000018508231733e-10
 CYC   8                        -1.959844090772E+02    6.700020094285719e-09
 CYC   9                        -1.959844090706E+02    1.000159954855917e-10

 -----------------------------------------------------------------------------
                         Open MP 1 th
 -----------------------------------------------------------------------------
                 run 1                 run 2                 diff
 CYC   0  -1.959786174741E+02   -1.959786174741E+02    0.0
 CYC   1  -1.959830545289E+02   -1.959830545289E+02    0.0
 CYC   2  -1.959842257372E+02   -1.959842257372E+02    0.0
 CYC   3  -1.959843727892E+02   -1.959843727892E+02    0.0
 CYC   4  -1.959844091796E+02   -1.959844091796E+02    0.0
 CYC   5  -1.959844090374E+02   -1.959844090374E+02    0.0
 CYC   6  -1.959844090603E+02   -1.959844090603E+02    0.0
 CYC   7  -1.959844090705E+02   -1.959844090705E+02    0.0

fabio.dellasala

Hi, many thx for your tests.

However:

i) I always used
KMP_DETERMINISTIC_REDUCTION="true"
and the problem still remains...
maybe crystalOMP is not complied with the required precision on floating operation.

ii) reduction is also used in MPI , but in MPI results are the same to ALL DIGITS.

iii) in many other cases I found error up to 10^-5 i.e. 100-1000 larger than any numerical threshold
on convergence, which cannot be due to roundoff error in reduction.

bye
fabio

GiacomoAmbrogio

Hi Fabio,
A few clarifications:

i) I'm aware that setting KMP_DETERMINISTIC_REDUCTION=true can help, but in practice, it doesn't guarantee reproducibility on its own. Even building crystalOMP with stricter floating-point flags like -fp-model precise (or equivalent) doesn't always lead to fully consistent results, at least not in my experience with CRYSTAL.

ii) Yes, MPI reductions are typically deterministic, as most implementations use pairwise summation or other stable schemes. OpenMP, on the other hand, can still introduce variability due to threading and compiler optimizations.

iii) If you're seeing discrepancies around 1e-5, it's plausible that small numerical differences from OpenMP reductions are enough to drive the SCF, especially in extremely delicate cases like metallic systems treated at the HF level, toward slightly different solutions.

At this point, I’ve investigated what I could from our end. If maximum reproducibility is essential, I strongly recommend sticking with MPI-only parallelism.

Have a great day,
Giacomo

OpenMP problem

< ::: BIELET ZONE E-E 9.2053136238046E+02 < ::: TOTAL E-E 5.2408459522505E+01

< ::: BIELET ZONE E-E 9.2053136238046E+02
< ::: TOTAL E-E 5.2408459522505E+01