Occasional errors while simulating ion-effects
Posted: 07 Dec 2022, 06:25
Dear all,
When I run ion-effects simulations, I sometimes encounter the following error:
*** Error in `/dls/physics/wsw/ELEGANT2022/2022/bin/Pelegant': realloc(): invalid old size: 0x0000000001c48870 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7f474)[0x2ac4b92f6474]
/lib64/libc.so.6(+0x84861)[0x2ac4b92fb861]
/lib64/libc.so.6(realloc+0x1d2)[0x2ac4b92fce12]
/dls/physics/wsw/ELEGANT2022/2022/bin/Pelegant[0x68fd73]
(.....)
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac4b9299555]
/dls/physics/wsw/ELEGANT2022/2022/bin/Pelegant[0x40a890]
======= Memory map: ========
00400000-01265000 r-xp 00000000 00:2e 6731606190 /dls/physics/wsw/ELEGANT2022/2022/bin/Pelegant
01464000-0146c000 r--p 00e64000 00:2e 6731606190 /dls/physics/wsw/ELEGANT2022/2022/bin/Pelegant
0146c000-014d1000 rw-p 00e6c000 00:2e 6731606190 /dls/physics/wsw/ELEGANT2022/2022/bin/Pelegant
(.....)
There are several observations:
1) This occurs more often when I run the job on more cores, e.g., on all 40 cores of a whole node. When I reduce the core count to 16 or fewer, it occurs less often.
With fewer cores (e.g., 10 or 16), there is still a chance of seeing this error: when I submit the job to the cluster, it fails with the above error, but when I resubmit the same job, it succeeds.
2) This only happens with Pelegant; when I run the same job with serial elegant, there is no problem at all.
3) At first I thought it was a cluster problem, but I saw the same error when I submitted the job to another cluster, with exactly the same behavior.
4) I tried this script with different versions of ELEGANT (2021.4, 2022.1, and 2022.2) and got the same error each time.
I have attached one of my scripts below. In my simulation, I use a single ILMATRIX element and one ion-effects element per turn. Could you take a look and suggest settings I could use to avoid this error? The log files for the above error can be found in the 'logs' folder.
Many thanks,
Siwei