Terminated by SIGSEGV / Program trace-back, when generating a bunch of electrons

Post by soliday » 28 Oct 2020, 08:51

On RHEL 7, the mpich package from Red Hat is version 3.0; that is the one in /usr/lib64/mpich. If you are using MPICH2, then you will want to use the Build-AOP-RPMs script to build Pelegant against the MPICH2 libraries.

As for NFS, it depends on how you are running this. If this is a cluster, then the cluster administrator will have to change it for you or point you to a different drive that is built for parallel I/O. We use the Lustre file system on our cluster. If, however, you are running it on your office workstation, then you may not be able to change it. In that case, I would try writing to the local disk and then transferring the output to the NFS drive after the run is done. If you must have Pelegant output to an NFS drive, then try one or both of the following options:

Code:

&global_settings
mpi_io_force_file_sync=1,
&end

or

Code:

&global_settings
usleep_mpi_io_kludge = 100
&end
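
The local-disk suggestion above can be sketched as a small shell helper: run the job in a scratch directory on a local disk, and copy the output to the NFS directory only after the run has finished. This is a minimal sketch and not part of elegant; the function name run_local_then_copy is hypothetical, and so are any paths you pass to it.

```shell
# Hypothetical helper (not part of elegant).
# usage: run_local_then_copy SCRATCH_DIR DEST_DIR COMMAND [ARGS...]
run_local_then_copy() {
    scratch=$1
    dest=$2
    shift 2
    mkdir -p "$scratch" "$dest" || return 1
    # Run the command with the scratch directory as working directory, so
    # output files given by relative paths are written to the local disk.
    ( cd "$scratch" && "$@" ) || return 1
    # Copy the contents of the scratch directory only after a successful run.
    cp -R "$scratch"/. "$dest"/
}
```

A call might look like "run_local_then_copy /local/run1 /path/on/nfs/run1 mpirun -np 12 Pelegant test_gen.ele", where both directories are placeholders. Because the command runs inside the scratch directory, input files referenced by relative paths need to be copied there first.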

Post by jcytsai » 29 Oct 2020, 02:19

Dear Dr. Soliday,

Thank you for providing the options. Yes, I use MPICH2 and used the Build-AOP-RPMs script to build Pelegant. I am running elegant on an office workstation. I tried both of the options you provided (the first takes much longer than I expected), but both fail to generate a complete *.bun file. What do you mean by writing to the local disk and transferring the file to NFS after the simulation is done? I am not very familiar with file I/O, but it seems there may still be a way to resolve this issue (re-install Pelegant, or compile from source?).

Thanks,
Cheng-Ying

Post by soliday » 29 Oct 2020, 08:37

Does your office workstation have a local disk drive that you can write to? If you run "df" you will see all the mounted drives. Look for entries that start with /dev. On my computer I can write to the local disk drive in the /local directory.

Code:

Filesystem                    1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_root-lv_local  426836348 343265088  83571260  81% /local

Can you send me the top part of the .bun file? The header should be ASCII and can be read in a text editor. sddscheck said the header was corrupted. I'd like to see what is wrong with it.
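
Since the header is ASCII and ends at the "&data" line, one way to pull out just that part without opening the binary payload in an editor is a one-line sed filter. This is only a sketch; sdds_header is a hypothetical helper name, not an SDDS toolkit command.

```shell
# Hypothetical helper (not an SDDS toolkit program).
# usage: sdds_header FILE
# Prints every line up to and including the first line that starts with
# "&data", then stops, so the binary data that follows is not printed.
sdds_header() {
    LC_ALL=C sed '/^&data/q' "$1"
}
```

For example, "sdds_header run.bun > header.txt" (file names are placeholders). If the header is damaged badly enough that no "&data" line exists, sed will not stop early and the binary data will be printed as well, so redirecting to a file is safer than printing to the terminal.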

Post by jcytsai » 29 Oct 2020, 08:57

Dear Dr. Soliday,

Thanks for the prompt reply. When I run "df", I get

Code:

Filesystem               1K-blocks      Used  Available Use% Mounted on
/dev/mapper/centos-root 1898056188 108805484 1789250704   6% /
devtmpfs                 131761712         0  131761712   0% /dev
tmpfs                    131779028     24492  131754536   1% /dev/shm
tmpfs                    131779028    970504  130808524   1% /run
tmpfs                    131779028         0  131779028   0% /sys/fs/cgroup
/dev/nvme0n1p2             1038336    199936     838400  20% /boot
/dev/nvme0n1p1             1046516     11428    1035088   2% /boot/efi
/dev/sda1               7752282540     94240 7361470708   1% /data

The last one is an external drive.

The top few lines of the *.bun file show

Code:

SDDS1
!# little-endian
&end
&parameter name=Step, description="Simulation step", type=long, &end
&parameter name=pCentral, symbol="p$bcen$n", units="m$be$nc", description="Reference beta*gamma", type=double, &end
&parameter name=Charge, units=C, description="Bunch charge before sampling", type=double, &end
&parameter name=Particles, description="Number of particles before sampling", type=long, &end
&parameter name=IDSlotsPerBunch, description="Number of particle ID slots reserved to a bunch", type=long, &end
&parameter name=SVNVersion, description="SVN version number", type=string, fixed_value=27408M, &end
&column name=x, units=m, type=double,  &end
&column name=xp, symbol=x', type=double,  &end
&column name=y, units=m, type=double,  &end
&column name=yp, symbol=y', type=double,  &end
&column name=t, units=s, type=double,  &end
&column name=p, units="m$be$nc", type=double,  &end
&column name=particleID, type=long,  &end
&data mode=binary, &end

Thanks,
Cheng-Ying

Post by jcytsai » 06 Nov 2020, 00:54

Dear Dr. Soliday,

I made several attempts but still failed to resolve this issue. I tried

Code:

mount -o remount,noac /

with no success:

Code:

mount: / not mounted or bad option

and it also shows

Code:

[1478174.199284] XFS (dm-0): unknown mount option [noac].

I tried to write the output files to another drive (with an ext4 file system) and got

Code:

MPI_File_open failed: Access denied to file, error stack:
(unknown)(): Access denied to file
rank 0 in job 1  centos_36644   caused collective abort of all ranks
  exit status of rank 0: return code 1

I am not familiar with parallel I/O; are there possible ways to solve this? (I found that the file system is XFS rather than NFS; would this make a difference for this issue?)

Thanks,
Cheng-Ying

Post by soliday » 06 Nov 2020, 10:57

The noac option is specific to NFS, not XFS.
I just took the original files you posted and ran them on an office computer on an XFS-mounted drive. The results were not corrupted for me.

My launch command was:
/usr/lib64/mpich-3.2/bin/mpirun -np 12 Pelegant test_gen.ele

The xfs drive is mounted with the following options:
(rw,relatime,attr2,inode64,noquota)
You can check yours with the command "mount | grep /data"
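
The same check can be wrapped in a small filter that prints the file system type along with the options, which also tells you whether an NFS-only option such as noac could apply at all. This is a rough sketch: mount_info and is_nfs_type are hypothetical helper names, and the parsing assumes the usual Linux "mount" output format with no spaces in the paths.

```shell
# Hypothetical helpers (not part of any package).

# usage: mount | mount_info MOUNTPOINT
# Reads "mount" output on stdin and prints the file system type and the
# mount options of MOUNTPOINT. Assumes lines of the form
# "DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)".
mount_info() {
    awk -v mp="$1" '$2 == "on" && $3 == mp && $4 == "type" { print $5, $6 }'
}

# usage: is_nfs_type FSTYPE
# Succeeds only for NFS types, where the "noac" option is meaningful.
is_nfs_type() {
    case "$1" in
        nfs|nfs4) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, "mount | mount_info /data" prints one line with the type followed by the options in parentheses, and the first word of that line can be passed to is_nfs_type.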

I would suggest that you try installing the system packages:
mpich-3.2-devel-3.2-2.el7.x86_64
mpich-3.2-3.2-2.el7.x86_64

And then installing:
elegant-2020.4.0-5.rhel.7.mpich.3.2.x86_64.rpm

Then try running:
/usr/lib64/mpich-3.2/bin/mpirun -np 12 /usr/bin/Pelegant test_gen.ele

I suspect there is something broken in the MPI version you are using.
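
One way to see which MPI library a given Pelegant binary actually picks up, assuming the binary is dynamically linked against MPI (which may not hold for every build), is to filter the output of ldd. The helper name mpi_libs_from_ldd is hypothetical; this is a sketch rather than an official diagnostic.

```shell
# Hypothetical helper (not part of elegant or MPICH).
# usage: ldd /usr/bin/Pelegant | mpi_libs_from_ldd
# Reads "ldd" output on stdin and prints only the lines whose library name
# contains "libmpi" (this matches names such as libmpi.so or libmpich.so),
# with the leading whitespace removed.
mpi_libs_from_ldd() {
    awk '$1 ~ /libmpi/ { sub(/^[ \t]+/, ""); print }'
}
```

If the path printed after "=>" does not belong to the same MPI installation as the mpirun used to launch the job, that mismatch is worth ruling out first. No output means either a static build or a binary that is not linked against MPI.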

Post by jcytsai » 07 Nov 2020, 03:31

Dear Dr. Soliday,

Following your suggestion, it works now, finally!

Thanks!
Cheng-Ying
