Moderators: cyao, michael_borland
-
JoelFrederico
- Posts: 60
- Joined: 05 Aug 2010, 11:32
- Location: SLAC National Accelerator Laboratory
Post
by JoelFrederico » 06 May 2011, 15:53
I've noticed there's a strange crash at the end of this simulation:
Code:
=====================================================================================
Thanks for using Pelegant. Please cite the following references in your publications:
M. Borland, "elegant: A Flexible SDDS-Compliant Code for Accelerator Simulation,"
Advanced Photon Source LS-287, September 2000.
Y. Wang and M. Borland, "Pelegant: A Parallel Accelerator Simulation Code for
Electron Generation and Tracking, Proceedings of the 12th Advanced Accelerator
Concepts Workshop, 2006.
If you use a modified version, please indicate this in all publications.
=====================================================================================
*** An error occurred in MPI_Barrier
*** after MPI was finalized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
*** An error occurred in MPI_Barrier
*** after MPI was finalized
[oak068:26519] Abort after MPI_FINALIZE completed successfully; not able to guarantee that all other processes were killed!
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[oak068:26520] Abort after MPI_FINALIZE completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 26518 on
node oak068 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
==== ========== ================ ======================= ===================
0001 oak068 mympirun_wrapper Exit (1) 05/06/2011 13:47:48
0002 oak068 mympirun_wrapper Exit (1) 05/06/2011 13:47:48
It seems to have something to do with the run_control settings (possibly first_is_fiducial), although I can't figure out exactly what triggers it. It doesn't seem to be a huge problem as far as the simulation goes, since the final data is still written correctly, but my guess is that something isn't cleaning up properly at exit.
http://www.stanford.edu/~joelfred/Peleg ... ier.tar.gz
-
ywang25
- Posts: 52
- Joined: 10 Jun 2008, 19:48
Post
by ywang25 » 06 May 2011, 16:06
Joel,
It appears that memory is not being handled properly somewhere in the code.
Can you check the file you uploaded? I downloaded and got 0 byte.
Thanks,
Yusong
-
JoelFrederico
- Posts: 60
- Joined: 05 Aug 2010, 11:32
- Location: SLAC National Accelerator Laboratory
Post
by JoelFrederico » 06 May 2011, 16:14
Sorry, went over my quota. It should be there now.
-
ywang25
- Posts: 52
- Joined: 10 Jun 2008, 19:48
Post
by ywang25 » 11 May 2011, 08:56
Joel,
I tested your example on two clusters, both with Red Hat 4.1.2 installed. One test used MVAPICH2 1.4.0rc1 with an InfiniBand network, and the other used MPICH2 version 1.2.1. The problem you described didn't show up in either case. I also ran a memory debugger, and it didn't report any problem for this simple simulation.
In both tests, Pelegant was built natively on the cluster. I am afraid there could be a minor portability issue if the environment you ran in is not the same as the one where Pelegant was built.
Yusong
-
JoelFrederico
- Posts: 60
- Joined: 05 Aug 2010, 11:32
- Location: SLAC National Accelerator Laboratory
Post
by JoelFrederico » 17 May 2011, 18:16
Thanks, Yusong,
Pelegant was compiled for this system; it's using NFS for the file system and OpenMPI with InfiniBand for communication. Since it's not causing problems in the simulation as far as I can tell, I won't worry about it. It seems reproducible with the given settings, but I haven't really invested time in finding which setting causes it. I'll let you know if it becomes a problem or if I find a pattern.
Joel