MPI_Barrier


JoelFrederico
Posts: 60
Joined: 05 Aug 2010, 11:32
Location: SLAC National Accelerator Laboratory

MPI_Barrier

Post by JoelFrederico » 06 May 2011, 15:53

I've noticed there's a strange crash at the end of this simulation:

Code: Select all

=====================================================================================
Thanks for using Pelegant.  Please cite the following references in your publications:
  M. Borland, "elegant: A Flexible SDDS-Compliant Code for Accelerator Simulation,"
  Advanced Photon Source LS-287, September 2000.
  Y. Wang and M. Borland, "Pelegant: A Parallel Accelerator Simulation Code for  
  Electron Generation and Tracking, Proceedings of the 12th Advanced Accelerator  
  Concepts Workshop, 2006.
If you use a modified version, please indicate this in all publications.
=====================================================================================
*** An error occurred in MPI_Barrier
*** after MPI was finalized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
*** An error occurred in MPI_Barrier
*** after MPI was finalized
[oak068:26519] Abort after MPI_FINALIZE completed successfully; not able to guarantee that all other processes were killed!
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[oak068:26520] Abort after MPI_FINALIZE completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 26518 on
node oak068 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

TID  HOST_NAME    COMMAND_LINE            STATUS            TERMINATION_TIME
==== ========== ================  =======================  ===================
0001 oak068     mympirun_wrapper  Exit (1)                 05/06/2011 13:47:48
0002 oak068     mympirun_wrapper  Exit (1)                 05/06/2011 13:47:48
It seems to have something to do with the run_control settings (possibly first_is_fiducial), although I can't seem to figure out exactly what's triggering it. It doesn't seem to be a huge problem as far as the sim goes, since I haven't had a problem with the final data being written, but my guess is it's not cleaning up correctly.

http://www.stanford.edu/~joelfred/Peleg ... ier.tar.gz
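
For reference, the abort messages above are what Open MPI prints when an MPI routine is invoked after MPI_Finalize and the default MPI_ERRORS_ARE_FATAL handler kicks in. A minimal C sketch (illustrative only, not taken from the Pelegant source) that reproduces the same kind of abort:

Code: Select all

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* ... tracking and output would happen here ... */

    MPI_Finalize();

    /* Erroneous: MPI has already been finalized, so this call triggers
       "An error occurred in MPI_Barrier ... after MPI was finalized"
       and the job aborts under MPI_ERRORS_ARE_FATAL. */
    MPI_Barrier(MPI_COMM_WORLD);

    return 0;
}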

ywang25
Posts: 52
Joined: 10 Jun 2008, 19:48

Re: MPI_Barrier

Post by ywang25 » 06 May 2011, 16:06

Joel,

It appears memory is not being handled properly somewhere in the code.
Can you check the file you uploaded? I downloaded it and got 0 bytes.

Thanks,

Yusong

JoelFrederico
Posts: 60
Joined: 05 Aug 2010, 11:32
Location: SLAC National Accelerator Laboratory

Re: MPI_Barrier

Post by JoelFrederico » 06 May 2011, 16:14

Sorry, went over my quota. It should be there now.

ywang25
Posts: 52
Joined: 10 Jun 2008, 19:48

Re: MPI_Barrier

Post by ywang25 » 11 May 2011, 08:56

Joel,

I tested your example on two clusters, both with Red Hat 4.1.2 installed. One test used MVAPICH2 1.4.0rc1 over an Infiniband network and the other used MPICH2 version 1.2.1. The problem you described didn't show up in either case. I also ran the example under a memory debugger, and it didn't report any problem for this simple simulation.

In both tests, Pelegant was built natively on the cluster it ran on. I am afraid there could be a minor portability issue if the environment you run in is not the same as the one where Pelegant was built.

Yusong

JoelFrederico
Posts: 60
Joined: 05 Aug 2010, 11:32
Location: SLAC National Accelerator Laboratory

Re: MPI_Barrier

Post by JoelFrederico » 17 May 2011, 18:16

Thanks, Yusong,

Pelegant was compiled for this system; it uses NFS for the file system and OpenMPI with Infiniband for communication. Since it's not causing problems in the simulation as far as I can tell, I won't worry about it. The crash seems reproducible with the given settings, but I haven't really invested time in finding which setting causes it. I'll let you know if it becomes a problem or if I find a pattern.
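
If it ever does need fixing, a common defensive pattern is to guard collectives issued from cleanup or exit paths with MPI_Initialized and MPI_Finalized, so a late barrier is skipped instead of aborting the job. A minimal sketch only, using a hypothetical safe_barrier() helper; this is not Pelegant's actual cleanup code:

Code: Select all

#include <mpi.h>

/* Hypothetical helper: call MPI_Barrier only while MPI is actually live. */
void safe_barrier(MPI_Comm comm)
{
    int initialized = 0, finalized = 0;

    MPI_Initialized(&initialized);
    MPI_Finalized(&finalized);

    if (initialized && !finalized)
        MPI_Barrier(comm);
}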

Joel
