Pelegant performance on Win7_x64

Moderators: cyao, michael_borland

Anisimov
Posts: 8
Joined: 03 Apr 2019, 09:48

Pelegant performance on Win7_x64

Post by Anisimov » 03 Apr 2019, 10:42

I am a brand-new user of elegant and need it to study the microbunching instability of ultra-bright beams in X-ray FELs. I have .ele and .lte input files to start with.

I installed the March 7th, 2019 version of Elegant_x64.msi and then updated to the April 1st, 2019 version, on Windows 7 with MS-MPI 10, following the Pelegant instructions.

I have been disappointed by the performance with 1.25M particles:
Old Pelegant with MS-MPI 7 on an E5-2687 (8 cores) at 3.1 GHz with -n 14 completes the simulation in 6 minutes.
New Pelegant with MS-MPI 10 on a dual-socket E5-2687v4 (2 × 12 cores) at 3.0 GHz with -n 14 completes the simulation in 11 minutes.

Pushing to -n 48 gives only a modest improvement, down to 9 minutes.

Are there known MS-MPI 10 issues that might explain this performance? I have read that multiple network interfaces (NICs) can slow down MS-MPI.

Is there a way to debug Pelegant in order to see where it spends its time?

Thank you,
Petr

Here is how the run time depends on the -n option:
Elegant simulation on 1 core
Tracking step completed ET: 00:23:29 CP: 1408.46 BIO:0 DIO:0 PF:0 MEM:0

Pelegant simulation on 2 cores
Tracking step completed ET: 00:32:53 CP: 1972.50 BIO:0 DIO:0 PF:0 MEM:0

Pelegant simulation on 4 cores
Tracking step completed ET: 00:17:03 CP: 1023.05 BIO:0 DIO:0 PF:0 MEM:0

Pelegant simulation on 8 cores
Tracking step completed ET: 00:10:40 CP: 639.25 BIO:0 DIO:0 PF:0 MEM:0

Pelegant simulation on 12 cores
Tracking step completed ET: 00:11:04 CP: 663.66 BIO:0 DIO:0 PF:0 MEM:0

Pelegant simulation on 14 cores
Tracking step completed ET: 00:10:54 CP: 654.28 BIO:0 DIO:0 PF:0 MEM:0

Pelegant simulation on 16 cores
Tracking step completed ET: 00:11:56 CP: 716.02 BIO:0 DIO:0 PF:0 MEM:0

Pelegant simulation on 20 cores
Tracking step completed ET: 00:12:17 CP: 736.92 BIO:0 DIO:0 PF:0 MEM:0

Pelegant simulation on 24 cores
Tracking step completed ET: 00:12:25 CP: 745.36 BIO:0 DIO:0 PF:0 MEM:0

Pelegant simulation on 32 cores
Tracking step completed ET: 00:11:05 CP: 665.06 BIO:0 DIO:0 PF:0 MEM:0

Pelegant simulation on 48 cores
Tracking step completed ET: 00:09:09 CP: 548.50 BIO:0 DIO:0 PF:0 MEM:0
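
The timings above are easier to compare as speedup and parallel efficiency relative to the single-core run. The following is a minimal sketch, not part of elegant: it parses only the `ET:` field from lines copied out of this post, and the helper names (`elapsed_seconds`, `scaling_table`) are made up for illustration. It also assumes the serial elegant run is a fair baseline for Pelegant, which may not hold exactly.

```python
import re

# "Tracking step completed" lines copied from the post, keyed by the -n value
# (1 means the serial elegant run).
LOG = {
    1: "Tracking step completed ET: 00:23:29 CP: 1408.46 BIO:0 DIO:0 PF:0 MEM:0",
    2: "Tracking step completed ET: 00:32:53 CP: 1972.50 BIO:0 DIO:0 PF:0 MEM:0",
    4: "Tracking step completed ET: 00:17:03 CP: 1023.05 BIO:0 DIO:0 PF:0 MEM:0",
    8: "Tracking step completed ET: 00:10:40 CP: 639.25 BIO:0 DIO:0 PF:0 MEM:0",
    12: "Tracking step completed ET: 00:11:04 CP: 663.66 BIO:0 DIO:0 PF:0 MEM:0",
    14: "Tracking step completed ET: 00:10:54 CP: 654.28 BIO:0 DIO:0 PF:0 MEM:0",
    16: "Tracking step completed ET: 00:11:56 CP: 716.02 BIO:0 DIO:0 PF:0 MEM:0",
    20: "Tracking step completed ET: 00:12:17 CP: 736.92 BIO:0 DIO:0 PF:0 MEM:0",
    24: "Tracking step completed ET: 00:12:25 CP: 745.36 BIO:0 DIO:0 PF:0 MEM:0",
    32: "Tracking step completed ET: 00:11:05 CP: 665.06 BIO:0 DIO:0 PF:0 MEM:0",
    48: "Tracking step completed ET: 00:09:09 CP: 548.50 BIO:0 DIO:0 PF:0 MEM:0",
}

ET_PATTERN = re.compile(r"ET:\s*(\d+):(\d+):(\d+)")


def elapsed_seconds(line):
    """Return the ET (hh:mm:ss) field of a tracking-step line in seconds."""
    match = ET_PATTERN.search(line)
    if match is None:
        raise ValueError("no ET field found in: " + line)
    hours, minutes, seconds = (int(g) for g in match.groups())
    return 3600 * hours + 60 * minutes + seconds


def scaling_table(log):
    """Return (n, seconds, speedup, efficiency) rows relative to the n=1 run."""
    baseline = elapsed_seconds(log[1])
    rows = []
    for n in sorted(log):
        t = elapsed_seconds(log[n])
        speedup = baseline / t
        rows.append((n, t, speedup, speedup / n))
    return rows


if __name__ == "__main__":
    print(f"{'n':>3} {'ET [s]':>7} {'speedup':>8} {'efficiency':>10}")
    for n, t, speedup, efficiency in scaling_table(LOG):
        print(f"{n:>3} {t:>7} {speedup:>8.2f} {efficiency:>10.2f}")
```

Read this way, the 2-core run is slower than the serial run, and the speedup stays between roughly 2 and 2.6 from 8 to 48 processes.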

michael_borland
Posts: 1959
Joined: 19 May 2008, 09:33
Location: Argonne National Laboratory

Re: Pelegant performance on Win7_x64

Post by michael_borland » 03 Apr 2019, 11:50

Petr,

You can try setting print_statistics=1 in &run_setup to get information on where elegant is spending its time.

Is it possible you are running low on memory?

--Michael

Anisimov
Posts: 8
Joined: 03 Apr 2019, 09:48

Re: Pelegant performance on Win7_x64

Post by Anisimov » 04 Apr 2019, 14:29

Hi Michael,

My system has 128 GB of memory, the same as the other workstation I am benchmarking against, and the simulation uses less than 15 GB. So I am not running low on memory.

Yes, I ran my simulation with print_statistics=1. It does not give me enough information to diagnose why my system underperforms.

Thank you,
Petr

michael_borland
Posts: 1959
Joined: 19 May 2008, 09:33
Location: Argonne National Laboratory

Re: Pelegant performance on Win7_x64

Post by michael_borland » 04 Apr 2019, 14:35

Petr,

Another issue is that sometimes manufacturers fib about the number of cores vs the number of threads. We usually turn off threading on our machines because running more threads than the available number of cores doesn't usually help.

Filesystem performance can also be a big factor. You can try turning off large output requests (e.g., WATCH elements in coordinate mode) and see if that helps.
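
One rough way to compare filesystem speed between two workstations, independent of elegant, is to time a large sequential write in the directory where the output files go. The sketch below is only an assumption about what might be worth measuring: the function name `write_throughput_mib_s` and the default sizes are made up for illustration, and sequential write throughput does not necessarily reflect the I/O pattern of an actual Pelegant run.

```python
import os
import tempfile
import time


def write_throughput_mib_s(directory, total_mib=64, chunk_mib=4):
    """Write about total_mib MiB to a temporary file in `directory`.

    Returns the measured throughput in MiB/s, including the time needed
    to flush the data to disk.
    """
    chunk = os.urandom(chunk_mib * 1024 * 1024)
    n_chunks = max(1, total_mib // chunk_mib)
    path = None
    try:
        with tempfile.NamedTemporaryFile(dir=directory, delete=False) as f:
            path = f.name
            start = time.perf_counter()
            for _ in range(n_chunks):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # force the data out of the OS cache
            elapsed = time.perf_counter() - start
    finally:
        # Clean up the scratch file even if the write failed part-way.
        if path is not None and os.path.exists(path):
            os.remove(path)
    return n_chunks * chunk_mib / elapsed


if __name__ == "__main__":
    # Replace the system temp directory with the simulation's output directory.
    target = tempfile.gettempdir()
    print(f"{write_throughput_mib_s(target):.1f} MiB/s in {target}")
```

Running it on both machines, in the directory that receives the WATCH output, gives a first indication of whether the disks differ substantially.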

If you can supply me with your files, I can run it on one of our systems and see if I can identify the reason for the poor performance.

--Michael

Anisimov
Posts: 8
Joined: 03 Apr 2019, 09:48

Re: Pelegant performance on Win7_x64

Post by Anisimov » 11 Apr 2019, 09:36

Michael,

Thank you for your response. I agree with you regarding threads vs. cores in general. In this case, however, the performance improvement stops at 8 processes, while I have at least 24 physical cores in my dual-socket E5-2687W v4 system.
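
As a rough check on that plateau, Amdahl's law can be used to estimate what serial fraction would produce the measured 8-process time. This is a sketch under strong assumptions: the serial elegant run (23:29) is taken as the single-core baseline for Pelegant, the function names are hypothetical, and Amdahl's law ignores communication and I/O costs, so the result is only an order-of-magnitude indication.

```python
def amdahl_serial_fraction(t1, tn, n):
    """Serial fraction f implied by Amdahl's law, S(n) = 1 / (f + (1 - f) / n),
    given the run time t1 on one core and tn on n cores."""
    if n < 2:
        raise ValueError("need n >= 2 to estimate a serial fraction")
    speedup = t1 / tn
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)


def amdahl_speedup(f, n):
    """Speedup predicted by Amdahl's law for serial fraction f on n cores."""
    return 1.0 / (f + (1.0 - f) / n)


# Elapsed times from the timing list in the first post, in seconds.
T1 = 23 * 60 + 29  # serial elegant, ET 00:23:29
T8 = 10 * 60 + 40  # Pelegant with -n 8, ET 00:10:40

f = amdahl_serial_fraction(T1, T8, 8)

if __name__ == "__main__":
    print(f"implied serial fraction: {f:.3f}")
    print(f"speedup limit for large n: {1.0 / f:.2f}")
    for n in (24, 48):
        print(f"predicted speedup at n={n}: {amdahl_speedup(f, n):.2f}")
```

Under these assumptions, a bit more than a third of the run behaves as if it were serial, which would cap the speedup below 3 no matter how many cores are used. The measured 24-process run is slower than this simple model predicts, so something beyond a fixed serial portion is probably involved as well.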

I have just run the simulation with only a 0.001 fraction of the particles recorded to the WATCH files. It shortened the simulation by 30 seconds out of a 10-minute run.

Thank you for offering to run my inputs on your system. Let me get permission to share the files.

Petr
