Different simulation time in two systems

Moderators: cyao, michael_borland

Xu_Liu
Posts: 4
Joined: 19 Sep 2018, 13:06

Different simulation time in two systems

Post by Xu_Liu » 15 Nov 2019, 09:18

Hello,

Recently, I ran exactly the same input files on two systems: Ubuntu 18.04 (workstation, 2.4 GHz, 2 CPUs, 40 cores) and macOS High Sierra (MacBook Pro, 2.7 GHz, 1 CPU, 2 cores). However, the computation times on the two platforms are quite different:

macOS: 00:00:38
Ubuntu: 00:09:34

How can I increase the computation speed on the Ubuntu system? Thank you!
Attachments
run.ele
(3.94 KiB) Downloaded 126 times
beamline.lte
(8.93 KiB) Downloaded 125 times

soliday
Posts: 390
Joined: 28 May 2008, 09:15

Re: Different simulation time in two systems

Post by soliday » 15 Nov 2019, 09:59

I am seeing it too. It prints odd time statistics at completion:

statistics: ET: 00:06:16 CP: 35.14 BIO:0 DIO:0 PF:0 MEM:19768

This is telling me that it spent 35 seconds of CPU time but took more than 6 minutes to complete. When I run top, I see that it is spending most of its time in a D state (uninterruptible sleep, usually disk I/O wait) instead of an R (running) state, which I don't understand yet either. I will look into this further.
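
If you want to check this on your own run, here is a minimal sketch (assuming Linux, and that you pass the actual elegant PID on the command line) that samples the process state from /proc:

Code: Select all

# sample_state.py -- watch a process's scheduler state via /proc (Linux only)
# R = running, S = sleeping, D = uninterruptible sleep (usually waiting on disk I/O)
import sys
import time

pid = int(sys.argv[1])          # PID of the running elegant process

counts = {}
for _ in range(100):            # 100 samples, one every 0.1 s
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # the state letter is the first field after the ")" closing the command name
    state = data.rsplit(")", 1)[1].split()[0]
    counts[state] = counts.get(state, 0) + 1
    time.sleep(0.1)

print(counts)                   # e.g. {'D': 85, 'R': 15} means mostly blocked on I/O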

soliday
Posts: 390
Joined: 28 May 2008, 09:15

Re: Different simulation time in two systems

Post by soliday » 15 Nov 2019, 12:18

This appears to be related to the type of file system. I tested this on our cluster on three different file systems and here is what I got:
1. Lustre
About 50 percent pause time. Probably due to it being heavily used by many other jobs on the cluster at the moment.
2. NFS
About 0 percent pause time.
3. EXT4
About 85 percent pause time. I am guessing this is what you are using.
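
If you want to confirm which file system your output directory actually lives on, running "df -T ." will tell you; here is a small Python sketch (assuming Linux, reading /proc/mounts) that does the same lookup:

Code: Select all

# whichfs.py -- report the file system type holding a given path (Linux only)
import os
import sys

path = os.path.realpath(sys.argv[1] if len(sys.argv) > 1 else ".")

best_mnt, best_type = "", "?"
with open("/proc/mounts") as f:
    for line in f:
        device, mnt, fstype = line.split()[:3]
        # keep the longest mount point that is a prefix of our path
        if path == mnt or path.startswith(mnt.rstrip("/") + "/"):
            if len(mnt) > len(best_mnt):
                best_mnt, best_type = mnt, fstype

print(f"{path} is on {best_type} (mounted at {best_mnt})")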

When I run elegant and look at the stack trace, I see it is sitting in:

[<ffffffffa01625a5>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[<ffffffffa0162938>] jbd2_complete_transaction+0x68/0xb0 [jbd2]
[<ffffffffa0174061>] ext4_sync_file+0x121/0x1d0 [ext4]
[<ffffffff811c0821>] vfs_fsync_range+0xa1/0x100
[<ffffffff811c08ed>] vfs_fsync+0x1d/0x20
[<ffffffff811c092e>] do_fsync+0x3e/0x60
[<ffffffff811c0980>] sys_fsync+0x10/0x20
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

This is basically telling me that it is waiting for the file output to be written during these pauses. That would explain why writing to different file systems results in different speeds. My macOS laptop has an SSD, which doesn't appear to introduce any significant lag when writing files. The EXT4 file systems on our cluster's compute nodes are not on SSDs and, as a result, are much slower.
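
To get a feel for how much a per-record fsync() can cost, here is a rough timing sketch; it is a generic illustration of the write-then-fsync pattern in the stack trace above, not elegant's actual output code:

Code: Select all

# fsync_cost.py -- compare buffered writes with a write+fsync per record
# A generic illustration of why frequent fsync() calls stall on slow disks.
import os
import time

N = 200
payload = b"x" * 4096           # one 4 kB "record" per write

def run(sync_every_write):
    fd = os.open("fsync_test.tmp", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    start = time.perf_counter()
    for _ in range(N):
        os.write(fd, payload)
        if sync_every_write:
            os.fsync(fd)        # force the data (and the journal commit) to disk
    os.close(fd)
    return time.perf_counter() - start

buffered = run(False)
synced = run(True)
os.remove("fsync_test.tmp")
print(f"buffered: {buffered:.3f} s    fsync per write: {synced:.3f} s")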

I am not sure if there is a software solution to this disparity, other than writing less intermediate output.

Xu_Liu
Posts: 4
Joined: 19 Sep 2018, 13:06

Re: Different simulation time in two systems

Post by Xu_Liu » 18 Nov 2019, 01:46

Thanks for your test and detailed explanations.
The file system on my workstation is NTFS. Anyway, I will run the simulations on my macOS laptop for the faster calculation speed.

Best Regards, Xu Liu

michael_borland
Posts: 1927
Joined: 19 May 2008, 09:33
Location: Argonne National Laboratory
Contact:

Re: Different simulation time in two systems

Post by michael_borland » 20 Nov 2019, 16:55

For the parallel version, better performance is often obtained by modifying the MPI I/O parameters. For example, try putting this command at the top of the .ele file:

Code: Select all

&global_settings
        mpi_io_read_buffer_size = 400000,
        mpi_io_write_buffer_size = 400000,
&end
--Michael
