Different simulation time in two systems

Posted: 15 Nov 2019, 09:18
by Xu_Liu
Hello,

Recently, I ran exactly the same code on two systems: Ubuntu 18.04 (workstation, 2.4 GHz, 2 CPUs, 40 cores) and macOS High Sierra (MacBook Pro, 2.7 GHz, 1 CPU, 2 cores). However, the computation time on the two platforms is quite different:

macOS: 00:00:38
Ubuntu: 00:09:34

How can I increase the computation speed on the Ubuntu system? Thank you!

Re: Different simulation time in two systems

Posted: 15 Nov 2019, 09:59
by soliday
I am seeing it too. It is printing out odd time statistics at the completion:

statistics: ET: 00:06:16 CP: 35.14 BIO:0 DIO:0 PF:0 MEM:19768

This tells me that it spent 35 seconds of CPU time but took more than 6 minutes to complete. When I run top, I see that it spends most of its time in the D state (uninterruptible sleep, usually waiting on I/O) instead of the R state (running), which I also don't understand yet. I will look into this further.
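
As a rough illustration of the arithmetic, here is a minimal Python sketch that pulls the elapsed and CPU times out of the statistics line and estimates the fraction of wall-clock time spent waiting. It assumes ET is elapsed time in hh:mm:ss and CP is CPU time in seconds, as read above; parse_stats is a hypothetical helper and not part of elegant.

```python
import re

def parse_stats(line):
    """Return (elapsed_seconds, cpu_seconds) from a statistics line.

    Assumes "ET: hh:mm:ss" is wall-clock time and "CP: x" is CPU seconds.
    """
    et = re.search(r"ET:\s*(\d+):(\d+):(\d+)", line)
    cp = re.search(r"CP:\s*([\d.]+)", line)
    h, m, s = (int(x) for x in et.groups())
    return h * 3600 + m * 60 + s, float(cp.group(1))

line = "statistics: ET: 00:06:16 CP: 35.14 BIO:0 DIO:0 PF:0 MEM:19768"
elapsed, cpu = parse_stats(line)

# Fraction of wall-clock time in which the process was not using the CPU.
wait_fraction = 1.0 - cpu / elapsed
print(f"elapsed={elapsed}s cpu={cpu}s waiting={wait_fraction:.0%}")
```

For the line above this gives 376 seconds elapsed against 35.14 seconds of CPU, so roughly 91 percent of the run was spent waiting rather than computing.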

Re: Different simulation time in two systems

Posted: 15 Nov 2019, 12:18
by soliday
This appears to be related to the type of file system. I tested this on our cluster on three different file systems and here is what I got:
1. Lustre
About 50 percent pause time. Probably due to it being heavily used by many other jobs on the cluster at the moment.
2. NFS
About 0 percent pause time.
3. EXT4
About 85 percent pause time. I am guessing this is probably what you are using.

When I run elegant and look at the stack trace, I see it is sitting in:

[<ffffffffa01625a5>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[<ffffffffa0162938>] jbd2_complete_transaction+0x68/0xb0 [jbd2]
[<ffffffffa0174061>] ext4_sync_file+0x121/0x1d0 [ext4]
[<ffffffff811c0821>] vfs_fsync_range+0xa1/0x100
[<ffffffff811c08ed>] vfs_fsync+0x1d/0x20
[<ffffffff811c092e>] do_fsync+0x3e/0x60
[<ffffffff811c0980>] sys_fsync+0x10/0x20
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

This is basically telling me that it is waiting for the file output to be written during these pauses. That would explain why writing to different file systems results in different speeds. My macOS laptop has an SSD, which doesn't appear to introduce any significant lag when writing files. The EXT4 file systems on our cluster's compute nodes are not on SSDs and, as a result, are much slower.
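
To get a feel for how much an fsync call costs on a given file system, something like the following Python sketch can be run in the directory where the output would be written. It is only an illustration, not elegant's actual output code: time_writes is a hypothetical helper, and the record count and record format are arbitrary assumptions.

```python
import os
import tempfile
import time

def time_writes(n_records, sync_each, directory=None):
    """Write n_records short lines to a temp file and return elapsed seconds.

    If sync_each is True, flush and fsync after every record, forcing the
    file system to commit the data before the next write proceeds.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "w") as f:
            for i in range(n_records):
                f.write(f"{i} 0.0\n")
                if sync_each:
                    f.flush()
                    os.fsync(f.fileno())
        return time.perf_counter() - start
    finally:
        os.remove(path)

# Pass directory="..." to test a specific file system instead of the default temp dir.
buffered = time_writes(200, sync_each=False)
synced = time_writes(200, sync_each=True)
print(f"buffered: {buffered:.4f} s, fsync per record: {synced:.4f} s")
```

On a journaling file system backed by a spinning disk, the fsync-per-record case would be expected to take far longer than the buffered one, while on an SSD or a RAM-backed file system the gap may be small.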

Not sure if there is a software solution to this disparity, other than not writing as much intermediate output.

Re: Different simulation time in two systems

Posted: 18 Nov 2019, 01:46
by Xu_Liu
Thanks for your test and detailed explanations.
The file system on my workstation is NTFS. Anyway, I will run the simulations on my macOS laptop for faster calculations.

Best Regards, Xu Liu

Re: Different simulation time in two systems

Posted: 20 Nov 2019, 16:55
by michael_borland
For the parallel version, better performance is often obtained by modifying the MPI IO parameters. E.g., try putting this command at the top of the .ele file:

&global_settings
        mpi_io_read_buffer_size = 400000,
        mpi_io_write_buffer_size = 400000,
&end
--Michael