Hello,
Recently, I ran exactly the same code on two systems: Ubuntu 18.04 (workstation, 2.4 GHz, 2 CPUs, 40 cores) and macOS High Sierra (MacBook Pro, 2.7 GHz, 1 CPU, 2 cores). The computation times on the two platforms were quite different:
macOS: 00:00:38
Ubuntu: 00:09:34
How can I increase the computation speed on the Ubuntu system? Thank you!
Different simulation time in two systems
Moderators: cyao, michael_borland
Attachments:
- run.ele (3.94 KiB)
- beamline.lte (8.93 KiB)
Re: Different simulation time in two systems
I am seeing it too. It prints odd time statistics at completion:
statistics: ET: 00:06:16 CP: 35.14 BIO:0 DIO:0 PF:0 MEM:19768
This tells me it used only about 35 seconds of CPU time (CP) but took more than 6 minutes of elapsed time (ET) to complete. When I run top, I see that it spends most of its time in the D state instead of the R state, which I also don't understand yet. I will look into this further.
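The D state can be confirmed from /proc while a pause is underway. A minimal sketch (Linux-only; `<pid>` stands for the stalled process's ID, and reading another process's kernel stack typically requires root):

```shell
# Show a process's scheduler state: "R (running)" means it is computing,
# "D (disk sleep)" means it is blocked in uninterruptible I/O (e.g. fsync).
# Demonstrated here on the grep process itself, which is necessarily running.
grep '^State:' /proc/self/status

# For a stalled job, inspect its own entry instead:
#   grep '^State:' /proc/<pid>/status
#   sudo cat /proc/<pid>/stack    # kernel stack at the moment of the pause
```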
Re: Different simulation time in two systems
This appears to be related to the type of file system. I tested this on our cluster on three different file systems and here is what I got:
1. Lustre
About 50 percent pause time. Probably due to it being heavily used by many other jobs on the cluster at the moment.
2. NFS
About 0 percent pause time.
3. EXT4
About 85 percent pause time. I am guessing this is probably what you are using.
When I run elegant and look at the stack trace, I see it is sitting in:
[<ffffffffa01625a5>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[<ffffffffa0162938>] jbd2_complete_transaction+0x68/0xb0 [jbd2]
[<ffffffffa0174061>] ext4_sync_file+0x121/0x1d0 [ext4]
[<ffffffff811c0821>] vfs_fsync_range+0xa1/0x100
[<ffffffff811c08ed>] vfs_fsync+0x1d/0x20
[<ffffffff811c092e>] do_fsync+0x3e/0x60
[<ffffffff811c0980>] sys_fsync+0x10/0x20
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
This is basically telling me that it is waiting for file output to be written during these pauses. That would explain why writing to different file systems results in different speeds. My macOS laptop has an SSD, which doesn't appear to introduce any significant lag when writing files. The EXT4 file systems on our cluster's compute nodes are not on SSDs and, as a result, are much slower.
Not sure if there is a software solution to this disparity, other than not writing as much intermediate output.
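The fsync penalty can be felt directly on any file system with a quick write test. A rough sketch using GNU dd (file path and sizes are arbitrary; `oflag=sync` forces a flush on every write, so on a journaled EXT4 spinning disk the second run is typically far slower, while on an SSD the gap is small):

```shell
# Write ~4 MB in 4 KiB blocks: first buffered, then syncing every write.
f=/tmp/fsync_probe.bin
dd if=/dev/zero of="$f" bs=4k count=1000 2>/dev/null && echo "buffered write done"
dd if=/dev/zero of="$f" bs=4k count=1000 oflag=sync 2>/dev/null && echo "synced write done"
rm -f "$f"
```

Timing the two runs (e.g. with `time`) on the EXT4 scratch area versus NFS should reproduce the pause-time pattern described above.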
Re: Different simulation time in two systems
Thanks for your test and detailed explanations.
The file system on my workstation is NTFS. Anyway, I will run the simulations on my macOS laptop for the faster calculation speed.
Best Regards, Xu Liu
Re: Different simulation time in two systems
For the parallel version, better performance is often obtained by modifying the MPI IO parameters. E.g., try putting this command at the top of the .ele file:
&global_settings
        mpi_io_read_buffer_size = 400000,
        mpi_io_write_buffer_size = 400000,
&end

--Michael