Intel® MPI Library

Per-rank timing heterogeneity in MPI_File_write_at_all

4f0drlp7eyj3
Beginner
914 Views
Hi, I ran a timing test for 16 ranks writing 500 MB each from a single Skylake node to a Lustre PFS. Simple C test code (which I can provide), built with Intel 2020.1.217. What I see is

Rank 1 has diff 1.749272
Rank 8 has diff 1.764557
Rank 11 has diff 1.777109
Rank 4 has diff 1.782356
Rank 6 has diff 1.833384
Rank 15 has diff 1.858618
Rank 0 has diff 3.101054
Rank 10 has diff 4.237715
Rank 12 has diff 4.288582
Rank 7 has diff 4.291451
Rank 3 has diff 4.295812
Rank 14 has diff 4.302241
Rank 5 has diff 4.339086
Rank 13 has diff 5.141735
Rank 9 has diff 5.141858
Rank 2 has diff 5.209390

With another MPI implementation, I see

Rank 0 has diff 7.493546
Rank 10 has diff 7.493548
Rank 13 has diff 7.493549
Rank 14 has diff 7.493545
Rank 15 has diff 7.493545
Rank 1 has diff 7.493545
Rank 2 has diff 7.493544
Rank 3 has diff 7.493544
Rank 4 has diff 7.493544
Rank 5 has diff 7.493546
Rank 6 has diff 7.493545
Rank 7 has diff 7.493546
Rank 8 has diff 7.493552
Rank 9 has diff 7.493545
Rank 11 has diff 7.493548
Rank 12 has diff 7.493545

My question is, why is Intel's timing so heterogeneous? The two implementations clearly are using different algorithms, but Intel looks like it's getting better timings somehow (through buffering, scheduling, ?).

Thanks; Chris
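
To make the comparison in the post concrete, here is a small Python sketch that summarizes the two sets of per-rank numbers. It rests on assumptions the post does not state: that each "diff" is the rank's elapsed wall-clock time in seconds around the collective write, and that the job's effective throughput is total data divided by the slowest rank's time. The `summarize` helper and the dictionary names are illustrative only and are not taken from the poster's test code.

```python
import statistics

# Per-rank "diff" values copied from the post (assumed to be seconds).
intel = {
    1: 1.749272, 8: 1.764557, 11: 1.777109, 4: 1.782356,
    6: 1.833384, 15: 1.858618, 0: 3.101054, 10: 4.237715,
    12: 4.288582, 7: 4.291451, 3: 4.295812, 14: 4.302241,
    5: 4.339086, 13: 5.141735, 9: 5.141858, 2: 5.209390,
}
other = {
    0: 7.493546, 10: 7.493548, 13: 7.493549, 14: 7.493545,
    15: 7.493545, 1: 7.493545, 2: 7.493544, 3: 7.493544,
    4: 7.493544, 5: 7.493546, 6: 7.493545, 7: 7.493546,
    8: 7.493552, 9: 7.493545, 11: 7.493548, 12: 7.493545,
}

MB_PER_RANK = 500  # from the post: each rank writes 500 MB


def summarize(times):
    """Return min, max, spread, population stdev and aggregate MB/s."""
    values = list(times.values())
    fastest, slowest = min(values), max(values)
    return {
        "min": fastest,
        "max": slowest,
        "spread": slowest - fastest,
        "stdev": statistics.pstdev(values),
        # Assumption: the write as a whole is done when the slowest rank returns.
        "aggregate_mb_per_s": MB_PER_RANK * len(values) / slowest,
    }


if __name__ == "__main__":
    for name, times in (("Intel MPI", intel), ("other MPI", other)):
        s = summarize(times)
        print(
            f"{name}: min={s['min']:.3f} s  max={s['max']:.3f} s  "
            f"spread={s['spread']:.6f} s  "
            f"aggregate={s['aggregate_mb_per_s']:.0f} MB/s"
        )
```

Under these assumptions, even the slowest Intel MPI rank (about 5.21 s) returns before any rank of the other implementation (about 7.49 s), which matches the observation that Intel "looks like it's getting better timings". The numbers alone do not explain why the spread differs between the two implementations.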
0 Kudos
4 Replies
SantoshY_Intel
Moderator
876 Views

Hi,

Thanks for reaching out to us.

>> "I ran a timing test for 16 ranks writing 500 MB each from a single Skylake node to a Lustre PFS. Simple C test code (which I can provide), built with Intel 2020.1.217."

Could you please share the sample reproducer code?

Thanks & Regards,

Santosh

0 Kudos
SantoshY_Intel
Moderator
805 Views

Hi,


Could you please provide the sample reproducer code requested above?


Thanks & Regards,

Santosh


0 Kudos
SantoshY_Intel
Moderator
663 Views

Hi,

Since we have been working with you on this internally, we are closing this thread as requested.


Thanks & Regards,

Santosh


0 Kudos