Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Measuring Memory Bandwidth of a System ( MBS )

SergeyKostrov
Valued Contributor II

Measuring the Memory Bandwidth of a System ( MBS ) is a tricky task. In my test I wanted to show that
MBS depends on the priority of the application that measures it.

To measure MBS I used a modified test case provided by Patrick Fay (Intel) in Post #10 of this thread:

http://software.intel.com/en-us/forums/showthread.php?t=102690&o=a&s=lr

Please take a look at my data:

Process Priority IDLE ( PPI ):

Test 01: Memory Bandwidth [ 1647.523 MB/sec 1.609 GB/sec ] Array size: 32 MB
Test 02: Memory Bandwidth [ 1865.626 MB/sec 1.822 GB/sec ] Array size: 32 MB
Test 03: Memory Bandwidth [ 1868.982 MB/sec 1.825 GB/sec ] Array size: 32 MB
Test 04: Memory Bandwidth [ 1868.982 MB/sec 1.825 GB/sec ] Array size: 32 MB
Test 05: Memory Bandwidth [ 1868.982 MB/sec 1.825 GB/sec ] Array size: 32 MB
Test 06: Memory Bandwidth [ 1868.982 MB/sec 1.825 GB/sec ] Array size: 32 MB
Test 07: Memory Bandwidth [ 1875.693 MB/sec 1.832 GB/sec ] Array size: 32 MB
Test 08: Memory Bandwidth [ 1879.048 MB/sec 1.835 GB/sec ] Array size: 32 MB
Test 09: Memory Bandwidth [ 1875.693 MB/sec 1.832 GB/sec ] Array size: 32 MB
Test 10: Memory Bandwidth [ 1879.048 MB/sec 1.835 GB/sec ] Array size: 32 MB
Test 11: Memory Bandwidth [ 1875.693 MB/sec 1.832 GB/sec ] Array size: 32 MB
Test 12: Memory Bandwidth [ 1879.048 MB/sec 1.835 GB/sec ] Array size: 32 MB
Test 13: Memory Bandwidth [ 1879.048 MB/sec 1.835 GB/sec ] Array size: 32 MB
Test 14: Memory Bandwidth [ 1875.693 MB/sec 1.832 GB/sec ] Array size: 32 MB
Test 15: Memory Bandwidth [ 1879.048 MB/sec 1.835 GB/sec ] Array size: 32 MB
Test 16: Memory Bandwidth [ 1879.048 MB/sec 1.835 GB/sec ] Array size: 32 MB

Max MBS for PPI: [ 1879.048 MB/sec 1.835 GB/sec ]

Process Priority NORMAL ( PPN ):

Test 01: Memory Bandwidth [ 1858.916 MB/sec 1.815 GB/sec ] Array size: 32 MB
Test 02: Memory Bandwidth [ 1865.626 MB/sec 1.822 GB/sec ] Array size: 32 MB
Test 03: Memory Bandwidth [ 1865.626 MB/sec 1.822 GB/sec ] Array size: 32 MB
Test 04: Memory Bandwidth [ 1865.626 MB/sec 1.822 GB/sec ] Array size: 32 MB
Test 05: Memory Bandwidth [ 1868.982 MB/sec 1.825 GB/sec ] Array size: 32 MB
Test 06: Memory Bandwidth [ 1879.048 MB/sec 1.835 GB/sec ] Array size: 32 MB
Test 07: Memory Bandwidth [ 1879.048 MB/sec 1.835 GB/sec ] Array size: 32 MB
Test 08: Memory Bandwidth [ 1879.048 MB/sec 1.835 GB/sec ] Array size: 32 MB
Test 09: Memory Bandwidth [ 1872.337 MB/sec 1.828 GB/sec ] Array size: 32 MB
Test 10: Memory Bandwidth [ 1879.048 MB/sec 1.835 GB/sec ] Array size: 32 MB
Test 11: Memory Bandwidth [ 1875.693 MB/sec 1.832 GB/sec ] Array size: 32 MB
Test 12: Memory Bandwidth [ 1879.048 MB/sec 1.835 GB/sec ] Array size: 32 MB
Test 13: Memory Bandwidth [ 1879.048 MB/sec 1.835 GB/sec ] Array size: 32 MB
Test 14: Memory Bandwidth [ 1875.693 MB/sec 1.832 GB/sec ] Array size: 32 MB
Test 15: Memory Bandwidth [ 1879.048 MB/sec 1.835 GB/sec ] Array size: 32 MB
Test 16: Memory Bandwidth [ 1879.048 MB/sec 1.835 GB/sec ] Array size: 32 MB

Max MBS for PPN: [ 1879.048 MB/sec 1.835 GB/sec ]

Process Priority HIGH ( PPH ):

Test 01: Memory Bandwidth [ 1875.693 MB/sec 1.832 GB/sec ] Array size: 32 MB
Test 02: Memory Bandwidth [ 1882.404 MB/sec 1.838 GB/sec ] Array size: 32 MB
Test 03: Memory Bandwidth [ 1882.404 MB/sec 1.838 GB/sec ] Array size: 32 MB
Test 04: Memory Bandwidth [ 1882.404 MB/sec 1.838 GB/sec ] Array size: 32 MB
Test 05: Memory Bandwidth [ 1882.404 MB/sec 1.838 GB/sec ] Array size: 32 MB
Test 06: Memory Bandwidth [ 1882.404 MB/sec 1.838 GB/sec ] Array size: 32 MB
Test 07: Memory Bandwidth [ 1882.404 MB/sec 1.838 GB/sec ] Array size: 32 MB
Test 08: Memory Bandwidth [ 1882.404 MB/sec 1.838 GB/sec ] Array size: 32 MB
Test 09: Memory Bandwidth [ 1879.048 MB/sec 1.835 GB/sec ] Array size: 32 MB
Test 10: Memory Bandwidth [ 1882.404 MB/sec 1.838 GB/sec ] Array size: 32 MB
Test 11: Memory Bandwidth [ 1882.404 MB/sec 1.838 GB/sec ] Array size: 32 MB
Test 12: Memory Bandwidth [ 1885.759 MB/sec 1.842 GB/sec ] Array size: 32 MB
Test 13: Memory Bandwidth [ 1882.404 MB/sec 1.838 GB/sec ] Array size: 32 MB
Test 14: Memory Bandwidth [ 1882.404 MB/sec 1.838 GB/sec ] Array size: 32 MB
Test 15: Memory Bandwidth [ 1882.404 MB/sec 1.838 GB/sec ] Array size: 32 MB
Test 16: Memory Bandwidth [ 1882.404 MB/sec 1.838 GB/sec ] Array size: 32 MB

Max MBS for PPH: [ 1885.759 MB/sec 1.842 GB/sec ]

Process Priority REALTIME ( PPR ):

Test 01: Memory Bandwidth [ 1889.115 MB/sec 1.845 GB/sec ] Array size: 32 MB
Test 02: Memory Bandwidth [ 1889.115 MB/sec 1.845 GB/sec ] Array size: 32 MB
Test 03: Memory Bandwidth [ 1885.759 MB/sec 1.842 GB/sec ] Array size: 32 MB
Test 04: Memory Bandwidth [ 1889.115 MB/sec 1.845 GB/sec ] Array size: 32 MB
Test 05: Memory Bandwidth [ 1889.115 MB/sec 1.845 GB/sec ] Array size: 32 MB
Test 06: Memory Bandwidth [ 1889.115 MB/sec 1.845 GB/sec ] Array size: 32 MB
Test 07: Memory Bandwidth [ 1885.759 MB/sec 1.842 GB/sec ] Array size: 32 MB
Test 08: Memory Bandwidth [ 1889.115 MB/sec 1.845 GB/sec ] Array size: 32 MB
Test 09: Memory Bandwidth [ 1889.115 MB/sec 1.845 GB/sec ] Array size: 32 MB
Test 10: Memory Bandwidth [ 1889.115 MB/sec 1.845 GB/sec ] Array size: 32 MB
Test 11: Memory Bandwidth [ 1889.115 MB/sec 1.845 GB/sec ] Array size: 32 MB
Test 12: Memory Bandwidth [ 1889.115 MB/sec 1.845 GB/sec ] Array size: 32 MB
Test 13: Memory Bandwidth [ 1885.759 MB/sec 1.842 GB/sec ] Array size: 32 MB
Test 14: Memory Bandwidth [ 1889.115 MB/sec 1.845 GB/sec ] Array size: 32 MB
Test 15: Memory Bandwidth [ 1889.115 MB/sec 1.845 GB/sec ] Array size: 32 MB
Test 16: Memory Bandwidth [ 1889.115 MB/sec 1.845 GB/sec ] Array size: 32 MB

Max MBS for PPR: [ 1889.115 MB/sec 1.845 GB/sec ]

Summary:

Max MBS for PPR: [ 1889.115 MB/sec 1.845 GB/sec ] ( 100.00% )
Max MBS for PPH: [ 1885.759 MB/sec 1.842 GB/sec ] ( 99.82% )
Max MBS for PPN: [ 1879.048 MB/sec 1.835 GB/sec ] ( 99.47% )
Max MBS for PPI: [ 1879.048 MB/sec 1.835 GB/sec ] ( 99.47% )

Final MBS value:

MBS: 1889.115 MB/sec ( 1.845 GB/sec ) when the process priority was Realtime.
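
For readers who want to reproduce numbers of this kind, here is a minimal sketch of a single-threaded read-bandwidth measurement. It is not the test case from Post #10 (that code is not reproduced in this thread); the function names are hypothetical, the timer is the C11 timespec_get() call, and 1 MB is taken as 2^20 bytes, which matches the MB/sec to GB/sec ratio of 1024 in the tables above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock seconds from the C11 timespec_get() call. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: reads `bytes` of memory `passes` times and returns
 * the best rate in MB/sec (1 MB = 2^20 bytes), or -1.0 on failure. */
double read_bandwidth_mb_per_sec(size_t bytes, int passes)
{
    size_t n = bytes / sizeof(uint64_t);
    uint64_t *a = malloc(n * sizeof *a);
    volatile uint64_t sink = 0;
    double best = 0.0;

    if (a == NULL || n == 0) {
        free(a);
        return -1.0;
    }

    /* Touch every page before timing so page faults are not measured. */
    for (size_t i = 0; i < n; i++)
        a[i] = i;

    for (int p = 0; p < passes; p++) {
        uint64_t sum = 0;
        double t0 = now_seconds();
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        double dt = now_seconds() - t0;

        sink = sum;  /* keeps the compiler from deleting the read loop */
        if (dt > 0.0) {
            double rate = (double)(n * sizeof *a) / (1024.0 * 1024.0) / dt;
            if (rate > best)
                best = rate;
        }
    }

    (void)sink;
    free(a);
    return best;
}
```

Calling read_bandwidth_mb_per_sec(32u * 1024u * 1024u, 16) would mirror the 32 MB array and 16 passes used above. Taking the best pass rather than the average is a design choice that matches the "Max MBS" rows.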

1 Solution
Patrick_F_Intel1
Employee
Hello Sergey,
Yeah, many utilities raise the priority of the threads doing the memory bw test.
Also, if you are doing a single-threaded memory bw test, it can matter on which cpu you are running.
On some operating systems, cpu 0 for example, can be pretty busy and the mem bw thread won't get as much time as on other cpus.
For some of my utilities I also report the user cpu time for the thread and compare that to the elapsed time.
Usually this will account for the difference you see above without having to mess with priorities.
Although you are only seeing a difference of ~0.54% max PPR compared to PPI, I have seen more variation (up to a few percent IIRC).
Pat
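
The CPU-time check described above can be sketched as follows. This is an assumption-laden illustration, not Pat's utility: it uses the POSIX clock_gettime() call, and CLOCK_THREAD_CPUTIME_ID counts user plus system time for the thread rather than user time alone. On Windows, GetThreadTimes and QueryPerformanceCounter would be the likely counterparts. The names cpu_time_share and spin_work are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Seconds read from the given POSIX clock. */
static double clock_seconds(clockid_t id)
{
    struct timespec ts;
    clock_gettime(id, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs work(arg) and returns (thread CPU time) / (elapsed time).
 * A value close to 1.0 means the thread had the CPU almost the whole time. */
double cpu_time_share(void (*work)(void *), void *arg)
{
    double wall0 = clock_seconds(CLOCK_MONOTONIC);
    double cpu0 = clock_seconds(CLOCK_THREAD_CPUTIME_ID);

    work(arg);

    double cpu1 = clock_seconds(CLOCK_THREAD_CPUTIME_ID);
    double wall1 = clock_seconds(CLOCK_MONOTONIC);
    double elapsed = wall1 - wall0;

    return elapsed > 0.0 ? (cpu1 - cpu0) / elapsed : 0.0;
}

/* Hypothetical stand-in for the memory bandwidth loop. */
void spin_work(void *arg)
{
    volatile unsigned long *counter = arg;
    for (unsigned long i = 0; i < 20000000UL; i++)
        *counter += i;
}
```

If the share comes out well below 1.0, the thread spent part of the elapsed time off the CPU, so a bandwidth figure computed from elapsed time understates what the memory system can deliver.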

SergeyKostrov
Valued Contributor II
Hi Patrick,

Quoting Patrick Fay (Intel)
...
Yeah, many utilities raise the priority of the threads doing the memory bw test.

[SergeyK] Thanks for the information. I didn't know this.

Also, if you are doing a single-threaded memory bw test, it can matter on which cpu you are running.
On some operating systems, cpu 0 for example, can be pretty busy and the mem bw thread won't get as much time as on other cpus.

[SergeyK] So far I've done tests on a single CPU computer with an application that has just one thread.

...


I've prepared some more data. Please take a look when you have time. I would appreciate your comments.

Best regards,
Sergey

SergeyKostrov
Valued Contributor II

I understand that many "variables" affect the MBS value, and the size of the test Array is
one of them.

Please take a look at how the MBS value changes as the size of the test Array increases:


Process Priority REALTIME ( PPR ) - Array size 32 MB:
...
Test 02: Memory Bandwidth [ 1889.115 MB/sec 1.845 GB/sec ] Array size: 32 MB
...

Process Priority REALTIME ( PPR ) - Array size 64 MB:
...
Test 02: Memory Bandwidth [ 1892.470 MB/sec 1.848 GB/sec ] Array size: 64 MB
...

Process Priority REALTIME ( PPR ) - Array size 128 MB:

Test 01: Memory Bandwidth [ 1892.470 MB/sec 1.848 GB/sec ] Array size: 128 MB
Test 02: Memory Bandwidth [ 1892.470 MB/sec 1.848 GB/sec ] Array size: 128 MB
Test 03: Memory Bandwidth [ 1892.470 MB/sec 1.848 GB/sec ] Array size: 128 MB
Test 04: Memory Bandwidth [ 1892.470 MB/sec 1.848 GB/sec ] Array size: 128 MB
...
Test 16: Memory Bandwidth [ 1892.470 MB/sec 1.848 GB/sec ] Array size: 128 MB

Process Priority REALTIME ( PPR ) - Array size 256 MB:

Test 01: Memory Bandwidth [ 1905.892 MB/sec 1.861 GB/sec ] Array size: 256 MB
Test 02: Memory Bandwidth [ 1905.892 MB/sec 1.861 GB/sec ] Array size: 256 MB
Test 03: Memory Bandwidth [ 1905.892 MB/sec 1.861 GB/sec ] Array size: 256 MB
Test 04: Memory Bandwidth [ 1905.892 MB/sec 1.861 GB/sec ] Array size: 256 MB
...
Test 16: Memory Bandwidth [ 1905.892 MB/sec 1.861 GB/sec ] Array size: 256 MB

Process Priority REALTIME ( PPR ) - Array size 512 MB:

Test 01: Memory Bandwidth [ 1932.735 MB/sec 1.887 GB/sec ] Array size: 512 MB
Test 02: Memory Bandwidth [ 1932.735 MB/sec 1.887 GB/sec ] Array size: 512 MB
Test 03: Memory Bandwidth [ 1932.735 MB/sec 1.887 GB/sec ] Array size: 512 MB
Test 04: Memory Bandwidth [ 1932.735 MB/sec 1.887 GB/sec ] Array size: 512 MB
...
Test 16: Memory Bandwidth [ 1932.735 MB/sec 1.887 GB/sec ] Array size: 512 MB

Process Priority REALTIME ( PPR ) - Array size 1024 MB ( 1GB ):

Test 01: Memory Bandwidth [ 3.821 MB/sec 0.004 GB/sec ] Array size: 1024 MB
Test 02: Memory Bandwidth [ 5.187 MB/sec 0.005 GB/sec ] Array size: 1024 MB
Test 03: Memory Bandwidth [ 4.793 MB/sec 0.005 GB/sec ] Array size: 1024 MB
Test 04: Memory Bandwidth [ 8.948 MB/sec 0.009 GB/sec ] Array size: 1024 MB
Test 05: Memory Bandwidth [ 9.419 MB/sec 0.009 GB/sec ] Array size: 1024 MB
Test 06: Memory Bandwidth [ 5.711 MB/sec 0.006 GB/sec ] Array size: 1024 MB
Test 07: Memory Bandwidth [ 6.279 MB/sec 0.006 GB/sec ] Array size: 1024 MB
Test 08: Memory Bandwidth [ 5.289 MB/sec 0.005 GB/sec ] Array size: 1024 MB
Test 09: Memory Bandwidth [ 5.238 MB/sec 0.005 GB/sec ] Array size: 1024 MB
Test 10: Memory Bandwidth [ 5.423 MB/sec 0.005 GB/sec ] Array size: 1024 MB
Test 11: Memory Bandwidth [ 7.206 MB/sec 0.007 GB/sec ] Array size: 1024 MB
Test 12: Memory Bandwidth [ 7.838 MB/sec 0.008 GB/sec ] Array size: 1024 MB
Test 13: Memory Bandwidth [ 7.018 MB/sec 0.007 GB/sec ] Array size: 1024 MB
Test 14: Memory Bandwidth [ 6.628 MB/sec 0.006 GB/sec ] Array size: 1024 MB
Test 15: Memory Bandwidth [ 6.839 MB/sec 0.007 GB/sec ] Array size: 1024 MB
Test 16: Memory Bandwidth [ 6.628 MB/sec 0.006 GB/sec ] Array size: 1024 MB

Note: Performance is significantly affected. A Virtual Memory ( VM ) file was used and
a VM manager is preempted most of the time.

Summary:

Max MBS for PPR: 1932.735 MB/sec ( 1.887 GB/sec ) - Array size of 512 MB ( 100.00 % )
Max MBS for PPR: 1905.892 MB/sec ( 1.861 GB/sec ) - Array size of 256 MB ( 98.61 % )
Max MBS for PPR: 1892.470 MB/sec ( 1.848 GB/sec ) - Array size of 128 MB ( 97.92 % )

These MBS values are fully reproducible on my system for both Debug and Release configurations.

Final MBS value:

1932.735 MB/sec ( 1.887 GB/sec ) for Array size of 512 MB ( the process priority was Realtime )
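
The percentages and the GB/sec column in the summary follow from simple arithmetic, sketched here with two hypothetical helpers. The divisor of 1024 is inferred from the posted figures, not stated by the author.

```c
/* Percentage of `value` relative to the best measured value. */
double percent_of_best(double value, double best)
{
    return 100.0 * value / best;
}

/* The GB/sec column in the tables equals MB/sec divided by 1024. */
double mb_to_gb(double mb_per_sec)
{
    return mb_per_sec / 1024.0;
}
```

For example, percent_of_best(1905.892, 1932.735) rounds to 98.61 and mb_to_gb(1932.735) rounds to 1.887, matching the 256 MB and 512 MB rows.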

SergeyKostrov
Valued Contributor II

The test case with an Array size of 1024 MB ( 1 GB ) is the most inaccurate because there are I/O
operations with a drive and the Virtual Memory manager is preempted most of the time. There is some
improvement when the process priority is changed from REALTIME to NORMAL:


Process Priority REALTIME ( PPR ) - Array size 1024 MB ( 1GB ):

Test 01: Memory Bandwidth [ 3.821 MB/sec 0.004 GB/sec ] Array size: 1024 MB
Test 02: Memory Bandwidth [ 5.187 MB/sec 0.005 GB/sec ] Array size: 1024 MB
Test 03: Memory Bandwidth [ 4.793 MB/sec 0.005 GB/sec ] Array size: 1024 MB
Test 04: Memory Bandwidth [ 8.948 MB/sec 0.009 GB/sec ] Array size: 1024 MB
Test 05: Memory Bandwidth [ 9.419 MB/sec 0.009 GB/sec ] Array size: 1024 MB
Test 06: Memory Bandwidth [ 5.711 MB/sec 0.006 GB/sec ] Array size: 1024 MB
Test 07: Memory Bandwidth [ 6.279 MB/sec 0.006 GB/sec ] Array size: 1024 MB
Test 08: Memory Bandwidth [ 5.289 MB/sec 0.005 GB/sec ] Array size: 1024 MB
...

Max MBS for PPR: [ 9.419 MB/sec 0.009 GB/sec ] - Array size: 1024 MB

Note: Performance is significantly affected. MBS value is inaccurate.



Process Priority NORMAL ( PPN ) - Array size 1024 MB ( 1GB ):

Test 01: Memory Bandwidth [ 46.684 MB/sec 0.046 GB/sec ] Array size: 1024 MB
Test 02: Memory Bandwidth [ 27.532 MB/sec 0.027 GB/sec ] Array size: 1024 MB
Test 03: Memory Bandwidth [ 31.581 MB/sec 0.031 GB/sec ] Array size: 1024 MB
Test 04: Memory Bandwidth [ 28.256 MB/sec 0.028 GB/sec ] Array size: 1024 MB
Test 05: Memory Bandwidth [ 30.678 MB/sec 0.030 GB/sec ] Array size: 1024 MB
Test 06: Memory Bandwidth [ 26.189 MB/sec 0.026 GB/sec ] Array size: 1024 MB
Test 07: Memory Bandwidth [ 27.532 MB/sec 0.027 GB/sec ] Array size: 1024 MB
Test 08: Memory Bandwidth [ 24.971 MB/sec 0.024 GB/sec ] Array size: 1024 MB
...

Max MBS for PPN: [ 46.684 MB/sec 0.046 GB/sec ] - Array size: 1024 MB

Note: Performance is less affected. MBS value is inaccurate.


Conclusion: Based on my tests, a 512 MB size for the test Array gives the most accurate values for MBS.
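
One way to avoid the paging case is to check the array size against available physical memory before running the test. The sketch below is an assumption, not part of the original test case: it uses sysconf() with _SC_AVPHYS_PAGES, which is a glibc extension rather than standard POSIX, and the helper name is hypothetical. On Windows, GlobalMemoryStatusEx would be the likely counterpart.

```c
#include <stddef.h>
#include <unistd.h>

/* Returns 1 if `bytes` is at most `fraction` of the currently available
 * physical memory, 0 if it is larger, and -1 if the amount of available
 * memory cannot be determined. */
int fits_in_available_ram(size_t bytes, double fraction)
{
    long pages = sysconf(_SC_AVPHYS_PAGES);  /* glibc extension */
    long page_size = sysconf(_SC_PAGESIZE);

    if (pages <= 0 || page_size <= 0)
        return -1;

    double available = (double)pages * (double)page_size;
    return (double)bytes <= fraction * available ? 1 : 0;
}
```

A test harness could skip any array size for which fits_in_available_ram(size, 0.5) does not return 1; the 0.5 margin is an arbitrary illustrative choice.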

SergeyKostrov
Valued Contributor II
Another "variable" that affects the MBS value is the C/C++ compiler. In this test the application was compiled
with the MinGW C/C++ compiler:


Process Priority REALTIME ( PPR ) - Array size 512 MB:


Test 01: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB
Test 02: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB
Test 03: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB
Test 04: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB
Test 05: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB
Test 06: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB
Test 07: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB
Test 08: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB
Test 09: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB
Test 10: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB
Test 11: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB
Test 12: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB
Test 13: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB
Test 14: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB
Test 15: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB
Test 16: Memory Bandwidth [ 1986.422 MB/sec 1.940 GB/sec ] Array size: 512 MB

Max MBS for PPR: [ 1986.422 MB/sec 1.940 GB/sec ] - Array size: 512 MB

Best MBS value:

1986.422 MB/sec ( 1.940 GB/sec ) for Array size of 512 MB ( the process priority was Realtime )

Summary:

MBS: 1986.422 MB/sec ( 1.940 GB/sec ) - Array size of 512 MB ( 100.00 % ) - MinGW - Best Value
MBS: 1932.735 MB/sec ( 1.887 GB/sec ) - Array size of 512 MB ( 97.30 % ) - MSVC

Patrick_F_Intel1
Employee

Hey Sergey,
Here is the approach I've adopted after a dozen years of measuring memory bw.
For my own memory/latency utility, I use assembly code.
This avoids the variation in performance due to different compilers.
My utility has to work on all variations of linux, windows, android, etc.
This approach means I have 4 asm files (32 & 64 bit windows, 32 & 64 bit linux) but I rarely have to change them.
In general I'm more interested in the relative performance of systems than I am in the absolute best performance.
That is, internally folks use my utility to compare box1 to box2, and, if the bw is off by much, then they have to start digging to see where the difference is. Or some folks run my utility before each benchmark measurement they take to see if someone has changed the box (changed the DIMMs or bios settings); it is usually a shared box in a lab.
My approach usually gets within a few percent of the absolute best performance.
This works ok for our platform debug, sanity check purposes.
For external (outside of Intel) purposes, we usually use the stream benchmark. See http://www.cs.virginia.edu/stream.
I don't use stream much but it is pretty much the industry standard for mem bw numbers.
Roman has a paper on using PCM to dissect stream at http://software.intel.com/en-us/blogs/2010/11/23/dissecting-stream-benchmark-with-intel-performance-counter-monitor/
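
As a rough illustration of what stream measures, here is a sketch of a Triad-style kernel and its byte accounting. This is not the official stream source; the function names are hypothetical. The accounting of 24 bytes per iteration (two 8-byte reads and one 8-byte write) follows the convention stream uses for Triad.

```c
#include <stddef.h>

/* Triad-style kernel: a[i] = b[i] + q * c[i]. */
void triad(double *a, const double *b, const double *c, double q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}

/* Bytes counted for one Triad pass over n elements:
 * two reads and one write of a double per iteration. */
double triad_bytes(size_t n)
{
    return 3.0 * (double)sizeof(double) * (double)n;
}
```

Dividing triad_bytes(n) by the measured time of one triad() pass gives a bandwidth figure. The arrays should be much larger than the last-level cache for the result to reflect memory rather than cache.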

My 'simpler' mem bw tests (a read test, a write test, a latency test) also allow me to test various memory bw related perfmon events.
The read test lets me check demand read miss events.
The write test lets me check writeback events.
The latency test can be used to check the latency events.
Pat
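
A latency test of the kind mentioned above is commonly built as a pointer chase, where each load depends on the previous one. The sketch below is a generic C illustration under that assumption, not Pat's assembly utility; the function names are hypothetical. It uses Sattolo's shuffle so that the index chain forms a single cycle through all slots.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock seconds from the C11 timespec_get() call. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Returns the average time in nanoseconds per dependent load while
 * chasing `steps` links through a random single-cycle chain of `n`
 * slots, or -1.0 on failure. */
double chase_latency_ns(size_t n, size_t steps)
{
    size_t *next = malloc(n * sizeof *next);

    if (next == NULL || n < 2 || steps == 0) {
        free(next);
        return -1.0;
    }

    for (size_t i = 0; i < n; i++)
        next[i] = i;

    /* Sattolo's shuffle: picking j strictly below i yields one cycle. */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }

    size_t pos = 0;
    double t0 = now_seconds();
    for (size_t s = 0; s < steps; s++)
        pos = next[pos];  /* each load depends on the previous one */
    double dt = now_seconds() - t0;

    volatile size_t sink = pos;  /* keeps the chase loop from being removed */
    (void)sink;
    free(next);
    return dt * 1e9 / (double)steps;
}
```

With a chain much larger than the last-level cache, the result approaches memory latency. Note that where RAND_MAX is small (32767 in some C runtimes), rand() limits how random a large chain can be, so a better generator would be needed there.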

SergeyKostrov
Valued Contributor II
Hi Patrick,

Thank you and I really appreciate your comments!

Best regards,
Sergey
levicki
Valued Contributor I
Sergey,

If you want to measure bandwidth with more accuracy, you also need to:

- Increase process working set size to be larger than the sum of program+dll size and src/dst/buf size
- Lock the thread affinity so that thread doesn't get swapped to another core (avoid core 0 and logical cores)
- Lock pages in memory to disable swapping (this needs a privilege enabled for the admin account)
- Touch all the pages in src/dst/buf after locking them to eliminate page faults before you start
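
The last two items in the list can be sketched as below. This is a POSIX illustration using mlock(), which is an assumption since the thread is about Windows; there the likely counterparts are SetProcessWorkingSetSize and VirtualLock, with SetThreadAffinityMask for the affinity item. Working-set sizing and affinity are not shown here, and the helper name is hypothetical.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Tries to lock `buf` in physical memory, then touches one byte in every
 * page so that no page fault happens inside the timed loop.
 * Returns the result of mlock(): 0 on success, -1 if locking was refused
 * (for example because of a resource limit). Pages are touched either way. */
int lock_and_touch(void *buf, size_t bytes)
{
    int locked = mlock(buf, bytes);
    long page = sysconf(_SC_PAGESIZE);
    volatile unsigned char *p = buf;

    if (page <= 0)
        page = 4096;  /* fallback guess if the page size is unavailable */

    for (size_t off = 0; off < bytes; off += (size_t)page)
        p[off] = p[off];  /* read and write back: contents are preserved */

    return locked;
}
```

After the measurement, munlock(buf, bytes) releases the lock. A return value of -1 is worth reporting alongside the bandwidth figure, because the pages may then still be swapped out.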

To maximize the bandwidth, you may consider using a blocking+streaming approach:

1. Read 2KB of data from memory into a buffer (64/128 bytes per loop iteration, movaps, use prefetchnta*)
2. Stream those 2KB of data to memory (64/128 bytes per loop iteration, movntps)
3. Repeat until you copy all data

* - Prefetch data using prefetchnta (optimal prefetch distance is best determined by trial and error)
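
The three steps above can be sketched with SSE intrinsics, where _mm_load_ps maps to movaps, _mm_stream_ps to movntps and _mm_prefetch with _MM_HINT_NTA to prefetchnta. This is an illustrative sketch, not the code anyone in this thread actually ran: the function name and the prefetch distance of 512 bytes are assumptions, and the distance would need tuning as the footnote says.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_stream_ps, _mm_prefetch */

#define BLOCK_FLOATS 512u     /* 2 KB block, as in step 1 */
#define PREFETCH_FLOATS 128u  /* 512 bytes ahead: a guess, tune by trial */

/* Copies n floats from src to dst. Both pointers must be 16-byte aligned
 * and n must be a multiple of BLOCK_FLOATS. */
void block_stream_copy(float *dst, const float *src, size_t n)
{
    _Alignas(16) float buf[BLOCK_FLOATS];

    for (size_t base = 0; base < n; base += BLOCK_FLOATS) {
        /* Step 1: read one 2 KB block into the buffer, 64 bytes per
         * iteration, prefetching ahead with the non-temporal hint. */
        for (size_t i = 0; i < BLOCK_FLOATS; i += 16) {
            size_t ahead = base + i + PREFETCH_FLOATS;
            if (ahead < n)
                _mm_prefetch((const char *)(src + ahead), _MM_HINT_NTA);
            _mm_store_ps(buf + i,      _mm_load_ps(src + base + i));
            _mm_store_ps(buf + i + 4,  _mm_load_ps(src + base + i + 4));
            _mm_store_ps(buf + i + 8,  _mm_load_ps(src + base + i + 8));
            _mm_store_ps(buf + i + 12, _mm_load_ps(src + base + i + 12));
        }

        /* Step 2: stream the block to memory with non-temporal stores. */
        for (size_t i = 0; i < BLOCK_FLOATS; i += 16) {
            _mm_stream_ps(dst + base + i,      _mm_load_ps(buf + i));
            _mm_stream_ps(dst + base + i + 4,  _mm_load_ps(buf + i + 4));
            _mm_stream_ps(dst + base + i + 8,  _mm_load_ps(buf + i + 8));
            _mm_stream_ps(dst + base + i + 12, _mm_load_ps(buf + i + 12));
        }
        /* Step 3: the outer loop repeats until all data is copied. */
    }

    _mm_sfence();  /* make the non-temporal stores globally visible */
}
```

On 32-bit x86 builds a compiler flag such as -msse may be needed. An optimizing compiler is free to rearrange the staging through buf, which is one reason hand-written assembly is sometimes preferred for this kind of loop.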

Hope this helps.
SergeyKostrov
Valued Contributor II
Quoting Igor Levicki
...
- Increase process working set size to be larger than the sum of program+dll size and src/dst/buf size
- Lock the thread affinity so that thread doesn't get swapped to another core (avoid core 0 and logical cores)
...


These two notes are the most applicable in my case. Thanks, Igor.

Bernard
Valued Contributor I
>>>Yeah, many utilities raise the priority of the threads doing the memory bw test. Also, if you are doing a single-threaded memory bw test, it can matter on which cpu you are running. On some operating systems, cpu 0 for example, can be pretty busy and the mem bw thread won't get as much time as on other cpus.>>>

If you are on a NUMA system, memory testing gets even more complicated because NUMA distances are involved in memory accesses.
SergeyKostrov
Valued Contributor II
Hi Igor,

This is a short follow-up...

>>...
>>To maximize the bandwidth, you may consider using a blocking+streaming approach:
>>
>>1. Read 2KB of data from memory into a buffer (64/128 bytes per loop iteration, movaps, use prefetchnta*)
>>2. Stream those 2KB of data to memory (64/128 bytes per loop iteration, movntps)

I actually used 4KB steps when prefetching data in a for-loop iteration.

>>3. Repeat until you copy all data
>>
>>* - Prefetch data using prefetchnta (optimal prefetch distance is best determined by trial and error)

I recently did a couple of tests with a function that uses streaming SSE instructions to copy memory blocks, and in the best scenario the performance improvement was ~9% (!). I consider it a very good result, and the function significantly outperformed the standard CRT function memcpy.

Thanks for your notes.