
Question about MIC performance of vector(SIMD/non-temporal) vs regular stores

UDAYANGA_W_
Beginner
3,703 Views
Hi,
Is there any performance improvement with non-temporal stores relative to regular stores on Xeon Phi? And with vector stores relative to regular stores? I ran some tests on this and the results showed otherwise; I hope you can shed some light on this.
 
I created a simple benchmark (attached) that transfers a large array (I used 8MB and 2GB arrays) to a destination memory address using OpenMP threads: 60 threads, one pinned to each core of the Xeon Phi, with the transfer time measured. Three different store instructions were used: regular store, vector (SIMD) store, and vector non-temporal store. The results I observed follow.
 
Test that reads from 'source' and writes to 'dest' (read/write BW)
 
Store type        Time, 8MB (us)   Time, 2GB (us)   BW, 8MB (GB/s)   BW, 2GB (GB/s)
Regular store     62               28912            129.0322581      69.17542889
Vector store      147              48228            54.42176871      41.46968566
Vector NT store   105              33625            76.19047619      59.4795539

 

It looks like vector stores, including non-temporal (NT) stores, are slower and have lower throughput than the regular store. This result is difficult to explain, since vector NT store instructions at least should ideally save bandwidth and produce high throughput when the message size is sufficiently larger than the cache. Is there any reason for this behavior? I would appreciate your feedback on this.

 

28 Replies
jimdempseyatthecove
Honored Contributor III
750 Views

I think some serious work could be done in the area of prefetching and core architecture. This may only apply (at first) to high-end systems.

Compare the complications of implementing TSX and HLE with what I suggest next, and you will not think it outside the realm of possibility. At a compiler-determined point in the code (and/or via a #pragma or intrinsic), a specialty prefetcher thread, invisible to the O/S and user, is activated by the processor. Each hardware thread has its own specialty prefetcher thread. Once activated, it executes in parallel with the code that activated it and runs ahead of the normal thread, but with diminished capacity. It can see and decode all the instructions; however, apart from the instructions needed to produce addresses, the instructions are no-oped except for their cache line fetching. At a closure point in the code, the prefetcher is shut down to conserve power and resources.

This won't necessarily be easy, in light of page faults should they happen. The nice part is prefetching will be performed regardless of TLB misses and such that interfere with an actual memory read. Intel engineers could simulate this to investigate its worthiness (though NIH syndrome may produce some resistance).

Jim Dempsey

jimdempseyatthecove
Honored Contributor III
750 Views

John,

On a different thread on the IDZ forums I suggested that someone experiment with TSX (I do not have such a system).

The idea would be for the shepherd thread to enter a TSX region and perform a memmove of a block to be prefetched that fits in the transaction buffer, then move it back. Exit the transaction and wait for the next request.

Note that the data will be read from RAM and cached, but not written back to RAM. The transaction system will (should) undo (elide) the writes.

Do you have a system with TSX?

Jim Dempsey

TaylorIoTKidd
New Contributor I
750 Views

BKMs: Are there any BKMs from this discussion that can be useful to the community? Casually reading, it seems as if there are: DRAM access & collisions & # of threads executed per core; the relationship between array elements per OpenMP threads; etc.

I encourage you to create a blog that outlines these BKMs in a more concise way. Also, it makes promoting it to the community easier.

DISCUSSION: Great! I've really enjoyed it even as a passive follower.

ASIDE: Jim, your "Chronicles" series is one of the more popular reads on software.intel.com (i.e. it is broadly read).

jimdempseyatthecove
Honored Contributor III
750 Views

Taylor,

>>ASIDE: Jim, your "Chronicles" series is one of the more popular reads on software.intel.com (i.e. it is broadly read).

Thanks for the feedback. The IDZ blogs page has no mechanism for the poster to see traffic on the articles, nor for the community to rank them. As such, it is difficult for me (or other posters, I imagine) to determine whether we are doing a good job. It took a lot of effort to put together that 5-part series, and it would be nice to know if it is being read and appreciated. Some sites do include ratings. Could you try to influence the blogs site manager to add a ranking system?

Regards,

Jim Dempsey

TaylorIoTKidd
New Contributor I
750 Views

Hi Jim,

My thoughts exactly. I discussed this briefly with the person in charge of software.intel.com marketing for the MIC community. He agrees that at the very least, we need to have some way of acknowledging the impact of contributions like yours.

I'll continue pursuing this since it is important. I can't promise that anything will happen soon.

Regards
--
Taylor
 

TimP
Honored Contributor III
750 Views

This thread has come up with interesting information, not necessarily all related to the original question.

If Jim's blogs are getting significant viewing in spite of the difficulty of navigation on that site, I'm impressed; there must be motivated searchers. I found the first 3 parts of Jim's announced 5-part series. I too would be interested to know what topics engage people.

I've been waiting (too long) to see whether anything would come of my efforts on queuing up for approval to post there (prior to my retirement from Intel), with annual revisions in some cases. It didn't occur to me to ask whether my retirement would remove obstacles. I had in the back of my mind the thought that the site has been overhauled without notice every couple of years, so alternate (non-Intel) sites seem more reliable.

TaylorIoTKidd
New Contributor I
750 Views

Tim,

Sheesh. Do you still have them (MIC or otherwise)? Send them to me and I'll get them through the system and out on the proper forum. I'd always wondered why I saw so few articles from you.

--
Taylor
 

Chaitali_C_
Beginner
750 Views
This comment has been moved to its own thread