Question about MIC performance of vector (SIMD/non-temporal) vs regular stores

UDAYANGA_W_ · 06-19-2014

Hi,

Is there any performance improvement with non-temporal stores relative to regular stores on Xeon Phi? And with vector stores relative to regular stores? I ran some tests and the results showed otherwise; I hope you can shed some light on this.

I created a simple benchmark (attached) which transfers a large array (I used 8MB and 2GB arrays) to a destination memory address using OMP threads: 60 threads, one pinned to each core of the Xeon Phi. The transfer time was measured. Three different store instructions were used: regular store, vector (SIMD) store, and vector non-temporal store. These are the results I observed.
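
Editor's sketch: the attached benchmark is not reproduced in this thread, so the following is only a minimal illustration of the three store flavours being compared. It assumes a host x86 CPU and uses 128-bit SSE intrinsics (`_mm_store_ps`, `_mm_stream_ps`) as a stand-in for the 512-bit stores a Xeon Phi build would use. The function names are hypothetical, and the OpenMP threading and core pinning described above are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h> /* _mm_load_ps, _mm_store_ps, _mm_stream_ps, _mm_sfence */
#endif

/* Regular store: one float per iteration. An optimizing compiler may
 * still auto-vectorize this loop, so inspect the generated assembly. */
void copy_regular(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Vector store: four floats (16 bytes) per iteration, written through
 * the cache. Needs 16-byte-aligned pointers and n divisible by 4. */
void copy_vector(float *dst, const float *src, size_t n)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0 && n % 4 == 0);
#if defined(__SSE__)
    for (size_t i = 0; i < n; i += 4)
        _mm_store_ps(dst + i, _mm_load_ps(src + i));
#else
    copy_regular(dst, src, n); /* fallback when SSE is unavailable */
#endif
}

/* Vector non-temporal store: same loop, but the store carries a hint
 * that the destination data should not be kept in the cache. */
void copy_vector_nt(float *dst, const float *src, size_t n)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0 && n % 4 == 0);
#if defined(__SSE__)
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, _mm_load_ps(src + i));
    _mm_sfence(); /* order the streaming stores before later stores */
#else
    copy_regular(dst, src, n); /* fallback when SSE is unavailable */
#endif
}

/* Self-check (hypothetical helper): all three variants must reproduce
 * the source exactly. Returns 1 on success, 0 otherwise. */
int copies_agree(void)
{
    enum { N = 1024 };
    static _Alignas(16) float src[N], a[N], b[N], c[N];
    for (int i = 0; i < N; i++)
        src[i] = (float)i;
    copy_regular(a, src, N);
    copy_vector(b, src, N);
    copy_vector_nt(c, src, N);
    return memcmp(a, src, sizeof src) == 0 &&
           memcmp(b, src, sizeof src) == 0 &&
           memcmp(c, src, sizeof src) == 0;
}
```

Because the "regular" loop can itself be vectorized by the compiler, a comparison like this is only meaningful once the generated instructions for each variant have been checked.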

Test that reads from 'source' and writes to 'dest' (read/write BW):

Store type        Time, 8MB (us)   Time, 2GB (us)   BW, 8MB (GB/s)   BW, 2GB (GB/s)
Regular store     62               28912            129.0322581      69.17542889
Vector store      147              48228            54.42176871      41.46968566
Vector NT store   105              33625            76.19047619      59.4795539
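
Editor's note: the bandwidth columns can be reproduced from the time columns as array size divided by transfer time. The sketch below assumes decimal units (8MB = 8e6 bytes, 2GB = 2e9 bytes) and counts each byte once, which is what matches the posted figures. The function name is hypothetical.

```c
#include <assert.h>

/* Bandwidth in GB/s (1 GB = 1e9 bytes) for `bytes` moved in `micros`
 * microseconds. Each byte is counted once, not once for the read and
 * once for the write. */
double bandwidth_gbs(double bytes, double micros)
{
    assert(micros > 0.0);
    double seconds = micros * 1e-6;
    return bytes / seconds / 1e9;
}
```

For example, `bandwidth_gbs(8e6, 62.0)` gives about 129.03 and `bandwidth_gbs(2e9, 28912.0)` gives about 69.18, matching the regular-store row.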

It looks like vector stores, including non-temporal (NT) ones, are slower and have lower throughput than regular stores. This result is difficult to explain, since vector NT stores at least should ideally save bandwidth and produce high throughput when the message size is sufficiently larger than the cache. Is there any reason for this behavior? I'd appreciate your feedback on this.

jimdempseyatthecove · 06-26-2014

I think some serious work could be done in the area of prefetching and core architecture. This may only apply (at first) to high-end systems.

Considering the complications of implementing TSX and HLE, what I suggest next should not seem out of the realm of possibility. At a compiler-determined point in the code (and/or via a #pragma or intrinsic), a specialty prefetcher thread, invisible to the O/S and the user, is activated by the processor. Each hardware thread has its own specialty prefetcher thread. Once activated, it executes in parallel with the code that activated it and runs ahead of the normal thread, but with diminished capacity. It can see and decode all the instructions; however, apart from the instructions needed to compute addresses, everything is no-oped except for cache line fetching. At a closure point in the code, the prefetcher is shut down to conserve power and resources.

This won't necessarily be easy, given that page faults may happen. The nice part is that prefetching would be performed regardless of TLB misses and the like, which interfere with an actual memory read. Intel engineers could simulate this to investigate its worthiness (though NIH syndrome may produce some resistance).

Jim Dempsey

jimdempseyatthecove · 06-26-2014

John,

On a different thread on the IDZ forums I suggested someone experiment with TSX (I do not have such a system).

The idea would be for the shepherd thread to enter a TSX region, perform a memmove of a block to be prefetched (one that fits in the transaction buffer), and then move it back. It then exits the transaction and waits for the next request.

Note that the data will be read from RAM and cached, but not written back to RAM. The transaction system will (should) undo (elide) the writes.

Do you have a system with TSX?

Jim Dempsey

TaylorIoTKidd · 06-30-2014

BKMs: Are there any BKMs from this discussion that can be useful to the community? Casually reading, it seems as if there are: DRAM access & collisions & # of threads executed per core; the relationship between array elements per OpenMP threads; etc.

I encourage you to create a blog post that outlines these BKMs more concisely. That would also make promoting them to the community easier.

DISCUSSION: Great! I've really enjoyed it even as a passive follower.

ASIDE: Jim, your "Chronicles" series is one of the more popular reads on software.intel.com (i.e. it is broadly read).

jimdempseyatthecove · 07-01-2014

Taylor,

>>ASIDE: Jim, your "Chronicles" series is one of the more popular reads on software.intel.com (i.e. it is broadly read).

Thanks for the feedback. The IDZ blogs page has no mechanism for the poster to see traffic on the articles, nor for the community to rank them. As such, it is difficult for me (or other posters, I imagine) to determine whether we are doing a good job. It took a lot of effort to put together that 5-part series, and it would be nice to know if it is being read and appreciated. Some sites do include ratings. Could you try to influence the blogs site manager to add a ranking system?

Regards,

Jim Dempsey

TaylorIoTKidd · 07-01-2014

Hi Jim,

My thoughts exactly. I discussed this briefly with the person in charge of software.intel.com marketing for the MIC community. He agrees that at the very least, we need to have some way of acknowledging the impact of contributions like yours.

I'll continue pursuing this since it is important. I can't promise that anything will happen soon.

Regards
--
Taylor

TimP · 07-01-2014

This thread has come up with interesting information, not necessarily all related to the original question.

If Jim's blogs are getting significant viewing in spite of the difficulty of navigation on that site, I'm impressed; there must be motivated searchers. I found the first 3 parts of Jim's announced 5-part series. I too would be interested to know what topics engage people.

I've been waiting (too long) to see whether anything would come of my efforts on queuing up for approval to post there (prior to my retirement from Intel), with annual revisions in some cases. It didn't occur to me to ask whether my retirement would remove obstacles. I had in the back of my mind the thought that the site has been overhauled without notice every couple of years, so alternate (non-Intel) sites seem more reliable.

TaylorIoTKidd · 07-01-2014

Tim,

Sheesh. Do you still have them (MIC or otherwise)? Send them to me and I'll get them through the system and out on the proper forum. I'd always wondered why I saw so few articles from you.

--
Taylor

Chaitali_C_ · 05-19-2015

This comment has been moved to its own thread