Hi all,
Actually, I am asking this more out of curiosity than need.
I am using native mode on a Xeon Phi coprocessor card (60 cores). What happens when the cores issue too many memory writes? Normally (to my knowledge) they get queued in one of the 8 GDDR5 memory channels. Is it possible for those queues to overflow? And what happens if/when this occurs?
The situation that raised my curiosity:
I was running STL qsort on a single card (in native mode). When the array structure I use got too big, the process was getting killed. I thought maybe this was a timing issue. Then I implemented a parallel version of it, using a hybrid merge & quick sort. The process still got killed (although it was much faster). The array I was sorting was globally clustered/sorted but needed local sorts. Thinking my parallel quicksort-based implementation would not work efficiently in such a case, I switched to multiple local quick sort operations, and the problem vanished. All I did was reduce the number of swaps and therefore memory writes (I think - because I used malloc to allocate the array structure).
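To give an idea of what I mean, the local-sort version looks roughly like the sketch below (a simplified illustration, not my actual code; the element type, the segment offsets, and the use of OpenMP are placeholders):

#include <algorithm>
#include <cstddef>
#include <vector>

// data is already globally clustered: every element belonging to
// segment i lies in [offsets[i], offsets[i+1]), it just isn't sorted
// inside that range yet. (Element type and offsets are placeholders.)
void sort_clustered(float* data, const std::vector<std::size_t>& offsets)
{
    const long long num_segments =
        static_cast<long long>(offsets.size()) - 1;

    // One independent local sort per segment; nothing ever moves across
    // a segment boundary, so far fewer swaps/writes than one global sort.
    #pragma omp parallel for schedule(dynamic)
    for (long long i = 0; i < num_segments; ++i)
        std::sort(data + offsets[i], data + offsets[i + 1]);
}

Since each local sort stays inside its own pre-clustered segment, elements never move across segment boundaries, which is (I believe) why the number of swaps, and hence memory writes, dropped so much.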
I think I am now facing the same thing in different segments of my code, but I plan to re-implement them efficiently (mostly because it bugs me). Still, I would like to learn what happens when an extreme number of memory I/O operations are queued. Any insight (not necessarily on this specific problem) is appreciated.
Thanks in advance & Thanks for taking the time
Matara Ma Sukoy
Implementations vary, but any processor that actually works must support flow-control that is capable of "backing-up" all the way to the core.
At a very high level on the Xeon Phi (and being very loose with nomenclature):
- If the Memory Controller store buffers are full, they will not accept write transactions from the DTDs.
- (The DTDs are responsible for generating cache coherence transactions and memory requests based on L2 requests.)
- This will cause the DTD store buffers to fill up.
- If the DTD store buffers are full, they will not accept write transactions from the L2 caches.
- This will cause the L2 cache write buffers to fill up.
- If the L2 cache write buffers are full, they will not accept writes from the L1 cache.
- This will cause the L1 cache write buffers to fill up.
- If the L1 cache write buffers are full, they will not accept stores from the core.
- This will cause the core's store buffers to fill up.
- If the core's store buffers are full, store instructions cannot execute.
- On Xeon Phi, other threads may be able to execute instructions.
- Any thread with pending store instructions will simply spin (in hardware) until a store buffer entry opens up -- then it will execute the store and move on to the next instruction.
If not detected, a failure of this protocol would either hang the system or result in incorrect data being written to memory.
If detected, a failure of this protocol would cause a Machine Check Error (MCE) and immediately halt the processor.
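As a purely software analogy (not the actual hardware, and with invented buffer sizes), you can picture this backpressure as a chain of bounded buffers in which a stage only forwards an item when the next buffer has room; once the memory end stops draining, the fullness propagates back upstream and the producer stalls:

// Toy analogy of flow control: a chain of bounded buffers where a stage
// can only pass an item downstream if the next buffer has room. The last
// buffer ("memory controller") never drains in this run, so fullness backs
// up the chain and the producer ("core") stalls. Capacities are invented.
#include <cstddef>
#include <deque>
#include <iostream>
#include <vector>

struct Stage {
    const char* name;
    std::size_t capacity;
    std::deque<int> buf;
    bool full() const { return buf.size() >= capacity; }
};

int main() {
    std::vector<Stage> chain = {
        {"core store buffer", 4, {}},
        {"L1 write buffer",   8, {}},
        {"L2 write buffer",  16, {}},
        {"DTD store buffer", 16, {}},
        {"MC store buffer",  32, {}},
    };

    std::size_t issued = 0, stalled = 0;
    for (int cycle = 0; cycle < 200; ++cycle) {
        // Forward one item per stage per cycle, downstream pairs first,
        // but only if the next stage has room.
        for (std::size_t s = chain.size() - 1; s-- > 0; ) {
            if (!chain[s].buf.empty() && !chain[s + 1].full()) {
                chain[s + 1].buf.push_back(chain[s].buf.front());
                chain[s].buf.pop_front();
            }
        }
        // The core tries to issue one store per cycle.
        if (!chain[0].full()) { chain[0].buf.push_back(cycle); ++issued; }
        else                  { ++stalled; }  // store cannot execute
    }

    std::cout << "stores issued: " << issued
              << ", cycles the core stalled: " << stalled << "\n";
    for (const Stage& s : chain)
        std::cout << s.name << ": " << s.buf.size() << "/" << s.capacity << " full\n";
    return 0;
}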
If your job is being killed, it is highly unlikely to be due to a hardware protocol error. Your description sounds more like the fairly ordinary occurrence of a task being killed due to an out-of-memory error.
I doubt it is a write queue issue; the STREAM benchmark tests would have uncovered this. I suggest you look for a programming error: a buffer overwrite, malloc returning NULL and you not testing for it, using vector operations on unaligned allocations, referencing a buffer after deletion, etc...
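For instance, a minimal allocation check might look like the sketch below (the size is just an example; 64-byte alignment matters because the coprocessor's 512-bit vector loads/stores want cache-line-aligned data):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // Example size only -- replace with whatever your array really needs.
    size_t n = 500UL * 1000UL * 1000UL;       // number of floats
    size_t bytes = n * sizeof(float);

    // posix_memalign gives 64-byte aligned storage suitable for the
    // 512-bit vector unit; _mm_malloc(bytes, 64) is the Intel-compiler
    // alternative. Always test the result before using the buffer.
    void* p = NULL;
    if (posix_memalign(&p, 64, bytes) != 0 || p == NULL) {
        fprintf(stderr, "allocation of %zu bytes failed\n", bytes);
        return EXIT_FAILURE;
    }
    float* data = (float*)p;

    /* ... fill and sort data ... */

    free(data);
    return EXIT_SUCCESS;
}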
Jim Dempsey
Hi Mr. Dempsey,
You are right. I shouldn't be this carefree about system resources anymore, and I should start checking return values from memory allocations. I also happened to look into the QuickThread programming from your link; it won me over when I saw there are 2 separate thread pools (computation and I/O) - I don't know about the difference it makes, though.
---------------------------
Hi Dr. Bandwidth,
Thank you so much. I was wondering "where does this end...". I'd better start paying more attention to tuning & observation.
---------------------------
Thanks for taking the time.
Regards
Matara Ma Sukoy
Matara,
The dual thread pools are particularly effective when some of the threads block on I/O (or on an event). The design has an I/O class that runs at elevated priority. The cooperative way of programming is to set the number of I/O class threads to at least the number of expected blocking situations. The number of compute threads is generally set to the number of available hardware threads. In other words, you are "oversubscribed" by the number of I/O threads. You then write the application with I/O tasks which typically perform very little work and mostly wait for I/O completion, an event signal, or a task from the task scheduler.
A good example of use is a parallel pipeline where the input stage reads from a file, the interior stages are all compute bound, and the output stage writes to a file. Trying to do this in OpenMP, TBB, or Cilk Plus is problematic since all threads are more or less equal. Though TBB has thread arenas and OpenMP has tasks, it is still somewhat of a kludge to divvy up the work efficiently.
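As a bare-bones illustration of the pattern (plain C++11 threads, not the QuickThread API; the file names and the pass-through "compute" stage are placeholders): one reader thread and one writer thread spend most of their time blocked on file I/O, while the compute pool -- one worker per hardware thread -- stays busy on the interior stage, so the process is oversubscribed by exactly the two I/O threads:

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Minimal thread-safe queue: pop() blocks until an item arrives or the
// queue is closed and drained (then it returns false).
template <typename T>
class BlockingQueue {
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    bool pop(T& v) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        v = std::move(q_.front()); q_.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};

int main() {
    BlockingQueue<std::string> input, output;

    // I/O "pool": these two threads mostly block on file I/O.
    std::thread reader([&] {
        std::ifstream in("input.txt");          // file name is a placeholder
        std::string line;
        while (std::getline(in, line)) input.push(line);
        input.close();
    });
    std::thread writer([&] {
        std::ofstream out("output.txt");
        std::string line;
        while (output.pop(line)) out << line << '\n';
    });

    // Compute pool: one worker per hardware thread; they never block on I/O.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;                          // fallback if unknown
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i) {
        workers.emplace_back([&] {
            std::string line;
            while (input.pop(line)) {
                // ... compute-bound transformation would go here ...
                output.push(line);
            }
        });
    }

    reader.join();
    for (auto& w : workers) w.join();
    output.close();                             // compute done; let the writer drain
    writer.join();
    return 0;
}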
QuickThread needs a refresher to support some of the newer hardware features. It hasn't been tested with AVX or AVX-512, and it would be nice to update it for the upcoming TSX and RTM.
Jim Dempsey
