Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Idea for improving performance

softarts
Beginner
345 Views
The SW architecture is like this:

TASK1 (Gbps packet input) --- TASK2 (pre-processing) --- TASK3A, TASK3B, TASK3C, ... (all of them processing engines) --- TASK4 (output)

My questions:

1. The packets come in continually, so I don't think it is necessary to keep the cache HOT? (And I don't know how to do that; all packets are allocated in TASK1, and each TASK runs on a different CPU. It is not like the TBB pipeline mode.)

2. Each input packet is copied to a buffer (allocated from a memory pool with malloc). Is it necessary to use cache_aligned_allocator<> instead of malloc? (Again, the contents of the buffer change quickly and frequently.)
8 Replies
Dmitry_Vyukov
Valued Contributor I
There are two main patterns of pipeline processing: you can move messages along stages, or you can move stages along messages. The preferred pattern depends on the size of the data and code associated with the messages and with the stages. For example, if there is a BIG hash map associated with a stage and the messages are really small, then you had better move your messages along stages, i.e. a particular thread always executes a single stage and keeps the hash map hot in its cache. And if the messages are BIG and the stages contain basically no data, then you had better move stages along your messages, i.e. a particular thread executes ALL stages for a single message in succession and keeps the message hot in its cache.
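A minimal single-threaded sketch of the two shapes (the `Message` type and the trivial stage bodies are made up for illustration; in the real pipeline each marked loop would run on its own thread):

```cpp
#include <numeric>
#include <vector>

// Made-up message type and stage bodies, just to show the two shapes.
struct Message {
    std::vector<int> payload;
    int result = 0;
};

static void stageA(Message& m) {
    m.result += std::accumulate(m.payload.begin(), m.payload.end(), 0);
}
static void stageB(Message& m) { m.result *= 2; }

// Pattern 1 -- move messages along stages: each loop below would be one
// thread that owns a single stage, keeping stage-local data (e.g. a big
// hash map) hot in that thread's cache.
void run_stage_per_thread(std::vector<Message>& msgs) {
    for (auto& m : msgs) stageA(m);  // "thread 1"
    for (auto& m : msgs) stageB(m);  // "thread 2"
}

// Pattern 2 -- move stages along messages: one thread runs ALL stages on
// a message back to back, keeping the (big) message hot in its cache.
void run_message_per_thread(std::vector<Message>& msgs) {
    for (auto& m : msgs) {
        stageA(m);
        stageB(m);
    }
}
```

Both orderings compute the same result; they differ only in which data stays resident in which core's cache.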

Dmitry_Vyukov
Valued Contributor I
Locality and keeping caches hot are always important if you care about performance. A high packet rate does not matter here; locality can still make an order-of-magnitude difference for the execution of a particular stage. Pipelined execution of a single instruction takes about 0.3 cycles, but if it causes a cache miss it can take up to 300 cycles.
A cache-aligned allocator will most likely be of help here: if two successive messages are concurrently processed by two different stages (read: threads) and those messages share a single cache line, it can be a performance disaster (up to two orders of magnitude slowdown in the worst case).
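The false-sharing concern can be sketched like this. `tbb::cache_aligned_allocator<>` gives cache-line-aligned heap allocations; the standard `alignas` used below (with an assumed 64-byte line size) illustrates the same idea without a TBB dependency:

```cpp
#include <cstddef>

// 64 bytes is a typical x86 cache-line size -- an assumption here; real
// code should query it (or just use tbb::cache_aligned_allocator<>).
constexpr std::size_t kCacheLine = 64;

// Aligning and padding each packet buffer to the line size means two
// buffers processed concurrently by two different threads can never
// share a cache line, so no false sharing between them.
struct alignas(kCacheLine) PacketBuf {
    unsigned char data[1500];  // illustrative MTU-sized payload
};

static_assert(sizeof(PacketBuf) % kCacheLine == 0,
              "padded out to a whole number of cache lines");
```

With `tbb::cache_aligned_allocator<PacketBuf>` as the allocator of a pool container, heap-allocated buffers get the same alignment guarantee.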

softarts
Beginner
Quoting - Dmitriy Vyukov
There are 2 main patterns of pipeline processing: you can move messages along stages, or you can move stages along messages. [...]

Thinking about moving messages along stages: since messages arrive continually, the cache is invalidated frequently. Do cache misses really matter in this situation?
Dmitry_Vyukov
Valued Contributor I
Quoting - softarts
Thinking about moving messages along stages: since messages arrive continually, the cache is invalidated frequently. Do cache misses really matter in this situation?

Why not?
Let's assume messages come in at a rate of 1000 msgs/sec. If you have 4 processing stages and each runs on its own CPU/core, then there are 4 cache misses per message (figuratively speaking). So 4000 cache misses per second in total.
If all 4 processing stages for a particular message are done on a single CPU/core, then there is 1 cache miss per message. So 1000 cache misses per second in total.
What makes you think that cache misses do not matter in this particular situation? There is still "processing efficiency" for each and every message, no matter how fast they flow through the program.
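The arithmetic above, written out as a back-of-envelope sketch (all figures are the illustrative ones from this thread, not measurements):

```cpp
// Illustrative numbers from the discussion above.
constexpr long msgs_per_sec = 1000;
constexpr long stages = 4;
constexpr long miss_cost_cycles = 300;  // worst-case miss latency cited earlier

// One thread per stage: the message migrates between cores,
// roughly one cache miss per stage it visits.
constexpr long misses_stage_per_thread = msgs_per_sec * stages;

// All stages fused on one core: the message is missed once,
// then stays cached for the remaining stages.
constexpr long misses_fused = msgs_per_sec * 1;

// Cycles per second saved by fusing, at the assumed miss cost.
constexpr long cycles_saved =
    (misses_stage_per_thread - misses_fused) * miss_cost_cycles;
```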

softarts
Beginner
Quoting - Dmitriy Vyukov
If all 4 processing stages for a particular message are done on a single CPU/core, then there is 1 cache miss per message. [...]


But if a message moves across each task (assume 1 task per thread), then one message has to be flushed from each core's cache.
Dmitry_Vyukov
Valued Contributor I
Quoting - softarts
But if a message moves across each task (assume 1 task per thread), then one message has to be flushed from each core's cache.

Sometimes it's possible to let 1 thread execute several stages. Then there will be no "flushes".

softarts
Beginner
Quoting - Dmitriy Vyukov

Sometimes it's possible to let 1 thread execute several stages. Then there will be no "flushes".


1. When a message moves along each thread, the message will be cached in the L2 cache? (L2 is shared within cores.)
2. When one thread executes several stages, the message is possibly cached in the L1 cache? (Because the stages execute on the same core?)

Dmitry_Vyukov
Valued Contributor I
Quoting - softarts
1. When a message moves along each thread, the message will be cached in the L2 cache? (L2 is shared within cores.)
2. When one thread executes several stages, the message is possibly cached in the L1 cache? (Because the stages execute on the same core?)


Regarding shared caches, it depends. On some CPUs the L2$ is shared, on some the L3$ is shared, and some do not feature shared caches at all. Some CPUs feature inclusive caches, and some exclusive.
Regarding messages, queues between stages are usually FIFO, and a FIFO is the best way to make data cold. So if a message has passed through a FIFO queue, it's reasonable to assume it is no longer in cache.
There is another point: even if the CPU features a shared, inclusive L2$, CPU2 still has to issue an RFO (request for ownership) to CPU1 in order to invalidate the copy in CPU1's L1D$. Performance-wise this is equal to the situation where the message is not in cache at all (an RFO is costly).

So if you execute stage 2 for a message straight after stage 1 and on the same CPU, then you are on the safe side. Otherwise, it highly depends.
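A minimal sketch of that safe case (stage names and the trivial bodies are illustrative, not from the original code): one worker function runs stage 2 immediately after stage 1 for each message, so the message's cache lines are still resident when stage 2 touches them.

```cpp
#include <vector>

struct Msg {
    int value = 0;
};

// Illustrative stage bodies -- the real stages would do packet processing.
static void stage1(Msg& m) { m.value += 1; }
static void stage2(Msg& m) { m.value *= 10; }

// Intended to run as one thread (ideally pinned to one core): stage 2
// executes straight after stage 1 for each message, so the message stays
// in this core's L1D between stages and no RFO traffic is generated.
void process_batch(std::vector<Msg>& batch) {
    for (auto& m : batch) {
        stage1(m);
        stage2(m);
    }
}
```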
