softarts
Beginner
51 Views

idea for improving the performance

The SW architecture is like this:

TASK1 (Gbps packet input) --- TASK2 (pre-processing) --- TASK3A, TASK3B, TASK3C, ... (all of them processing engines) --- TASK4 (output)

My questions:

1. The packets come in continually, so I don't think it is necessary to keep the cache HOT? (And I don't know how to do that anyway; all packets are allocated in TASK1, and each TASK runs on a different CPU, so it is not like the TBB pipeline mode.)

2. Each input packet is copied to a buf (allocated from a memory pool with malloc). Is it necessary to use cache_aligned_allocator<> to replace the malloc? (Again, the content of the buf changes quickly and frequently.)
8 Replies
Dmitry_Vyukov
Valued Contributor I

There are two main patterns of pipeline processing: you can move messages along stages, or you can move stages along messages. The preferred pattern depends on the size of the data and code associated with the messages and with the stages. For example, if there is a BIG hash map associated with a stage and the messages are really small, then you had better move your messages along the stages, i.e. a particular thread always executes a single stage and has the hash map cached in its cache. And if the messages are BIG and the stages contain basically no data, then you had better move the stages along your messages, i.e. a particular thread executes ALL the stages for a single message in succession, and has the message cached in its cache.

Dmitry_Vyukov
Valued Contributor I

Locality and keeping caches hot are always important if you care about performance. A high packet rate does not matter here; locality can still make an order-of-magnitude difference for the execution of a particular stage. Pipelined execution of a single instruction takes ~0.3 cycles, while an instruction that causes a cache miss can take up to 300 cycles.
A cache-aligned allocator will most likely be of help here: if two successive messages are concurrently processed by two different stages (read: threads), and they (the messages) share a single cache line, this can be a performance disaster (up to 2 orders of magnitude slowdown in the worst case).

softarts
Beginner

Quoting - Dmitriy Vyukov
There are two main patterns of pipeline processing: you can move messages along stages, or you can move stages along messages. The preferred pattern depends on the size of the data and code associated with the messages and with the stages. For example, if there is a BIG hash map associated with a stage and the messages are really small, then you had better move your messages along the stages, i.e. a particular thread always executes a single stage and has the hash map cached in its cache. And if the messages are BIG and the stages contain basically no data, then you had better move the stages along your messages, i.e. a particular thread executes ALL the stages for a single message in succession, and has the message cached in its cache.

Thinking about moving messages along stages: since the messages come in consecutively, the cache is invalidated frequently. Does worrying about cache misses still make sense in this situation?
Dmitry_Vyukov
Valued Contributor I

Quoting - softarts
Quoting - Dmitriy Vyukov
There are two main patterns of pipeline processing: you can move messages along stages, or you can move stages along messages. The preferred pattern depends on the size of the data and code associated with the messages and with the stages. For example, if there is a BIG hash map associated with a stage and the messages are really small, then you had better move your messages along the stages, i.e. a particular thread always executes a single stage and has the hash map cached in its cache. And if the messages are BIG and the stages contain basically no data, then you had better move the stages along your messages, i.e. a particular thread executes ALL the stages for a single message in succession, and has the message cached in its cache.

Thinking about moving messages along stages: since the messages come in consecutively, the cache is invalidated frequently. Does worrying about cache misses still make sense in this situation?

Why not?
Let's assume messages come in at a rate of 1000 msgs/sec. If you have 4 processing stages and each runs on its own CPU/core, then there are 4 cache misses per message (figuratively speaking), so 4000 cache misses per second in total.
If all 4 processing stages for a particular message are done on a single CPU/core, then there is 1 cache miss per message, so 1000 cache misses per second in total.
What makes you think that caring about cache misses does not make sense in this particular situation? There is still "processing efficiency" at stake for each and every message, no matter how fast they flow through the program.

softarts
Beginner

Quoting - Dmitriy Vyukov

Why not?
Let's assume messages come in at a rate of 1000 msgs/sec. If you have 4 processing stages and each runs on its own CPU/core, then there are 4 cache misses per message (figuratively speaking), so 4000 cache misses per second in total.
If all 4 processing stages for a particular message are done on a single CPU/core, then there is 1 cache miss per message, so 1000 cache misses per second in total.
What makes you think that caring about cache misses does not make sense in this particular situation? There is still "processing efficiency" at stake for each and every message, no matter how fast they flow through the program.


But if a msg moves across the tasks (assume 1 task per thread), then one msg has to be pulled through each core's cache, invalidating the other copies as it goes.
Dmitry_Vyukov
Valued Contributor I

Quoting - softarts
Quoting - Dmitriy Vyukov

Why not?
Let's assume messages come in at the rate of 1000 msgs/sec. If you have 4 processing stages and each is running on own CPU/core, then there are 4 cache misses per message (figuratively speaking). So 4000 cache misses per second total.
If all 4 processing stages for a particular message are done on a single CPU/code, then there is 1 cache miss per message. So 1000 cache misses per second total.
What makes you think that cache misses do not make sense in this particular situation? There is still "processing efficiency" for each and every message, no matter how fast they flow through the program.


But if a msg moves across the tasks (assume 1 task per thread), then one msg has to be pulled through each core's cache, invalidating the other copies as it goes.

Sometimes it's possible to let 1 thread execute several stages. Then there will be no "flushes".

softarts
Beginner

Quoting - Dmitriy Vyukov

Sometimes it's possible to let 1 thread execute several stages. Then there will be no "flushes".


1. When a msg moves along each thread, will the msg be cached in the L2 cache? (The L2 is shared within the cores.)
2. When one thread executes several stages, is the msg possibly cached in the L1 cache? (Because the stages execute on the same core?)

Dmitry_Vyukov
Valued Contributor I

Quoting - softarts
1. When a msg moves along each thread, will the msg be cached in the L2 cache? (The L2 is shared within the cores.)
2. When one thread executes several stages, is the msg possibly cached in the L1 cache? (Because the stages execute on the same core?)


Regarding shared caches, it depends. On some CPUs the L2$ is shared, on some the L3$ is shared, and some do not feature shared caches at all. Some CPUs feature inclusive caches, and some exclusive.
Regarding messages, the queues between stages are usually FIFO, and FIFO is the best way to make data cold. So if a message is passed through a FIFO queue, it's reasonable to assume that it's not in cache any more.
There is another point: even if the CPU features a shared, inclusive L2$, CPU2 still has to issue an RFO (request for ownership) to CPU1 in order to invalidate the copy in CPU1's L1D$. Performance-wise this is equal to the situation where the message is not in cache at all (an RFO is costly).

So if you execute stage 2 for a message straight after stage 1, and on the same CPU, then you are on the safe side. Otherwise it highly depends.
