Hello, are embeddings shared between multiple streams (CPU_THROUGHPUT_STREAMS)? With benchmark_app, I'm able to run with 56 streams, 56 input requests on a Xeon 8280 with 768GB using a 12GB model, but with a 45GB model I can only run with 4 streams. Any larger and it just reports "Killed" when trying to load the model:
[Step 7/11] Loading the model to the device
If multiple streams indeed use independent embeddings, is there any way to change them to share a single copy of the embeddings, or maybe ideally per-socket copies?
Currently, our OpenVINO documentation does not cover whether embeddings are shared between different streams, but based on our findings we can share some additional information on the potential causes of this behavior.
Looking at the symptoms you shared, this might be caused by a higher effective batch size, which shifts the latency-versus-throughput trade-off as described here: Introduction to the Performance Topics - OpenVINO™ Toolkit
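For context, a back-of-the-envelope sketch of the memory arithmetic behind the "Killed" symptom, using the figures from your post (768 GB RAM, a 12 GB and a 45 GB model). This assumes, hypothetically, that each stream keeps a private copy of the model weights — which is exactly the open question here:

```python
# Back-of-the-envelope memory estimate, ASSUMING each stream duplicates the
# model weights. Figures come from the question: 768 GB RAM, 12 GB / 45 GB models.
RAM_GB = 768

def max_streams(model_gb, ram_gb=RAM_GB, overhead=1.0):
    """Largest stream count whose duplicated weights fit in RAM.

    `overhead` is a hypothetical multiplier for per-stream activation and
    scratch memory on top of the weights themselves.
    """
    return int(ram_gb // (model_gb * overhead))

print(max_streams(12))  # 64 -> 56 streams fit comfortably
print(max_streams(45))  # 17 -> under weight duplication alone; real per-stream
                        #       overhead would push the practical limit lower
```

Note that 45 GB × 4 streams is only 180 GB, so the observed 4-stream limit suggests substantial per-stream overhead beyond the raw weight size; the `overhead` parameter above is a placeholder for that.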
We suggest that you try one of the following solutions:
1) Include the -nstreams command-line parameter.
Try different values of the -nstreams argument, from 1 up to the number of CPU cores, and find the one that gives the best performance. For example, on an 8-core CPU, compare -nstreams 1 (a latency-oriented scenario) with 2, 4, and 8 streams. The benchmark_app automatically queries, creates, and runs the number of inference requests required to saturate the given number of streams.
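As a sketch, one way to script that sweep. The model path and core count below are placeholders, not values from your setup; the generated command lines use standard benchmark_app flags:

```python
# Build benchmark_app command lines for a sweep over -nstreams values.
# MODEL and N_CORES are hypothetical placeholders.
N_CORES = 8
MODEL = "model.xml"

def nstreams_sweep(n_cores):
    """Powers of two from 1 up to the core count, e.g. 1, 2, 4, 8."""
    values, n = [], 1
    while n <= n_cores:
        values.append(n)
        n *= 2
    return values

commands = [
    f"benchmark_app -m {MODEL} -d CPU -api async -nstreams {n}"
    for n in nstreams_sweep(N_CORES)
]
for cmd in commands:
    print(cmd)
```

Running each command and comparing the reported throughput (FPS) then identifies the best stream count for your machine.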
2) Utilize the CPU plugin's throughput configuration keys for high-performance scoring of neural networks.
KEY_CPU_THROUGHPUT_STREAMS - Specifies the number of CPU "execution" streams for the throughput mode. This is an upper bound for the number of inference requests that can be executed simultaneously. All available CPU cores are evenly distributed between the streams. The default value is 1, which implies latency-oriented behavior with all available cores processing requests one by one.
KEY_CPU_THROUGHPUT_NUMA creates as many streams as needed to accommodate NUMA and avoid associated penalties.
KEY_CPU_THROUGHPUT_AUTO creates the bare minimum of streams needed to improve performance; this is the most portable option if you don't know how many cores your target machine has (and what the optimal number of streams would be). Note that your application should provide enough parallel slack (for example, run many inference requests) to leverage the throughput mode.
A non-negative integer value creates the requested number of streams. If the number of streams is 0, no internal streams are created, and user threads are interpreted as stream master threads.
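A minimal sketch of how these keys are passed as plugin configuration. The dictionaries below are plain Python; the commented lines show where they would be handed to the classic Inference Engine Python API (`IECore`), assuming hypothetical model file names:

```python
# Plugin configuration selecting throughput streams on CPU. The config key is
# CPU_THROUGHPUT_STREAMS (the KEY_ prefix belongs to the C++ macros above);
# choose exactly one of the documented values.
config_numa = {"CPU_THROUGHPUT_STREAMS": "CPU_THROUGHPUT_NUMA"}  # streams to fit NUMA topology
config_auto = {"CPU_THROUGHPUT_STREAMS": "CPU_THROUGHPUT_AUTO"}  # portable bare minimum
config_four = {"CPU_THROUGHPUT_STREAMS": "4"}                    # explicit stream count

# Sketch of how the config would be applied at load time (assumes the
# openvino.inference_engine bindings are installed; model files are placeholders):
#
#   from openvino.inference_engine import IECore
#   ie = IECore()
#   net = ie.read_network(model="model.xml", weights="model.bin")
#   exec_net = ie.load_network(net, "CPU", config=config_numa)

print(config_numa["CPU_THROUGHPUT_STREAMS"])
```

Given your interest in per-socket behavior on the Xeon 8280, the NUMA value may be the most relevant one to experiment with.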
3) Discover and Explore Optimal Combination of Streams and Batches with DL Workbench
Intel will no longer monitor this thread since we have provided a solution. If you need any additional information from Intel, please submit a new question.