I am trying to improve my encode performance and want to understand the optimal sequence of Encode() and SyncOperation() calls. The SDK manual notes:
For performance considerations, the application must submit multiple operations and delays synchronization as much as possible, which gives the SDK flexibility to organize internal pipelining. For example, the operation sequence, ENCODE(f1) ENCODE(f2) SYNC(f1) SYNC(f2) is recommended, compared with ENCODE(f1) SYNC(f1) ENCODE(f2) SYNC(f2).
Suppose I have multiple streams A and B to encode simultaneously and I submit them as follows:
- Encode(A1), Encode(B1), Encode(A2), Encode(B2)
My possible SyncOperation sequences are as follows:
#1. Sync(A1), Sync(A2), Sync(B1), Sync(B2)
#2. Sync(A1), Sync(B1), Sync(A2), Sync(B2)
Will the performance be the same? #1 better? #2 better? Does it matter if they are a joined session or not (they are not currently JoinSession'ed)?
For optimal performance you must design your pipelines so that several codec operations are "in-flight" at the same time. As you state, this can be done by invoking several Encode calls before calling Sync.
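To make the "in-flight" idea concrete, here is a minimal simulation of the submit-then-sync pattern. It does not use Media SDK at all: `fake_encode`, `encode_pipelined` and the `depth` parameter are hypothetical names made up for this sketch, a thread pool stands in for the hardware, `pool.submit()` plays the role of Encode() and `Future.result()` plays the role of SyncOperation().

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
import time


def fake_encode(frame):
    """Hypothetical stand-in for the work behind one Encode() call."""
    time.sleep(0.005)  # pretend the encoder needs 5 ms per frame
    return f"bitstream-{frame}"


def encode_pipelined(frames, depth):
    """Keep up to `depth` operations in flight; sync only the oldest one."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    in_flight = deque()  # handles of submitted, not yet synced operations
    output = []
    with ThreadPoolExecutor(max_workers=depth) as pool:
        for frame in frames:
            if len(in_flight) == depth:
                # Plays the role of SyncOperation() on the oldest operation.
                output.append(in_flight.popleft().result())
            # Plays the role of Encode(): returns immediately with a handle.
            in_flight.append(pool.submit(fake_encode, frame))
        while in_flight:  # drain the remaining operations at end of stream
            output.append(in_flight.popleft().result())
    return output
```

Calling `encode_pipelined(range(6), depth=1)` corresponds to the ENCODE(f1) SYNC(f1) ENCODE(f2) SYNC(f2) ordering, while a larger `depth` corresponds to the recommended ENCODE(f1) ENCODE(f2) SYNC(f1) SYNC(f2) ordering. Output order is the same in both cases; only the amount of overlap changes.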
Such task-oriented usage is illustrated both in the Media SDK "sample_encode" sample and the Media SDK tutorial "simple_3_encode_d3d_async" sample.
Regarding the handling of several concurrent streams: to limit implementation complexity, our recommendation is to host each stream pipeline in a separate thread. An example of this usage can be found in the Media SDK tutorial sample "simple_6_transcode_opaque - async - vppresize – multi".
The primary use of "JoinSession" is with software (SW) codecs, to avoid CPU thread oversubscription.
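A rough sketch of the thread-per-stream layout, again as a pure simulation rather than Media SDK code. All names here (`fake_encode`, `stream_pipeline`, `encode_streams`, `depth`) are assumptions for illustration; each thread owns one stream's complete submit/sync loop, so no Sync ordering has to be coordinated across streams.

```python
import threading
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def fake_encode(stream, frame):
    """Hypothetical stand-in for encoding one frame of one stream."""
    time.sleep(0.002)
    return f"{stream}{frame}"


def stream_pipeline(stream, frame_count, depth, results):
    """One stream's full submit/sync loop, meant to run on its own thread."""
    in_flight, output = deque(), []
    with ThreadPoolExecutor(max_workers=depth) as pool:
        for frame in range(1, frame_count + 1):
            if len(in_flight) == depth:
                output.append(in_flight.popleft().result())  # "Sync" oldest
            in_flight.append(pool.submit(fake_encode, stream, frame))  # "Encode"
        while in_flight:  # drain at end of stream
            output.append(in_flight.popleft().result())
    results[stream] = output  # each thread writes only its own key


def encode_streams(stream_names, frame_count, depth=2):
    """Start one pipeline thread per stream and wait for all of them."""
    results = {}
    threads = [
        threading.Thread(target=stream_pipeline,
                         args=(name, frame_count, depth, results))
        for name in stream_names
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

For the two streams in the question, `encode_streams(["A", "B"], 2)` runs the A and B pipelines independently and returns each stream's frames in submission order.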
Thanks for the response. Suppose I have the option of putting 2 operations in flight from either a single stream (2 operations from 1 stream), or from 2 streams (1 operation from each of 2 streams). Will there be a difference in performance?
A related question: will there be a performance benefit to batching Sync calls across concurrent streams?
If I understand your description correctly, there should be no performance difference between the two modes you describe.
Regarding "batching" Sync calls when processing many concurrent streams: as the number of concurrent streams increases, batching Sync calls becomes less important for achieving high performance and optimal GPU utilization. The individual pipelines use different parts of the GPU at different times, and the greater load also keeps the processor in turbo mode consistently.