Please allow me to ask some questions for clarification:
1) Are you using multi-threaded (shared-memory) parallelization?
2) Do you want to parallelize the matrix-vector multiplication by making multiple dspmv calls from different threads?
I cannot think of an easy way to achieve (2). Also, matrix-vector multiplication is usually limited by memory bandwidth rather than by compute, so for your problem sizes it may scale poorly on single-socket systems.
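
For illustration only, here is a minimal sketch of a hand-written alternative to (2): instead of calling dspmv from several threads, each thread computes a block of rows of y directly from the packed array. The function name packed_symv_rows is hypothetical (it is not an MKL or BLAS routine), and I am assuming the upper triangle is packed column by column, as dspmv expects with uplo = 'U'. Each row writes only its own y[i], so there are no write conflicts, but every off-diagonal element is read twice, which adds memory traffic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not an MKL or BLAS routine.
 * Computes y = A*x for a symmetric n-by-n matrix A whose upper triangle is
 * stored column by column in ap (assumed layout: dspmv with uplo = 'U'),
 * so A(i,j) with i <= j lives at ap[i + j*(j+1)/2] (0-based indices). */
void packed_symv_rows(int n, const double *ap, const double *x, double *y)
{
    assert(n >= 0 && ap != NULL && x != NULL && y != NULL);

    /* Each iteration writes only y[i], so rows can be split across threads
     * without write conflicts. If OpenMP is not enabled at compile time,
     * the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;

        /* j < i: A(i,j) = A(j,i), stored in column i at row j. */
        const double *col_i = ap + (size_t)i * (i + 1) / 2;
        for (int j = 0; j < i; ++j)
            sum += col_i[j] * x[j];

        /* j >= i: A(i,j) is stored in column j at row i. */
        for (int j = i; j < n; ++j)
            sum += ap[i + (size_t)j * (j + 1) / 2] * x[j];

        y[i] = sum;
    }
}
```

Whether this beats a single dspmv call would have to be measured. Since the loop is memory-bound, the scaling concern above applies to it as well.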