Thanks for your reply.
I must assume that the routine calling sgemm will be threaded by the user: the expected application is speech recognition on Telecom servers, so it's likely to have 32 to 64 channels running in parallel, each one doing its own sgemm. Every 30 ms I have to compute one 'forward' pass of a neural net (two matrix-matrix multiplications, the typical size being [1 to 4, 200] by [200, 1000] for the first layer).
If I could assume no threading in the calling routine, my standard matrix code would be more than fast enough :-)
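
To make the sizes concrete, here is a minimal sketch of the first-layer product written as a plain single-threaded C loop. The names `layer_forward` and `layer_flops` and the row-major layout are assumptions made for illustration; this is not the actual production code, and it leaves out the bias and activation steps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: C[m x n] = A[m x k] * B[k x n].
 * All matrices are row-major; no threading, no BLAS call. */
void layer_forward(size_t m, size_t k, size_t n,
                   const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < m; ++i) {
        float *c_row = c + i * n;

        /* Clear the output row before accumulating into it. */
        for (size_t j = 0; j < n; ++j)
            c_row[j] = 0.0f;

        /* Walk B row by row so the inner loop reads contiguous memory. */
        for (size_t p = 0; p < k; ++p) {
            const float a_ip = a[i * k + p];
            const float *b_row = b + p * n;
            for (size_t j = 0; j < n; ++j)
                c_row[j] += a_ip * b_row[j];
        }
    }
}

/* Hypothetical helper: floating-point operations for one product,
 * counting one multiply and one add per inner-loop step. */
size_t layer_flops(size_t m, size_t k, size_t n)
{
    return 2 * m * k * n;
}
```

With m = 4, k = 200 and n = 1000, `layer_flops` gives 1,600,000 operations for the first layer. Once every 30 ms that is roughly 53 MFLOP/s for a single channel, and roughly 3.4 GFLOP/s in aggregate if 64 channels each run at that upper size.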
Message Edited by Laurent-Zanoni on 04-20-2004 01:22 AM