I'm using Eigen to test the overhead of BLAS operations inside the enclave. I simply divided two 1GB arrays into small blocks and ran Eigen's dot product on them inside the enclave.
The results show about 46% overhead for exactly the same dot product operation inside the enclave compared to outside it.
I made the block size small enough that it does not cause L3 cache misses. I have used measurement tools to confirm that there are very few cache misses within the enclave. Also, the enclave heap size is smaller than 128MB, so page swapping should not happen.
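For reference, the blocked setup described above can be sketched roughly as follows. This is only an illustration, not my actual benchmark: it uses a plain loop as a stand-in for Eigen's dot product, and the names block_dot, blocked_dot and time_blocked_dot are made up for this post. The timer uses std::chrono, which I am assuming runs on the untrusted side (around the ECALL) when measuring the enclave version.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Dot product over one block [begin, end).
// Stand-in for Eigen's dot product on a segment of each array.
double block_dot(const std::vector<double>& a, const std::vector<double>& b,
                 std::size_t begin, std::size_t end) {
    double sum = 0.0;
    for (std::size_t i = begin; i < end; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Walks both arrays block by block and accumulates the partial results.
// block_elems is the number of elements per block (hypothetical parameter,
// chosen so that two blocks fit comfortably in cache).
double blocked_dot(const std::vector<double>& a, const std::vector<double>& b,
                   std::size_t block_elems) {
    const std::size_t n = std::min(a.size(), b.size());
    if (block_elems == 0) {
        block_elems = n;  // treat 0 as "one single block" to avoid looping forever
    }
    double total = 0.0;
    for (std::size_t begin = 0; begin < n; begin += block_elems) {
        const std::size_t end = std::min(begin + block_elems, n);
        total += block_dot(a, b, begin, end);
    }
    return total;
}

// Times one full pass; returns elapsed seconds and writes the result to out.
double time_blocked_dot(const std::vector<double>& a, const std::vector<double>& b,
                        std::size_t block_elems, double& out) {
    const auto t0 = std::chrono::steady_clock::now();
    out = blocked_dot(a, b, block_elems);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Running the same time_blocked_dot call once against the enclave build and once against the normal build, with identical block sizes, gives the two numbers being compared.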
I've used VTune to profile the execution: almost every function in Eigen's dot product (padd, pmul, ploadu) is slower than when running outside the enclave.
Could you help me figure out what might be causing this slowdown?
As far as I know, the sources of SGX overhead are:
(1) Enclave creation/deletion [excluded in my case]
(2) Cache line encryption when writing to DRAM [very small in my case]
(3) Page swap [none in my case]
(4) Calling system calls outside the enclave [none in my case]
I couldn't match any of those to my case.
Thank you so much!
I figured out what the problem was: I was using DEBUG mode to evaluate SGX's performance.
Once I switched to PRERELEASE mode, the results suddenly became much better.
And now my SGX programs have started to outperform the same programs running outside the enclave, just as discussed in https://software.intel.com/en-us/forums/intel-software-guard-extensions-intel-sgx/topic/721425.
Okay, the reason the SGX program outperforms the outside program is that the SGX C/C++ libs automatically use AVX.
After making the outside program use AVX as well (by simply adding -march=native), the performance of the two is now pretty close.
This means that, without cache miss and page swap overhead (and enclave creation/OCALL overhead), an SGX program does not introduce much noticeable overhead.
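In case it helps anyone hitting the same thing, here is a small sketch for checking which SIMD level a given build was actually compiled for, so the enclave and non-enclave builds can be compared like for like. It relies on the predefined macros that GCC and Clang set from flags such as -mavx or -march=native; the function name simd_level is made up for this example. As far as I understand, Eigen picks its vectorized code paths from these same macros, but treat that as my assumption rather than a statement about Eigen's internals.

```cpp
#include <string>

// Reports the highest SIMD instruction set this translation unit was
// compiled for, based on compiler-predefined macros (GCC/Clang set these
// from -m flags such as -mavx, -mavx2 or -march=native).
std::string simd_level() {
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "scalar or unknown";
#endif
}
```

Printing simd_level() from both builds should show the same value before any timing comparison is trusted. If the outside build reports SSE2 while the enclave build uses AVX, the comparison is skewed in the enclave's favor.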