TL;DR: How are you supposed to profile enclave code? Existing profilers don't work even in debug mode.
We would like to use hardware performance counters (with Linux perf) to optimise the performance of an application running inside an enclave. Unfortunately, even in debug mode we cannot get precise measurements of the hotspots inside the enclave: perf attributes the counters to the __morestack() function, called when entering the enclave, instead of the function executing inside the enclave.
Is there a way, using the Intel SGX SDK, to work around this problem? We also tried Intel VTune but ran into the same issue.
Another problem we faced is that some performance counters appear to be unreliable inside the enclave. We designed a micro-benchmark in which two threads each access their own cache line.
The L2 cache miss ratio drops from 0.0779 without SGX to 0.0014 with SGX, which doesn't make sense: it should be at least as high as without SGX, since running inside an enclave does not reduce cache usage.
The CPI increases from 1.3 to 11.4. Although this seems high, the number is plausible, as the SGX-based program is slower than the non-SGX one (partly due to TLB flushes, which I don't think are taken into account by the retired-instruction counter).
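For readers who want to redo the arithmetic, here is a minimal sketch of how the two derived metrics are computed from raw counter values and how far apart the two runs are. The function names and the example raw counts are hypothetical; only the four ratios (1.3, 11.4, 0.0779, 0.0014) come from the measurements above.

```python
def cpi(cycles, instructions):
    """Cycles per retired instruction."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def miss_ratio(misses, references):
    """Fraction of cache references that missed."""
    if references <= 0:
        raise ValueError("references must be positive")
    return misses / references


# Ratios reported in the post.
baseline = {"cpi": 1.3, "l2_miss_ratio": 0.0779}
sgx = {"cpi": 11.4, "l2_miss_ratio": 0.0014}

# How many times more cycles each retired instruction costs under SGX.
cpi_slowdown = sgx["cpi"] / baseline["cpi"]

# How many times lower the reported L2 miss ratio is under SGX.
miss_ratio_drop = baseline["l2_miss_ratio"] / sgx["l2_miss_ratio"]

# Heuristic from the post's argument: a lower miss ratio inside the
# enclave than outside is treated as a sign the counter is unreliable.
looks_suspicious = sgx["l2_miss_ratio"] < baseline["l2_miss_ratio"]

if __name__ == "__main__":
    # Hypothetical raw counts, only to show how the helpers are called.
    print("example CPI:", cpi(13_000, 10_000))
    print("example miss ratio:", miss_ratio(779, 10_000))
    print("CPI slowdown: %.2fx" % cpi_slowdown)
    print("miss ratio drop: %.1fx" % miss_ratio_drop)
    print("suspicious:", looks_suspicious)
```

So the instructions take almost nine times as many cycles under SGX, while the reported miss ratio is more than fifty times lower, which is the mismatch we are asking about.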
Our CPU is an Intel Xeon CPU E3-1280 v5 @ 3.70GHz, running the Linux kernel v3.19 (Ubuntu 14.04), with the Intel SDK v1.5.
As you can imagine, the lack of performance counter information in debug mode makes it extremely difficult to optimise enclave code. Is this just an issue with perf's SGX support, or is there a more fundamental limitation (e.g. sampling a performance counter requires an interrupt, which results in an asynchronous exit and skews the results)?
In addition to creating the enclave in debug mode when calling sgx_create_enclave, you also need to modify each TCS to set its DBGOPTIN bit.
Here's a previous discussion that talks about some of the process:
and line 192-ish of https://github.com/01org/linux-sgx/blob/85947caa128f9f5c731cb25c3cdc4a4d5f95d6e7/psw/urts/urts_com.h has some of the details.
If you are profiling using VTune then the run-time enables this functionality. You should be able to modify the code if you want it to always allow profiling.
BTW, did you try version 1.6 of the Linux SDK or only 1.5?