Locks and Waits analysis does a lot of extra work to monitor sync objects, I/O waits, and threading APIs. In an application with 1000+ threads, that will have a noticeable impact on performance. My suggestion is to use the Pause/Resume API from the ittnotify library, so that data is collected only for the code region (time period) you are interested in, which reduces the overhead. Read this article.
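To illustrate the Pause/Resume idea, here is a minimal sketch in C. It assumes collection is started in paused mode, and that you build with -DUSE_ITTNOTIFY and link against the ittnotify library. `__itt_pause()` and `__itt_resume()` come from ittnotify.h; the `PROFILE_*` macros, `hot_region`, and `run_with_collection_window` are hypothetical names made up for this example, not part of the library.

```c
#include <assert.h>

/* Define USE_ITTNOTIFY at build time to call the real ITT API.
   Without it, the macros compile to no-ops so the code still builds. */
#ifdef USE_ITTNOTIFY
#include <ittnotify.h>
#define PROFILE_RESUME() __itt_resume()
#define PROFILE_PAUSE()  __itt_pause()
#else
#define PROFILE_RESUME() ((void)0)
#define PROFILE_PAUSE()  ((void)0)
#endif

/* Hypothetical stand-in for the code you actually want to analyze,
   for example the part where threads contend on locks. */
static long hot_region(long n) {
    assert(n >= 0);
    long sum = 0;
    for (long i = 1; i <= n; ++i) {
        sum += i;
    }
    return sum;
}

/* Collect data only while hot_region() runs.
   Assumes the collector was launched in paused mode. */
long run_with_collection_window(long n) {
    PROFILE_RESUME();            /* start collecting here */
    long result = hot_region(n);
    PROFILE_PAUSE();             /* stop collecting again */
    return result;
}
```

Call `run_with_collection_window()` from your own `main()` or worker code. Everything outside the resume/pause pair is not collected, so startup and teardown of the other threads should not add to the analysis overhead.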
Another option is hardware event-based sampling with stack collection enabled. It shows context switches, wait time, etc. for each function, and the Timeline pane shows CPU usage for each thread. However, it gives no CPU time information for sync objects.