The two approaches do not compute collisions with the same quality: tracing rays can miss intersections, while rtcCollide is guaranteed to find all of them.
The slow performance of your rtcCollide implementation is most likely caused by the callback. How do you accumulate the intersections in parallel? Do you filter out self-intersections early? Running a performance analyzer (such as VTune) can help you locate the issue.
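To illustrate the two questions above, here is a minimal sketch of a callback that drops self-intersections before doing anything else and takes a lock only once per batch. It is not taken from your code or from the Embree tutorial. `Collision` is a local stand-in that I assume mirrors the four ID fields of Embree's `RTCCollision`, and `CollisionSink`, `isSelfPair` and `collideCallback` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Local stand-in for Embree's RTCCollision (assumption: same four ID fields),
// declared here so the sketch compiles without the Embree headers.
struct Collision {
    unsigned int geomID0, primID0;
    unsigned int geomID1, primID1;
};

// Hypothetical accumulator that would be passed to the callback as userPtr.
struct CollisionSink {
    std::mutex mutex;
    std::vector<Collision> all;
};

// A primitive reported against itself is not a real collision.
// Real code may also want to reject neighbours that share a vertex or edge.
inline bool isSelfPair(const Collision& c) {
    return c.geomID0 == c.geomID1 && c.primID0 == c.primID1;
}

// Same shape as a collide callback: (userPtr, collisions, count).
// It may be invoked from several threads at once.
void collideCallback(void* userPtr, Collision* collisions, unsigned int count) {
    assert(collisions != nullptr || count == 0);
    auto* sink = static_cast<CollisionSink*>(userPtr);

    // Filter into a local batch first; no lock is held during this loop.
    std::vector<Collision> batch;
    batch.reserve(count);
    for (unsigned int i = 0; i < count; ++i) {
        if (!isSelfPair(collisions[i])) {
            batch.push_back(collisions[i]);
        }
    }
    if (batch.empty()) {
        return;
    }

    // One lock per callback invocation rather than one per collision.
    std::lock_guard<std::mutex> lock(sink->mutex);
    sink->all.insert(sink->all.end(), batch.begin(), batch.end());
}
```

If the single mutex still shows up in the profile, a common alternative is to give each thread its own buffer and merge the buffers after the collide call returns.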
Have a look at the rtcCollide tutorial for example code.