What is your FFT order? Generally speaking,
for small FFT orders (for example, a float complex FFT with order below ~19, depending on the platform's cache size), there is no performance difference between the in-place and out-of-place cases. For such orders, the memory buffer used by the FFT function is small (roughly equal to the vector length) and can stay in cache. The FFT is calculated in that buffer and the result is then copied to the destination, so for in-cache cases it doesn't matter whether the copy goes to the src or the dst vector.
For rather large orders (>19), the in-place version is faster because internally the FFT uses a smaller buffer (less than the input vector length).
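
To get a rough feel for where the in-cache / out-of-cache boundary might fall on your machine, here is a minimal back-of-the-envelope sketch. It assumes that "order" means log2 of the transform length and that one float complex sample takes 8 bytes. The 8 MiB cache size and the helper names are made up for illustration, so substitute your platform's real value. It only estimates the size of one data vector, not the library's actual internal buffer.

```python
# Assumption: "order" is log2 of the transform length, and one
# single-precision complex sample takes 8 bytes (two 4-byte floats).
BYTES_PER_COMPLEX_FLOAT = 8

# Hypothetical 8 MiB last-level cache; replace with your platform's value.
CACHE_BYTES = 8 * 1024 * 1024


def vector_bytes(order: int) -> int:
    """Size in bytes of one float complex vector of length 2**order."""
    return (1 << order) * BYTES_PER_COMPLEX_FLOAT


def fits_in_cache(order: int, cache_bytes: int = CACHE_BYTES) -> bool:
    """True if a single vector of this order fits in the given cache size."""
    return vector_bytes(order) <= cache_bytes


if __name__ == "__main__":
    for order in (16, 19, 20, 22):
        size_mib = vector_bytes(order) / (1024 * 1024)
        print(f"order {order}: {size_mib:g} MiB, fits in cache: {fits_in_cache(order)}")
```

Under these assumptions an order-19 vector is 4 MiB, which is in the range of typical last-level cache sizes and is consistent with the threshold moving from platform to platform.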
Related discussion is also in <<>>