"The old method was to write all the data into one file. That takes about 1.1 seconds to run. I split it into many files and tried to parallelize the process to speed it up. But now it takes 15 seconds." Thank you for sharing these details.
I suspect that more time is now being spent waiting on disk write I/O. You can use the Locks and Waits analysis to collect performance data and check the Wait Time.
Writing to disk frequently from parallel threads usually does not improve performance, because the threads contend for the same disk and their writes are effectively serialized. You could keep the results in memory instead of writing them to disk right away, and dump the data to files in a final stage. (Sorry, I don't know your algorithm in depth.)
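
To illustrate the idea, here is a minimal sketch in Python. Since I don't know your algorithm or language, everything in it is an assumption: `compute_chunk` is a hypothetical stand-in for your per-thread work, and `run`, the chunk count, and the worker count are made up for the example. The only point is the shape: the parallel stage touches memory only, and a single thread does all the disk writing at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_chunk(chunk_id: int) -> bytes:
    """Hypothetical worker: builds its output in memory, no disk access."""
    lines = [f"chunk {chunk_id} row {i}\n" for i in range(1000)]
    return "".join(lines).encode("utf-8")


def run(output_path: str, num_chunks: int = 8) -> int:
    """Compute chunks in parallel, then write them out in one final pass."""
    # Parallel stage: each worker returns an in-memory buffer.
    # pool.map yields results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=4) as pool:
        buffers = list(pool.map(compute_chunk, range(num_chunks)))

    # Final stage: one thread writes all buffers sequentially.
    written = 0
    with open(output_path, "wb") as out:
        for buf in buffers:
            written += out.write(buf)
    return written
```

If you really need many output files, the same pattern still applies: collect the buffers first, then loop over them in the final stage and write one file per buffer from a single thread.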