Sorry to say, there's lots wrong with this code so I'm not surprised that you never saw more than one HW thread executing concurrently in it. Here's a few hints.
The big thing to consider when trying to write a parallel version of algorithms is what bits can be done without interfering with other bits. This dependence analysis helps to figure out what data structures can be safely and concurrently accessed and which cannot. That's where the art of parallel programming lives these days. If you've got the funds, I'd invest in a parallel programming book to learn the basics. I'm sure there are several people on this forum who can suggest specific titles.