It's strange that you get more L2 cache misses even though no modified cache lines are evicted. How much data is shared between the threads? Is it read-only data? Have you checked whether the prefetchers are hurting you?
Maybe it's time to look at absolute numbers instead of ratios. I would start with the L2_LD.* and L2_ST.* events.
Have a look at the event L2_LD.SELF.PREFETCH and compare it with L2_LD.SELF.DEMAND (the misses) and L2_LD.SELF.ANY (both). It might also be insightful to turn the prefetchers off and measure again. (You can do this in the BIOS.)
This might also be a case of "false sharing", i.e. the two threads read and write distinct values that happen to sit on the same cache line. That cache line then has to be transferred back and forth between the cores. You can fix this by moving the data elements at least 64 bytes apart, which ensures that they are on different cache lines.
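To make the false-sharing fix concrete, here is a minimal C++ sketch. It is only an illustration, not your code: the struct names, the `hammer` helper and the `kCacheLine` constant are made up for this example, and the 64-byte line size is an assumption that matches the figure above. The first struct packs two per-thread counters next to each other, the second gives each counter its own cache line with `alignas`.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>

// Assumed cache-line size in bytes; check your CPU's documentation.
constexpr std::size_t kCacheLine = 64;

// Hypothetical layout that invites false sharing: the two counters are
// adjacent, so they will normally land on the same cache line.
struct PackedCounters {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Same counters, but each one is aligned to its own cache line, so a write
// by one thread does not invalidate the line the other thread is using.
struct PaddedCounters {
    alignas(kCacheLine) std::atomic<std::uint64_t> a{0};
    alignas(kCacheLine) std::atomic<std::uint64_t> b{0};
};

// Runs two threads; each thread increments only "its" counter.
// Logically nothing is shared, yet PackedCounters still bounces one
// cache line between the cores.
template <typename Counters>
void hammer(Counters& c, std::uint64_t iterations) {
    std::thread t1([&c, iterations] {
        for (std::uint64_t i = 0; i < iterations; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&c, iterations] {
        for (std::uint64_t i = 0; i < iterations; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
}
```

If you time `hammer` on a `PackedCounters` and on a `PaddedCounters` object with the same iteration count, a noticeably slower packed run would point towards false sharing. How large the gap is depends on the machine and on where the threads are scheduled, so treat it as a hint rather than proof.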
The latest version of the "Intel Performance Tuning Utility" (PTU) can show you, with "memory access", which cache lines you are accessing. PTU is available free of charge at http://whatif.intel.com/ if you have a VTune license.