I've asked some experts for their opinions on this; either I or they will post their thoughts here, hopefully in the next few days.
HOWEVER, that said, my guess is that, depending on the workloads you're running, your data may actually fit in L2 but spill out of L3. (Sounds pretty simple: maybe *too* simple, eh?...)
An engineer I ran this by quickly suggested that less-recently-used data is pushed from L2 into L3; but whether the compiler can influence this, or it's strictly a hardware mechanism, neither of us remembers.
Stay tuned! The experts will weigh in here, and set us all straight.
Thanks jdgallag, I think I now understand the much higher miss rate of the L3 cache. The reason is: when you bring data from memory into the caches, you bring it into both L2 and L3. Since L3 (1M) is only twice the size of L2 (512K), half of L3 is polluted by the time every entry in L2 has been replaced. And because the L2 miss rate is low, when L3 is actually referenced the requested line is very likely brand-new data; most of what sits in L3 is "once upon a time" data that was in L2 long ago and got pushed out by cache misses. So L3's effectiveness is very low. The main reason is that L3 is not much bigger than L2 (only two times bigger), compared with the difference between L1 and L2 (512/20).
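To sanity-check that intuition, here's a toy model I put together: a fully-associative LRU cache at each level, sizes measured in cache lines (512 and 1024 lines, mirroring the 512K/1M ratio above), with lines filled into both L2 and L3 on a memory fetch as described. The access pattern (uniform random over a working set larger than L3) and all the sizes are my own assumptions, not anything measured on real hardware; real caches are set-associative and have prefetchers, so this only illustrates the capacity argument.

```python
import random
from collections import OrderedDict

class LRUCache:
    """Toy fully-associative cache with LRU replacement; capacity in lines."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Return True on a hit, marking the line most-recently used."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        return False

    def fill(self, addr):
        """Insert a line, evicting the LRU line if full; returns the victim."""
        self.lines[addr] = True
        if len(self.lines) > self.capacity:
            victim, _ = self.lines.popitem(last=False)
            return victim
        return None

def simulate(working_set, n_accesses, l2_lines=512, l3_lines=1024, seed=1):
    """Uniform random accesses over `working_set` lines; returns
    (L2 hit rate, L3 hit rate measured only on L2 misses)."""
    rng = random.Random(seed)
    l2, l3 = LRUCache(l2_lines), LRUCache(l3_lines)
    l2_hits = l3_refs = l3_hits = 0
    for _ in range(n_accesses):
        addr = rng.randrange(working_set)
        if l2.access(addr):
            l2_hits += 1
            continue
        l3_refs += 1                     # L3 is referenced only on L2 misses
        if l3.access(addr):
            l3_hits += 1
        else:
            # Miss in both: fetch from memory into BOTH levels, as discussed
            victim = l3.fill(addr)
            if victim is not None:
                l2.lines.pop(victim, None)  # back-invalidate to keep inclusion
        l2.fill(addr)  # L2 victim is simply dropped; a copy sits in L3
    return l2_hits / n_accesses, l3_hits / max(l3_refs, 1)

l2_rate, l3_rate = simulate(working_set=4096, n_accesses=200_000)
print(f"L2 hit rate: {l2_rate:.3f}, L3 hit rate on L2 misses: {l3_rate:.3f}")
```

With a 4096-line working set the L2 hit rate lands near 512/4096, and L3, despite being referenced on every L2 miss, only covers the extra 512 lines it holds beyond L2's contents, so its local hit rate stays low, which matches the "L3 is only twice L2" argument.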
What I still don't understand is my first question: why are there more L3 references than L2 cache misses?