Cache misses are relatively hard to interpret on recent Intel processors because of the aggressive hardware prefetchers.
With STREAM, the number of cache *fills* and cache *writebacks* should be the same for local and remote data, but you have to be careful about whether the counters you are using measure all LLC misses or only the LLC misses due to load/store instructions.
If I recall correctly, performance counter event 2Eh, umask 41h counts only the LLC misses due to loads and stores, not the LLC misses due to hardware prefetches.
(1) You might be able to get the desired numbers by properly programming Event B7h (or BBh) along with the correct bits in MSR 01A6h (or MSR 01A7h). The programming of these bits is very confusing, but there are examples in Table 19-5 of Volume 3B of the Intel 64 and IA-32 Architectures Software Developer's Manual. (I am using document 325384-042.) My understanding is that this is supported by the perf_events subsystem of the Linux kernel starting with version 3.3, but I have not confirmed this. Because this is a core performance counter measurement, you would have to duplicate the measurement on all cores to get the total bandwidth.
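To make that concrete, here is a minimal, untested sketch of how I would expect this to look through the perf_events raw-event interface (assuming the kernel support mentioned above works as advertised): `config` carries the umask/event pair, and `config1` carries the value destined for MSR 01A6h. The `OFFCORE_RSP_BITS` value is a placeholder to be filled in from Table 19-5:

```c
/*
 * Minimal sketch (untested): count LLC misses (event 2Eh, umask 41h) and an
 * OFFCORE_RESPONSE_0 event (B7h, umask 01h, plus MSR 01A6h bits in config1)
 * on one core through perf_events.  OFFCORE_RSP_BITS is a placeholder --
 * take the real bit encoding from Table 19-5 of SDM Volume 3B.
 * Needs root (or a permissive perf_event_paranoid) for system-wide counting.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_raw(uint64_t config, uint64_t config1, int cpu)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config  = config;   /* umask in bits 15:8, event code in bits 7:0 */
    attr.config1 = config1;  /* kernel writes this to MSR 01A6h (or 01A7h) */
    /* pid = -1, cpu = target core: count everything running on that core */
    return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}

int main(void)
{
    const uint64_t OFFCORE_RSP_BITS = 0;          /* PLACEHOLDER (Table 19-5) */
    int fd_llc = open_raw(0x412E, 0, 0);          /* LLC misses, core 0 */
    int fd_off = open_raw(0x01B7, OFFCORE_RSP_BITS, 0);
    if (fd_llc < 0 || fd_off < 0) { perror("perf_event_open"); return 1; }

    sleep(1);                    /* run the code under test here instead */

    uint64_t llc = 0, off = 0;
    if (read(fd_llc, &llc, sizeof(llc)) != sizeof(llc) ||
        read(fd_off, &off, sizeof(off)) != sizeof(off)) {
        perror("read");
        return 1;
    }
    printf("LLC misses: %llu  offcore responses: %llu\n",
           (unsigned long long)llc, (unsigned long long)off);
    return 0;  /* repeat on every core and sum to get total bandwidth */
}
```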
(2) You can measure L3 misses and L3 writebacks using the uncore counters in the L3 CBo boxes. The Xeon E5-4620 is similar to the Xeon E5-2600 series, so you should be able to make use of the information in Intel's document "Intel Xeon Processor E5-2600 Product Family Uncore Performance Monitoring Guide", document 327043-001. Table 2-15 includes the derived metrics required to obtain local and remote read bandwidth and L3 writeback bandwidth. This is not much easier than using the core counters, since there is one CBo per core and you have to read the counters for all of them to get the total bandwidth, but at least it is available using an MSR interface.
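If you would rather drive the CBo counters by hand, the standard trick is the Linux "msr" driver (`modprobe msr`, then read/write `/dev/cpu/N/msr` at the MSR address). The sketch below is a template only: the control/counter MSR addresses and the event encoding are placeholders that must be filled in from document 327043-001 before it does anything useful.

```c
/*
 * Template (fill in before running): program one CBo event-select register
 * and read the corresponding counter via /dev/cpu/0/msr.  All three
 * constants below are PLACEHOLDERS -- take the real per-CBo MSR addresses
 * and event encodings from document 327043-001.  Requires root and the
 * "msr" kernel module.
 */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

static uint64_t rdmsr(int fd, uint32_t addr)
{
    uint64_t v = 0;
    pread(fd, &v, sizeof(v), addr);  /* the msr device is indexed by MSR address */
    return v;
}

static void wrmsr(int fd, uint32_t addr, uint64_t v)
{
    pwrite(fd, &v, sizeof(v), addr);
}

int main(void)
{
    const uint32_t CBO0_PMON_CTL0 = 0;  /* PLACEHOLDER: CBo 0 event select */
    const uint32_t CBO0_PMON_CTR0 = 0;  /* PLACEHOLDER: CBo 0 counter */
    const uint64_t EVENT_SELECT   = 0;  /* PLACEHOLDER: LLC miss/writeback event */

    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    wrmsr(fd, CBO0_PMON_CTL0, EVENT_SELECT);
    uint64_t before = rdmsr(fd, CBO0_PMON_CTR0);
    sleep(1);                           /* run the code under test here */
    uint64_t after = rdmsr(fd, CBO0_PMON_CTR0);
    printf("CBo 0 count: %llu\n", (unsigned long long)(after - before));
    /* There is one CBo per core -- repeat for each and sum for totals. */
    close(fd);
    return 0;
}
```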
(3) You can also measure DRAM bandwidth directly using the performance counters in the integrated Memory Controller (iMC) in the uncore. There are four memory controllers that you have to query, but the performance monitors are in PCI Configuration Space, rather than in MSR space.
I have had success setting these using the "setpci" utility on our RHEL 6.2 Linux systems, and have read the results using both "setpci" and "lspci". It requires root access and is labor-intensive in all cases, but it definitely works. I have not seen any indication that the Linux perf_events subsystem is going to provide access to either the MSR-based uncore events or the PCI configuration-space uncore events any time soon.
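If you want to script this rather than type setpci commands, the same registers are visible through the sysfs image of PCI configuration space. A minimal sketch, with the device path and the register offset left as placeholders to be taken from the tables in document 327043-001:

```c
/*
 * Minimal sketch: read a 32-bit register from PCI configuration space via
 * sysfs, as a scriptable alternative to "setpci".  The device path and
 * register offset are PLACEHOLDERS -- take the iMC PMON bus/device/function
 * numbers and offsets from document 327043-001.  Requires root (sysfs only
 * exposes the first 64 bytes of config space to non-root users).
 */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* PLACEHOLDER: one of the iMC PMON devices */
    const char *dev = "/sys/bus/pci/devices/0000:ff:10.0/config";
    const off_t counter_offset = 0;     /* PLACEHOLDER: PMON counter offset */

    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint32_t value = 0;
    if (pread(fd, &value, sizeof(value), counter_offset) != sizeof(value)) {
        perror("pread");
        return 1;
    }
    printf("iMC register %#llx = %#x\n",
           (unsigned long long)counter_offset, value);
    /* Repeat across all four memory controllers and sum the deltas. */
    close(fd);
    return 0;
}
```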
A final note -- the large L3 caches in the Intel Xeon E5 processors mean that you have to increase the array sizes in the STREAM benchmark to obtain useful (and compliant) results. The STREAM run rules require that each array be 4x as large as the aggregate of the largest caches used. On my Xeon E5-2680 2-socket systems, the L3 is 20 MB per socket, so the array sizes need to be at least 80 MB (N=10,000,000) each for single-socket runs and 160 MB (N=20,000,000) each for runs using both sockets. The default array size of N=2,000,000 results in L3-contained cases for the "Copy" and "Scale" kernels when run on two (or more) sockets, since each array is only 15.26 MiB and the two arrays those kernels touch (about 30.5 MiB) fit in the 40 MB of combined L3. To measure L3 bandwidth, you would want slightly smaller array sizes (so that the kernels using all three arrays don't overflow the L3 caches), while to measure DRAM bandwidth you want much larger arrays to eliminate the possibility of significant L3 cache re-use biasing the results.
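For reference, the sizing arithmetic in a few lines of C (the 20 MB per-socket L3 is the E5-2680 figure from above; substitute your own cache sizes):

```c
/* Minimum compliant STREAM array size: each array must be at least 4x the
 * aggregate of the caches used (here, 20 MB of L3 per socket on Xeon E5-2680). */
#include <stdio.h>

int main(void)
{
    const double l3_per_socket = 20e6;                 /* bytes */
    for (int sockets = 1; sockets <= 2; sockets++) {
        double min_bytes = 4.0 * l3_per_socket * sockets;
        printf("%d socket(s): >= %.0f MB per array (N >= %.0f doubles)\n",
               sockets, min_bytes / 1e6, min_bytes / sizeof(double));
    }
    return 0;
}
```

This prints 80 MB (N=10,000,000) for one socket and 160 MB (N=20,000,000) for two, matching the numbers above.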