Detailed documentation for hardware performance monitors is mostly non-existent, but brief descriptions are relatively easy to find.
In the directory tree where the "Amplifier XE" product is installed, look for a file named "snb_db.txt". I can't remember the name of the subdirectory where I found it, but the name ought to be relatively stable. This file is a text file with very wide lines that contains the information needed to map from the event names used by VTune to the event numbers, masks, and other configuration bits that are described in the hardware documentation (details below).
In the "snb_db.txt" file, the third item in each line is the event name. Search for the event name you are interested in, then get the Event Number from the first entry in the line and the Umask from the second entry in the line. Some/many entries will have auxiliary information in later columns, but it is probably best to start with the easy cases.
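The lookup described above can be sketched in a few lines of Python. This is only a sketch under assumptions: I am assuming the fields are tab-separated and that the first two fields are hexadecimal strings, neither of which I have verified against a real "snb_db.txt". The function name `find_event` and the sample lines are made up for illustration.

```python
def find_event(lines, event_name, delimiter="\t"):
    """Return (event_number, umask) for event_name, or None if not found.

    Assumption (not verified against a real file): fields are separated by
    `delimiter`, the event name is the third field, and the first two
    fields are hexadecimal strings.
    """
    for line in lines:
        fields = [f.strip() for f in line.split(delimiter)]
        if len(fields) >= 3 and fields[2] == event_name:
            return int(fields[0], 16), int(fields[1], 16)
    return None


# Made-up sample lines in the assumed layout, not copied from a real file.
sample = [
    "0x3C\t0x00\tCPU_CLK_UNHALTED.THREAD_P\textra columns",
    "0x80\t0x02\tICACHE.MISSES\textra columns",
]

event, umask = find_event(sample, "ICACHE.MISSES")
print(f"event=0x{event:02X} umask=0x{umask:02X}")
```

With a real file you would pass the open file object instead of `sample`, and adjust `delimiter` to whatever the file actually uses.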
Now that you have the Event Number and Umask, you get to look up the event description in the hardware documentation. The performance counter events are described in the document called "Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3: System Programming Guide" (document 325384, revision 047, June 2013). You should search for this on the Intel web site so that you are assured of getting the most recent version.
The performance monitoring infrastructure is described in chapter 18, while the performance monitoring events are described in chapter 19.
For the Xeon E3 (Sandy Bridge), the performance monitoring events are described in section 19.4 -- mostly in Table 19-7, but some additional information is in Tables 19-8, 19-9, and 19-10, depending on exactly which Xeon E3 processor you are dealing with. Most of the Xeon E3s are referred to as DisplayFamily_DisplayModel 06_2AH, but the high-end models with four memory channels are referred to as 06_2DH (same as Xeon E5).
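If you are not sure which DisplayFamily_DisplayModel your processor reports, the value can be derived from the EAX result of CPUID leaf 01H. The sketch below applies the standard family/model decoding rules to an EAX value you have already obtained by some other means (it does not execute CPUID itself). The function name is hypothetical, and the two sample EAX values are illustrative inputs chosen to decode to the two signatures mentioned above.

```python
def display_family_model(eax):
    """Format a CPUID.01H:EAX value as a DisplayFamily_DisplayModel string."""
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family field is added only when the base family is 0FH.
    if base_family == 0xF:
        family = base_family + ext_family
    else:
        family = base_family

    # The extended model field applies only to base family 06H or 0FH.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model

    return f"{family:02X}_{model:02X}H"


# Illustrative EAX values (assumed inputs, not read from hardware here).
print(display_family_model(0x000206A7))  # 06_2AH
print(display_family_model(0x000206D7))  # 06_2DH
```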
The performance monitoring events are sorted by Event Number in the hardware documentation. (The names for the events are often the same as those used by VTune, but are not always the same, and the line breaks in the PDF tables can make it frustrating to search by name.)
As an example, if you are interested in the event that VTune refers to as "ICACHE.MISSES", you look in the "snb_db.txt" file and find it on line 28. The first entry in that line tells you that the event number is 0x80 and the second entry tells you that the Umask is 0x02, so you go to Table 19-7 of Vol 3 of the SW Developers Guide and scroll down until you reach the entry with "80H" in the first column. There is only one entry for this event, which fortunately has the correct Umask "02H". In this case the event name is the same in VTune and in the HW document, and the event description is:
"Number of Instruction Cache, Streaming Buffer and Victim Cache Misses. Includes UC accesses."
Most of the time that is all you are going to get. Sometimes there are discussions of the details of particular events in this forum, but more often you have to combine the description from Vol 3 of the SW developer's guide with the description of the microarchitecture in the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966, version 028, July 2013). The Sandy Bridge microarchitecture is presented in section 2.2. The description has four important parts:
- The "instruction cache" part of the description is fairly straightforward and is discussed in section 2.2.2.
- The "streaming buffer" is not referred to by that name, but appears to correspond to the "Micro-op Queue and Loop Stream Detector (LSD)" of section 2.2.2.4.
- I did not find any references to the "victim cache" in the discussion of the Sandy Bridge architecture. Sometimes that means that the feature is described in the context of an earlier architecture, sometimes the feature is not actually present but the documentation incorrectly cut-n-pasted the description from an earlier processor description that did have the feature, and sometimes the feature is present but there is no additional documentation anywhere.
- "Includes UC accesses" means that this event also counts instruction loads from uncached address ranges. It is arguable whether these should be called "cache misses", since the corresponding addresses are not allowed to be cached, so it is very helpful when descriptions clarify these ambiguous cases. In this case I would expect no uncached instruction loads once the processor has exited the boot process, but there could be use cases in system management code that I am not aware of. (E.g., a machine check error handler might decide that the instruction cache could be corrupted and load code directly from an uncached region of ROM.)
Even after cross-referencing all the published documents, I typically find that I have to write carefully controlled microbenchmark codes to understand the hardware performance counter events in detail. Sometimes I write them up for this forum, but more often I don't have the opportunity to do a sufficiently comprehensive review.
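For anyone setting up such a microbenchmark, the Event Number and Umask eventually have to be packed into a counter control value. Here is a minimal sketch, assuming the architectural IA32_PERFEVTSELx layout from chapter 18 of Vol 3 (event select in bits 7:0, Umask in bits 15:8, USR in bit 16, OS in bit 17, EN in bit 22). The function name and its defaults are my own choices for illustration, and the sketch ignores the remaining fields (edge detect, invert, counter mask, and so on).

```python
def perfevtsel(event, umask, usr=True, os=True, enable=True):
    """Pack an event number and Umask into an IA32_PERFEVTSELx-style value.

    Only the event select, Umask, USR, OS, and EN fields are handled;
    all other fields are left at zero.
    """
    value = (event & 0xFF) | ((umask & 0xFF) << 8)
    if usr:
        value |= 1 << 16  # count while in user mode
    if os:
        value |= 1 << 17  # count while in kernel mode
    if enable:
        value |= 1 << 22  # enable the counter
    return value


# ICACHE.MISSES from the example above: event 0x80, Umask 0x02.
print(hex(perfevtsel(0x80, 0x02)))  # 0x430280
```

How the value actually gets written to the MSR depends on your operating system and tools, and that part is outside this sketch.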