I was wondering whether the L1, L2 and L3 caches on Haswell/Broadwell and Skylake are physically or virtually indexed. Is this information published somewhere?
I believe L1 and L2 are virtually indexed, whereas L3 is physically indexed.
This allows the CPU to fetch L1 cache lines without having to do a TLB lookup first.
I don't know where this is officially documented, but there are some details here regarding Sandy Bridge. I believe it's still the same in Haswell/Skylake.
There is no difference between virtual and physical indexing for the L1 Data Cache on most Intel processors, since the 32 KiB L1 Data Cache is exactly one 4KiB page "tall" by 8 wide (via associativity), so none of the bits used to index into the cache structure are translated. This means the cache access can get started before the address translation is complete, and the 1:1 mapping means that no additional complications have been added. Physical tagging is required at all levels to avoid horrible difficulties in dealing with aliasing.
I would be surprised if the L2 unified cache were virtually indexed, but I suppose it is possible. In Nehalem/Westmere, Sandy Bridge/Ivy Bridge, and Haswell/Broadwell, the private, unified L2 cache is 256 KiB and 8-way associative. This means that it is 8 4KiB pages "tall" by 8 "ways" wide. If the cache is virtually indexed, then there is a possibility of aliasing -- virtual addresses in any (or all) of the 8 "page slots" might map to the same physical address, and the hardware would need to ensure that these cases are handled correctly. It is much easier to implement a physically indexed cache at this level -- there is plenty of time to perform the address translation before the L2 is accessed, and no matter what the virtual address, all mappings to a particular physical address will always access the same congruence class in the L2 cache.
It is easy enough to test whether the L2 unified cache is virtually or physically indexed if your OS has a function to map virtual addresses to physical addresses. (In Linux, this is provided through the /proc/<pid>/pagemap interface.) Allocate a large array on 4KiB pages, then find 9 pages whose virtual addresses map to the same location in the cache (virtual address modulo 32KiB), but whose physical addresses map to different locations (physical address modulo 32KiB). If the cache is virtually indexed, you won't be able to hold all 9 of those pages in the L2 at the same time, while if the cache is physically indexed the pages will go into different locations in the L2 cache and will all fit.
AMD's Opteron (K8) processor had a 64KiB, two-way associative L1 Data Cache that was virtually indexed. The cache was 8 pages "tall" by 2-way associative, so aliasing was a problem that needed to be dealt with. A couple of sneaky tricks were used to allow the cache controller to determine whether aliasing was present without requiring excessive cache tag reads. I don't remember the details, but it was definitely a two-step procedure, with the virtually indexed line being processed as fast as possible and the (almost always null) indication of aliasing showing up a cycle or two (??) later.