It is usually stated that the L1D latency is around 4 cycles. How are those 4 cycles utilized?
1c: Calculate effective address
1c: Send request from core to cache
1c: Do cache access
1c: Send data back to core pipeline
Is there any available information on that?
The 4-cycle count is based on standard latency measurements. By 'standard' I mean load-to-use, with a dependent chain of loads and an array that fits in L1d. The prefetchers need to be disabled in the BIOS if you use a fixed stride such as 64 bytes, which the prefetchers can latch on to. Turbo mode also needs to be disabled (again, you can do this from the BIOS) so that the core runs at a fixed, known frequency; otherwise you won't get the 4-cycle count.
So the 4-cycle count is 'load to use': the time from when the load is issued until its data is received. 'Dependent load' means that the address of the next load is contained in the data returned by the current load, so the next load cannot start until the current one completes.
In my tools, I use latency kernels based on Calibrator by Stefan Manegold: http://www.cwi.nl/~manegold/
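To make the 'dependent load' idea concrete, here is a minimal pointer-chasing sketch in C. It is not Calibrator's code and not my actual kernels; the names build_chain, chase and ns_per_load are hypothetical, and the timing uses the standard clock() function, which is coarse but adequate over many millions of loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Build a cyclic chain of n slots spaced `stride` bytes apart.
   Each slot holds the address of the next slot, so every load
   needs the data of the previous load before it can issue. */
static void **build_chain(size_t n, size_t stride) {
    /* Each slot must be able to hold an aligned pointer. */
    assert(n > 0 && stride >= sizeof(void *) && stride % sizeof(void *) == 0);
    char *buf = malloc(n * stride);
    if (buf == NULL) return NULL;
    for (size_t i = 0; i < n; i++) {
        void **slot = (void **)(buf + i * stride);
        *slot = buf + ((i + 1) % n) * stride;  /* last slot wraps to the first */
    }
    return (void **)buf;
}

/* Follow the chain for `steps` dependent loads and return the final slot. */
static void **chase(void **p, size_t steps) {
    for (size_t i = 0; i < steps; i++)
        p = (void **)*p;  /* the next address comes from the loaded data */
    return p;
}

/* Volatile sink so the compiler cannot discard the result of the chase. */
static void *volatile sink;

/* Average time per dependent load in nanoseconds (CPU time via clock()). */
static double ns_per_load(void **chain, size_t steps) {
    clock_t t0 = clock();
    sink = chase(chain, steps);
    clock_t t1 = clock();
    return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)steps;
}
```

For example, build_chain(256, 64) gives a 16 KiB chain, which fits in a 32 KiB L1d (an assumed cache size), and ns_per_load(chain, 100000000) then averages over 100 million loads. Multiplying the result by the fixed core clock in GHz converts it to cycles: at an assumed 3 GHz, 4 cycles would show up as roughly 1.33 ns per load. Compile with optimization so that loop overhead stays off the critical path.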
Your 4 steps correspond to the load-to-use scenario.