How is data fetched from memory into the CPU?
If a load misses in the L3 cache, is the data fetched from memory directly into the CPU register, or from memory into the L3 cache and then into the register?
The question contains assumptions about behavior that is not specified by the architecture, so any direct answer could (and almost certainly would) be misleading.
The topic can be addressed, but the question must be posed a bit differently.
A load instruction with a register argument will certainly put the data in the register. If the address being loaded is of the "WriteBack" memory type, then a copy of the cache line containing the address will be placed in the L3 cache. In most cases, a copy of the cache line will also be placed in the L1 Data Cache and the L2 cache, but there are implementations that can skip putting the data in the L2 cache.
The data can be delivered to the core and the caches in any order that is convenient. An implementation could put the data in the L3 cache, then copy it to the L2 cache, then copy it to the L1 Data cache, then copy it to the register. Or an implementation could do all of those operations concurrently, with no certainty about the order in which they would actually be completed. Or an implementation could do something in between, or it could do entirely different things at different times.
What an architecture guarantees is that it obeys an "ordering model". The ordering model is written in terms of loads and stores executed by different processes on different processors. An introduction to ordering models, with special attention to the ordering model used by Intel, is at http://15418.courses.cs.cmu.edu/spring2015/lecture/consistency.
Note that the ordering model does not discuss caches in any specific terms, and it says (almost) nothing about how the ordering model is implemented. This is on purpose. The ordering model specifies certain aspects of the behavior that can be observed by users. There are often several different ways to implement the desired ordering model, and different processor models can have very different implementations. The different implementations have different trade-offs of complexity, power consumption, performance, etc. By specifying a memory ordering model, users can write code that is correct on more than one processor model, and vendors can decide how they want to implement that functionality.