- Parallel Computing
Intel does not document the details of its coherence protocols, but the ordering model is described in some detail in Section 8.2 of Volume 3 of the Intel Architectures Software Developer's Manual (Intel document 325384).
The details of the implementation of the coherence protocol (and its associated ordering model) differ across processor models, and can differ based on configuration settings (e.g., Home Snoop vs Early Snoop on Haswell EP, or 1-socket vs 2-socket vs 4-socket configurations). In addition, some coherence transactions can be handled in different ways, with the choice made dynamically at run time.
The most interesting details usually pop up in the Uncore Performance Monitoring Guides for the various processor models. Understanding the implications of these details requires significant experience in microarchitecture and microbenchmarking, though there are many aspects of Intel's implementations that are probably too hidden to be fully understood by anyone who does not have access to the actual processor RTL (or whatever high-level-design language Intel uses these days).
The short answer to #1 is that coherence processing is serialized at the various coherence agents. For example, if two cores execute a store instruction in the same cycle and both miss in their L1 and L2 caches, then the transaction will go to the L3 slice (or CHA in some processors) responsible for that address, which will process the incoming requests sequentially. One or the other will "win" and will be granted exclusive access to the cache line to perform the store. During this period, the request from the "losing" core will be stalled or rejected, until eventually the first core completes its coherence transaction and the second core's transaction is allowed to proceed. There are many ways to implement the details, and Intel almost certainly uses different detailed approaches in different processor models.
The short answer to #2 is that also that the transactions will be serialized. Either the owning processor will complete the store, then the line will be transferred to the reading processor, or the cache line will be transferred away from the owning processor to the reading processor first, then returned to the (original) owning processor to complete the store. Again, there are many ways to implement the low-level details, and Intel processors implement several different approaches. If I am reading between the lines correctly, recent Intel chips will use different transactions for this scenario depending on how frequently it happens to a particular address. (E.g., See the description of the HitMe cache in the Xeon E5/E7 v3 processor family uncore performance monitoring guide, document 331051.)