Your picture of how MPI_Send and MPI_Recv work is slightly off. By default, MPI_Send only has to copy the data (plus some bookkeeping information) somewhere safe before it returns; it does not have to wait for the matching MPI_Recv. How that is done is up to the implementation. One variant of MPI_Send is MPI_Ssend, a synchronous send that does not return until the matching receive has started. The Intel® MPI Library does not force that behavior for MPI_Send. Instead, Intel MPI normally uses a buffered send (at least for messages small enough to be sent eagerly): the data is copied to a buffer for later delivery, and the call returns as soon as the data is ready for a later MPI_Recv. If you need synchronous behavior, use MPI_Ssend instead of MPI_Send.
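However, for performance reasons, I recommend sticking with MPI_Send and using tags to tell the messages apart. If you assign a unique tag to each send/receive pair, a receive can only match the send that carries the same tag (as long as the receive names that tag rather than MPI_ANY_TAG), so the pairs are forced to match.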