Intel® Fortran Compiler

Behaviour of caching with OpenMP

Amorin
Novice

Hi,

 

As far as I understand, when a thread has cached a variable and modifies it, only its own cached copy is updated. All other cached copies in other threads are then invalidated, but not synchronised. Only if another thread tries to access the same variable (or another variable on the same cache line) will it be synchronised. First question: is this right?

 

So if parallel sections are well designed, and threads are working on variables far enough apart that they don't end up on the same cache line, then at the end of the parallel section there could be many changed cached copies that have not yet been written back to main memory.

 

So to the main question: when a parallel section ends (!$OMP END PARALLEL), are all caches synchronised back to main memory? Does this imply a barrier?
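To make it concrete, here is a minimal sketch of the kind of situation I have in mind (the array name and sizes are just for illustration):

program end_parallel_question
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  real, allocatable :: x(:)
  integer :: tid
  allocate(x(n))
  x = 0.0
  !$omp parallel private(tid) shared(x)
    tid = omp_get_thread_num()
    x(tid*1000 + 1) = real(tid)   ! each thread writes far away from the others
  !$omp end parallel
  ! is every thread's update guaranteed to be visible to the serial code here?
  print *, x(1), x(1001)
end program end_parallel_question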

 

One more question: what handles the caches and cache coherence? The hardware, the OS, the Fortran runtime library, something else?

 

Alexandre.

jimdempseyatthecove
Honored Contributor III

Consider reading this: FLUSH Directive (intel.com)

An issue to consider that isn't apparent is that within a code section, a variable may be brought into a register, and that register can then be reused within the section without re-reading the variable. Similar behavior can occur with multiple writes to the same variable; in other words, the register is written back only after the last write/update, as the program exits the code section. The FLUSH directive assures that the register copies are written out with the appropriate memory write, which inserts the value into the appropriate cache line, and which may or may not be written directly to RAM. It is undisclosed whether FLUSH also issues a CPU instruction to assert that the cache line is flushed to RAM.
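As an illustration of that register effect, here is a minimal sketch (my own, not taken from the linked article; the variable names are made up, and in production code the flag should also be protected with !$OMP ATOMIC) where FLUSH forces a value out of a register so another thread can observe it:

program flush_demo
  use omp_lib
  implicit none
  integer :: flag, val
  flag = 0
  val  = 0
  !$omp parallel num_threads(2) shared(flag, val)
  if (omp_get_thread_num() == 0) then
     val = 42                  ! produce the data
     !$omp flush(val)          ! push VAL out of the register into memory
     flag = 1
     !$omp flush(flag)         ! then publish the flag
  else
     do                        ! spin until the flag becomes visible
        !$omp flush(flag)      ! forces a re-read from memory, not from a register
        if (flag == 1) exit
     end do
     !$omp flush(val)          ! make sure the up-to-date VAL is read
     print *, 'val = ', val
  end if
  !$omp end parallel
end program flush_demo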

Note, unnecessary cache line flushing can be detrimental to performance.

Jim Dempsey

Amorin
Novice

Thanks, so FLUSH is implied at the end of a parallel section, which may explain some overhead at the end of it.

Maybe we should consider NOWAIT everywhere and explicitly flush variables one at a time on an as-needed basis, just before they are next used.
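Something along these lines (a rough sketch; the names are made up, and whether it is actually safe is part of my question):

subroutine two_loops(a, b, n)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: a(n), b(n)
  integer :: i
  !$omp parallel private(i)
    !$omp do
    do i = 1, n
       a(i) = 2.0 * a(i)
    end do
    !$omp end do nowait          ! skip the barrier after this loop
    !$omp do
    do i = 1, n
       b(i) = b(i) + 1.0         ! does not touch A, so skipping the barrier seems safe
    end do
    !$omp end do nowait
    !$omp flush(a)               ! flush A explicitly, just before it would be needed again
    ! ... next use of A would go here ...
  !$omp end parallel
end subroutine two_loops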

 

jimdempseyatthecove
Honored Contributor III

NOWAIT requires that the code that follows is NOT going to reference data that may still be being generated in the parallel region. So be careful in its use, i.e. be correct.

Jim Dempsey

Amorin
Novice

Is there anything other than flushing (and a barrier) that is skipped by NOWAIT?

jimdempseyatthecove
Honored Contributor III

The design purpose of NOWAIT is to permit the threads of a parallel region to pass through an implicit barrier within the region. See the second example in: OpenMP* Examples (intel.com)

subroutine do_2(a,b,c,d,m,n)
  real a(n,n), b(n,n), c(m,m), d(m,m)
  !$OMP PARALLEL SHARED(A,B,C,D,M,N) PRIVATE(I,J)
    !$OMP DO SCHEDULE(DYNAMIC,1)
      do i = 2, n
        do j = 1, i
          b(j,i) = ( a(j,i) + a(j,i-1) ) / 2.0
        end do
      end do
    !$OMP END DO NOWAIT
    !$OMP DO SCHEDULE(DYNAMIC,1)
      do i = 2, m
        do j = 1, i
          d(j,i) = ( c(j,i) + c(j,i-1) ) / 2.0
        end do
      end do
    !$OMP END DO NOWAIT
  !$OMP END PARALLEL 
end

Unfortunately the description isn't descriptive enough.

Note, the !$OMP DOs are contained within a parallel region. NOWAIT is not applicable to !$OMP PARALLEL DO.

As structured above, all threads begin the first loop, picking one index of the i loop on a first-come, first-served basis. When the pick sequence is exhausted, NOWAIT permits those threads to progress to the second loop and start picking their iterations of the second loop.

Jim Dempsey

Ron_Green
Moderator

@Amorin you really are asking several questions here. 

One topic is "false sharing", where 2 processors, each with its own L1 cache, are modifying the same cache line. Yes, that is a huge performance hit. This can occur if 2 or more threads are simultaneously modifying an array in unit stride (stride 1). Don't do this.
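Here is a minimal sketch (names are mine, purely for illustration) of the classic case: each thread updates only its own counter, but the counters sit next to each other in memory and therefore share cache lines, so the line ping-pongs between cores on every update. The fix is to pad each counter out to its own cache line, or better, accumulate into a private variable.

program false_sharing_demo
  use omp_lib
  implicit none
  integer, parameter :: maxthreads = 64
  integer :: counters(maxthreads)        ! adjacent integers share cache lines
  integer :: tid, i
  counters = 0
  !$omp parallel private(tid, i)
  tid = omp_get_thread_num() + 1
  do i = 1, 10000000
     counters(tid) = counters(tid) + 1   ! every update dirties a line the other threads also hold
  end do
  !$omp end parallel
  print *, sum(counters)
end program false_sharing_demo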

 

Next you have to consider how you wish to divide up the array elements and let each thread take a division, or a BLOCK. Fortunately, all OpenMP runtimes do this for you by default, so you don't have to do it manually. Consider this:

 

!$OMP PARALLEL DO
        DO I=1,N
            B(I) = (A(I) + A(I-1)) / 2.0
        END DO
 !$OMP END PARALLEL DO

 

 

If you have 4 threads working on this loop, each modifying B(I), you don't want the 4 threads working on the iterations like this:

 

thread 1, works I=1, 5, 9 ...
thread 2, works I=2, 6, 10 ...
thread 3, works I=3, 7, 11 ...
thread 4, works I=4, 8, 12 ...

 

False sharing, right?  B in each L1 cache gets pulled into a cache line together with neighboring elements, which the other 3 threads want to use. So you do nothing but set dirty bits, sync the caches, let the next thread dirty the cache line, sync, repeat.

 

So any reasonable OMP runtime by default breaks the N iterations into blocks or "chunks". So each thread takes, say, 32 iterations. That way they are not touching the same cache lines. Using SCHEDULE, look at the chunk_size modifier. If N changes based on problem size, you can use the API calls omp_set_schedule (and omp_get_schedule) to make decisions on run-time chunk sizes.
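For example, a sketch (the chunk value 32 is arbitrary) of setting the chunked schedule either in the directive or at run time:

subroutine smooth_runtime(b, a, n)
  use omp_lib
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: a(n)
  real, intent(out)   :: b(n)
  integer :: i

  ! Pick a static schedule with 32-iteration chunks at run time; the same
  ! thing could come from the OMP_SCHEDULE environment variable instead.
  call omp_set_schedule(omp_sched_static, 32)

  !$omp parallel do schedule(runtime)      ! or hard-code schedule(static, 32)
  do i = 2, n
     b(i) = (a(i) + a(i-1)) / 2.0
  end do
  !$omp end parallel do
end subroutine smooth_runtime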

 

You also need to think about load-imbalanced problems, where iterations may have dramatically different amounts of work to do. Look up dynamic scheduling for those cases, or tasking.
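For instance (a sketch only, the work routine is hypothetical), a dynamic schedule with small chunks lets idle threads pick up the expensive iterations:

subroutine run_uneven(n)
  implicit none
  integer, intent(in) :: n
  integer :: i
  !$omp parallel do schedule(dynamic, 4)   ! free threads grab the next chunk of 4 iterations
  do i = 1, n
     call expensive_and_uneven_work(i)     ! hypothetical routine whose cost varies with i
  end do
  !$omp end parallel do
end subroutine run_uneven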

 

Now, all that said, you also asked about when caches flush to memory. That is under hardware control while the thread is running. As Jim said, at the end of a thread's life the OMP RT will flush the thread's cache. During the thread's life you can't control what is cached, what is in registers, etc. without a lot of work, and frankly you'd be a fool to try. The HW does a far, far better job than you. Don't waste time on this thought. BUT there is a programmer's trick worth mentioning.

Assume that N is really really big.  And

N/#threads 

is also really big. Then the block of B(I) each thread writes is much larger than the cache and will not fit. So as the thread writes, the cache fills up and the HW has to flush it to memory. If you know N is really big, and N/#threads >> cache size, then you know that cache is really not buying you anything, since it is going to be flushing to memory throughout this modification of a very large vector. In this case, you'd want to write directly to memory, right? Well,

-qopt-streaming-stores

 

is there for you to use. Note that it's a compiler option and applies to a whole source file. It tells the backend code generator to use special processor streaming-store instructions that move data from registers to memory directly, IF the processor supports those instructions. If the source has these large vector modifications, then YES, use it. But if it's a mix of large and small vectors, cache will probably help. So try it. You won't find it helps much, single digits at most, but some people like to fight for a 1-5% speedup in these rare cases. You may find that for most general codes this actually hurts performance. Modern caches are complex and usually make the right decisions for you on when to cache and when to flush. Personally I don't mess with this unless I am working with an obsessive customer.
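If you want to try it, the idea looks like this; keep the large-vector kernel in its own source file so the option affects only that file (the exact option keywords below are from memory, so check the compiler documentation for your version):

! streaming_kernel.f90 -- compile just this file with the option, e.g.
!   ifx -O2 -qopenmp -qopt-streaming-stores=always -c streaming_kernel.f90
subroutine big_vector_kernel(b, a, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: a(n)
  real, intent(out)   :: b(n)
  integer :: i
  !$omp parallel do
  do i = 2, n
     b(i) = (a(i) + a(i-1)) / 2.0   ! B is written once and never re-read: a streaming-store candidate
  end do
  !$omp end parallel do
end subroutine big_vector_kernel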

Amorin
Novice

Thank you for your detailed answer. I don't think we have false sharing issues, but load imbalance is likely.

 

But your last comment about big arrays filling the cache is interesting, because we have those huge arrays being passed as dummy arguments, and local working arrays which are smaller. Typically there will be arrays with different properties, for example P(1:N) where N is on the order of 100 to 100000 (the most interesting case for this discussion is when N is in the upper range; we know we won't get good scalability for small N). Then the typical loop is

 

DO i = 1, N
   CALL do_a_lot_of_work(P,Q,R,i)
END DO

 

where do_a_lot_of_work() will, for example, read P(i) and Q(i) and write into R(i). The big arrays are passed without slicing as

DOUBLE PRECISION, INTENT(inout) :: P(:), Q(:), R(:)

Is the compiler going to try and cache the whole array, or will it cache only a few elements around P(i), ... ?

 

PS: I am probably an obsessive customer. It is mostly parallel scalability that matters to us. A 5% speed-up is not much on 1 core, but a 5% gain in parallel efficiency becomes a lot on many cores.

 

TobiasK
Moderator

@Amorin 

The compiler is not responsible for caching anything. The compiler can generate prefetch instructions, but it should not do so in this case.

The CPU reads data from memory as soon as it accesses those data locations, and that data will then reside in the caches. The CPU is also able to write directly to memory without putting the data into the caches, the so-called streaming stores, but that's it.

 

If your function do_a_lot_of_work really looks like what you described, e.g. accessing a single element of arrays P, Q, R, I would not pass the whole arrays but just the elements themselves. If you declare an array as assumed shape, additional data has to be passed to the function. In your case I would then also declare the subroutine as elemental.
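Roughly like this (a sketch only, I am guessing at what the routine really does; put it in a module so that the explicit interface required for ELEMENTAL is available):

! Sketch: the interface now takes elements, not whole assumed-shape arrays.
elemental subroutine do_a_lot_of_work(p, q, r)
  double precision, intent(in)    :: p, q
  double precision, intent(inout) :: r
  r = r + p * q            ! placeholder for the real per-element work
end subroutine do_a_lot_of_work

Inside the OpenMP loop the call then becomes call do_a_lot_of_work(P(i), Q(i), R(i)), or, outside a loop, it can be applied to whole arrays at once.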

 

As Jim mentioned, if you are looking into parallel scalability this is probably the wrong approach.
If you are using only OpenMP to parallelize your program, you may run into NUMA effects due to 'wrong' data placement, among other things.
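The usual first step against NUMA effects is 'first touch' initialization: initialize the data with the same parallel loop structure that later computes on it, so each memory page ends up close to the threads that use it. A rough sketch (array names and sizes are just examples):

program numa_first_touch
  implicit none
  integer, parameter :: n = 10000000
  double precision, allocatable :: p(:), r(:)
  integer :: i
  allocate(p(n), r(n))             ! allocation reserves virtual pages only
  !$omp parallel do schedule(static)
  do i = 1, n
     p(i) = 1.0d0                  ! first touch: the page is placed near this thread
     r(i) = 0.0d0
  end do
  !$omp end parallel do
  !$omp parallel do schedule(static)
  do i = 1, n
     r(i) = r(i) + 2.0d0 * p(i)    ! same static schedule -> mostly local memory accesses
  end do
  !$omp end parallel do
  print *, r(1), r(n)
end program numa_first_touch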

 

Best
Tobias

jimdempseyatthecove
Honored Contributor III

We are at a point where insufficient information is available to us to provide specific advice. You are at the point where you need to use VTune in Administrator mode to locate the bottlenecks of the program. Your issue may or may not be cache related; it may be something entirely unexpected.

For example, depending on the expressions used and/or arguments passed and/or a lack of interfaces and/or inefficient interfaces... an unexpectedly large number of temporary arrays may be created without your knowledge. This will cause scalability issues, more so than would be observed in single-threaded use.
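A typical example (a sketch, the names are made up): passing a strided array section to a routine whose dummy argument must be contiguous forces the compiler to create a hidden temporary copy at every call.

! Sketch: SUB expects a contiguous explicit-shape array, so passing the
! strided section A(1:N:2) makes the compiler build (and copy back)
! a hidden temporary array on each call.
subroutine sub(x, m)
  implicit none
  integer, intent(in) :: m
  real, intent(inout) :: x(m)        ! explicit-shape dummy: must be contiguous
  x = x + 1.0
end subroutine sub

subroutine caller(a, n)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: a(n)
  call sub(a(1:n:2), n/2)            ! non-contiguous actual argument -> temporary copy
end subroutine caller

If I recall correctly, the compiler can report such copies at run time with the -check arg_temp_created option.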

VTune (hardware sampling) can also locate cache line conflicts.

 

Jim Dempsey

 
