Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Locking CPU cache lines for a thread (L1)

Younis_A_
Beginner

Hi,

I'm working on securing access to the L1 cache by locking it line by line. Is there any way to do this? For example, with two threads accessing the L1, each line would be locked for a certain time to the thread that accessed it.
Regards,

Younis

jimdempseyatthecove
Honored Contributor III

What are the operations you wish to perform during the lock?

You are aware that multiple addresses map to the same cache line. Think of it like a shared parking slot: only one car at a time can park in slot N, and the license plate number (the memory address) identifies whose car is currently in the slot. The rule for the slot is "if another car is in the slot, push it out."
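To make the parking-slot picture concrete, here is a minimal C sketch that computes which set an address maps to, assuming a typical 32 KiB, 8-way L1 data cache with 64-byte lines; the geometry is illustrative, not queried from the hardware.

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE  64
#define NUM_WAYS   8
#define CACHE_SIZE (32 * 1024)
#define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))   /* 64 sets */

static unsigned set_index(const void *p)
{
    /* The set index comes from the address bits just above the
       6-bit line offset. */
    return (unsigned)(((uintptr_t)p / LINE_SIZE) % NUM_SETS);
}

int main(void)
{
    static char buf[2 * CACHE_SIZE];
    /* Addresses 4 KiB apart (CACHE_SIZE / NUM_WAYS) land in the same
       set: they compete for the same NUM_WAYS "parking slots". */
    printf("set of buf[0]    = %u\n", set_index(&buf[0]));
    printf("set of buf[4096] = %u\n", set_index(&buf[4096]));
    printf("set of buf[8192] = %u\n", set_index(&buf[8192]));
    return 0;
}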

Answering the first question may yield a solution that you haven't thought of.

Jim Dempsey

Younis_A_
Beginner

Thank you for replying. I don't care about what kind of operations will take place. I'm trying to prevent threads from gaining any information about which addresses a victim thread has accessed. In my view, if I can lock the cache lines that a thread accesses for a certain time and then flush them after a predefined time, other threads can't gain any information about the accessed addresses and launch an attack.
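For the flush half of this idea, the x86 side is straightforward. A minimal sketch using the CLFLUSH intrinsic to evict every line of a buffer (the buffer is illustrative; note this only flushes lines, it does not lock them):

#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64

/* Evict every cache line covering [p, p + len) from all cache levels. */
static void flush_buffer(const volatile void *p, size_t len)
{
    uintptr_t addr = (uintptr_t)p & ~(uintptr_t)(LINE_SIZE - 1);
    uintptr_t end  = (uintptr_t)p + len;
    for (; addr < end; addr += LINE_SIZE)
        _mm_clflush((const void *)addr);
    _mm_mfence();   /* order the flushes against later accesses */
}

int main(void)
{
    static char sensitive[256];
    sensitive[0] = 1;                      /* touch: lines enter the cache */
    flush_buffer(sensitive, sizeof sensitive);
    return 0;
}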

Younis

jimdempseyatthecove
Honored Contributor III

If the threads are in different processes, then the virtual address spaces and physical address spaces (at any one time) preclude sharing of the L1 cache. *** subject to your process not setting up shared memory between processes ***

You may have multiple threads within the same process (sharing the same virtual memory), whereby each thread can access all of the process's virtual memory. In this case, multiple threads from the same process can share the same cache line.

If you want to prevent this from happening, then split your program into multiple processes.

You can use various inter-process messaging techniques and/or have one or more blocks of shared memory between the processes. The information you want to hide from the other process must not be placed into the shared memory block(s).
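A minimal POSIX sketch of that separation (error handling omitted; the name /demo_shm and the sizes are invented for illustration, and older glibc needs -lrt): the processes communicate only through the explicitly created shared block, while the secret lives in ordinary private memory that the other process cannot map.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Explicit shared block: ONLY data both processes may see goes here. */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);
    char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

    /* Ordinary private memory: the other process has no mapping of it. */
    char secret[64] = "key material never enters the shared block";

    /* Publish only derived, non-sensitive results. */
    snprintf(shared, 4096, "public result derived from secret[0]=%c",
             secret[0]);

    munmap(shared, 4096);
    close(fd);
    shm_unlink("/demo_shm");
    return 0;
}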

Jim Dempsey

le_g_1
New Contributor I

Are you working on cache-timing side-channel attacks on cryptographic keys? Well-known refereed papers in this field could give you enough hints.

jimdempseyatthecove
Honored Contributor III

I should add that, for the sensitive data, you should allocate what is called (on Windows) non-pageable memory, and verify that the non-pageable memory does not reside in, or more precisely is never written to, the system page file, as the page file can potentially be read by other processes. If the memory management protection is weak, then a different process, service, driver, or filter (virus) might snatch data you place into non-pageable memory. This leaves your "keep" (the place where you store your valuables) limited to the register set. Yet even this isn't entirely safe unless you can shield your process, or more precisely the hardware thread, from system interrupts, as at interrupt time the register set, or portions thereof, may get saved to RAM and then potentially be seen.
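On Linux the closest analogue is mlock(2), which pins pages so they are never written to swap. A minimal sketch under that assumption (the buffer and its use are purely illustrative), scrubbing the secret before unlocking:

#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    static unsigned char secret[4096] __attribute__((aligned(4096)));

    /* Pin the page: it will never be written to the swap/page file. */
    if (mlock(secret, sizeof secret) != 0)
        return 1;                 /* may require CAP_IPC_LOCK / ulimit -l */

    memcpy(secret, "key material", 12);   /* ... use the secret ... */

    /* Scrub before unlocking; the volatile pointer keeps the compiler
       from optimizing the wipe away. */
    volatile unsigned char *p = secret;
    for (size_t i = 0; i < sizeof secret; i++)
        p[i] = 0;

    munlock(secret, sizeof secret);
    return 0;
}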

Jim Dempsey

Younis_A_
Beginner

Thank you so much for all the responses. I found ARM code that can be used to lock CPU L1 cache lines, but I can't find the same for Intel. So, is there any direct or indirect way to lock CPU cache lines for a process, a thread, or even a VM on Intel processors?

Younis

jimdempseyatthecove
Honored Contributor III

On Intel IA-32 and Intel 64 processors, the L1 and L2 caches are typically private to each core within the processor. Some processor core designs permit reading another core's L2 cache without going through RAM, but I am not aware of any way to directly read another core's L1 (HT siblings can read their core's L1). Any core can potentially cause an eviction from another core's L1; however, this comes with the restriction that the other core also has to map the same physical address. That said, the Intel IA-32 and Intel 64 processors do not, as far as I am aware, have a means to write to the L1 cache while inhibiting the written data from being enqueued to RAM. ARM may treat the L1 cache as an extended register set; the Intel cache design treats it as a remembrance of data written to RAM.

This leaves the register set as your "only" recourse. On Intel 64 you have a sizable number of registers per hardware thread: ~13 x 64-bit general-purpose registers, plus 16 x 128-bit or 256-bit vector registers. You also have the x87 FPU stack to store things in.

The "only" above can be circumvented if you have available a set of physical addresses that you can map to that shares the characteristics that it is cacheable .AND. appears writeable, but in fact is non-writable. You would also like it not to be located on a bus that can be snooped.

The "trick" then is for you to interact with the O/S to constrict your (some of your) software threads to specific hardware threads (affinity binding) .AND. exclude the selected hardware threads (and core) from participating in interrupt handling .AND. exclude the O/S from scheduling other software threads to those hardware threads. Some O/S's may have Real-Time support API's that permit you to do this.

Jim Dempsey 

 

Younis_A_
Beginner

Thank you, Jim, for replying. It's really appreciated.

SergeyKostrov
Valued Contributor II
>>...I'm working on securing access to L1 cache by locking it line by line. Is there any way to do it?

Try to boost the priority of the thread that will do the main processing to Real-Time. (Note: I used that technique in a financial system when a cryptography software subsystem had to do some task.)
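On Linux, the closest equivalent of that boost is the real-time FIFO scheduling class. A minimal sketch (needs CAP_SYS_NICE or root, and as the replies below note, it still does not make the thread immune to interrupts):

#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct sched_param sp;
    memset(&sp, 0, sizeof sp);
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);

    /* Move this process into the real-time FIFO scheduling class. */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");   /* typically needs CAP_SYS_NICE */
        return 1;
    }
    puts("running with SCHED_FIFO real-time priority");
    return 0;
}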
jimdempseyatthecove
Honored Contributor III

Sergey,

Boosting the priority might not exclude it from being interrupted. You would need an O/S feature that would permit a (privileged) thread to request, and get acknowledgement of that request, to run continuously. While this would be similar to requesting real-time priority, real-time priority might not preclude preemption. Example: oversubscribing the number of threads requesting real-time priority. If all are granted, the O/S would time-slice the threads; if not, the threads granted rights might be given the full runtime. Note that this thread could have callback functions run on its behalf by other threads (e.g. a driver), but the thread itself must not be interrupted, even at the request of a system shutdown. Presumably a callback function, run by a different thread, would write a flag in memory indicating the shutdown request. Periodically the secured thread would poll the shutdown flag, destroy any sensitive information held in registers, then terminate itself.
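A sketch of that shutdown-flag protocol using C11 atomics (my illustration, not Jim's code; the register scrub is symbolic, and real code would zero the specific registers in assembly):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Set by a callback run on some OTHER thread at shutdown request;
   the secured thread is never interrupted, it only polls. */
static atomic_bool shutdown_requested;

static void *secured_thread(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&shutdown_requested,
                                 memory_order_acquire)) {
        /* ... work on sensitive values held in registers ... */
    }
    /* Destroy sensitive register state before giving up the CPU
       (GCC-style inline assembly, zeroing one register as an example). */
    __asm__ volatile("xorps %%xmm0, %%xmm0" ::: "xmm0");
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, secured_thread, NULL);
    atomic_store_explicit(&shutdown_requested, 1, memory_order_release);
    pthread_join(t, NULL);
    puts("secured thread scrubbed its state and exited");
    return 0;
}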

In addition to this, the program would have to be written so that it never calls any O/S function or local function that would save to memory the registers that are required to be un-snoopable.

If the above is followed, there may still be a very small chance of reverse engineering the protected information if the code is not written to take this into consideration. That is, while the registers can be protected as above (if the O/S provides for it), what is not protected are the memory fetches performed by the code. Additionally, the performance counters of that thread might be readable. If the code space is somehow readable by a different thread (it will be to some threads), then the combination of the performance counters and the memory fetches might yield some insight into the initial inputs to the protected code, of which the spying program may have a copy. The spying code could then re-run the code with the results now visible to itself. The protected code would have to be written to circumvent this type of attack.

Jim Dempsey

McCalpinJohn
Honored Contributor III

The kind of locking that ARM (optionally) supports is not supported by most general-purpose processors.  I have not run across such functionality while working in the design teams at SGI (MIPS, Itanium), IBM (POWER), or AMD, and I have not seen any indication that Intel supports such a feature either.

Cache locking or similar functions seem to be limited to processors targeting embedded markets.  ARM supports several types of cache locking, while TI processors (DSPs) support configuring the SRAM as partly cache and partly locally controlled memory.  For example, a chip with a 64 KiB "level-1 SRAM" could configure 0, 16, 32, 48, or 64 KiB as cache, with the remainder as explicitly controlled local memory.

It is essentially impossible for unprivileged code to gain specific information about the memory locations accessed by another thread, and surprisingly difficult to get even general information.  If a system is configured for time-sharing and allows an "attacker" task to request services from a "target" task, and provides a high-resolution timer, then some information about how long it takes to complete the task(s) can be obtained.  This is typically used for timing attacks against compute-intensive services -- it is much more difficult to learn anything about memory accesses because there are so many different ways for the "target" process to get the same average memory latency.  Latency for a load can take almost any value between ~4 cycles and well over 500 cycles, making it effectively impossible to fit an unambiguous model for more than a handful of accesses.    (I know this because I have spent much of the last 15 years building and testing models for understanding memory accesses in cases where I control everything, and it is really hard work -- even with perfect control over the code being executed, the process placement, the page size(s), and with full access to hardware performance counters.)
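To get a feel for the spread described above, one can time a single dependent load with RDTSCP. A minimal GCC-flavored sketch (the two measurements are illustrative; repeated runs vary wildly with cache and TLB state, which is exactly the point):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtscp, _mm_clflush, _mm_lfence */

static volatile uint64_t probe;

static uint64_t time_one_load(const volatile uint64_t *p)
{
    unsigned aux;
    _mm_lfence();
    uint64_t t0 = __rdtscp(&aux);
    uint64_t v  = *p;            /* the single load being timed */
    uint64_t t1 = __rdtscp(&aux);
    _mm_lfence();
    (void)v;
    return t1 - t0;
}

int main(void)
{
    probe = 42;
    printf("warm (cached)  : %llu cycles\n",
           (unsigned long long)time_one_load(&probe));
    _mm_clflush((const void *)&probe);
    printf("cold (flushed) : %llu cycles\n",
           (unsigned long long)time_one_load(&probe));
    return 0;
}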

Bernard
Valued Contributor I

As Jim said, boosting a thread to Real-Time priority will not keep it from being interrupted, either by an ISR/DPC routine or by a system thread running at the same priority. Moreover, there are "housekeeping" system threads which run at lower priority; those threads would also be preempted, which can cause system instability.

Zakaria_I_
Beginner

Hi Mr. Younis A.,

May I have your email address? I would like to discuss this interesting subject with you; I'm working on it too.

Zakira_I_
Beginner

Hi all,

I am looking for code for side-channel attacks and covert-channel attacks in the cloud. Could anyone send me that code, please?

Kindly guide me; I need your help.

Jason_V_
Beginner

Surely there is some way for kernel code at the highest privilege level (shared kernel address space) to lock a page in memory (given its physical memory address) and ensure that at least part of it is ALWAYS in one or more L1 cache lines?

I am currently trying to figure out precisely how that might be done on my 4 (8) core Haswell i7-4910MQ in a Linux kernel module. My cache info is:

$ cat /sys/devices/system/cpu/cpu0/cache/index{1,2,3}/{coherency_line_size,level,size,shared_cpu_list,ways_of_associativity}
64  1 32K     0,4 8
64  2 256K   0,4 8
64  3 8192K 1-7 16

So I can access L1 cache with line size 64 on all 4 physical cores.

Is there any way of pinning / reserving a set of contiguous lines (say, 0-4) to a non-memory address (not mapped to memory) - that is, using the cache as "inter-core communication register memory" - OR to a contiguous area of one physical page in memory that is never un-cached?

Just wondering where to find more documentation on the above; I would appreciate any advice. Is it possible to do this in assembler? Which instructions control cache line loading and locking in kernel mode? Investigating ... thanks in advance for any replies.
Jason_V_
Beginner


I am reading the "Intel 64 and IA-32 Architectures Software Developer's Manual", combined volumes, Vol. 3 "System Programming", section 11.5, about the cache control instructions (PREFETCHh, CLFLUSH, CLFLUSHOPT, MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD).

So even if every task has exactly the same page mapped at, say, 0x7fff ffff ffff 000, and the kernel maps that virtual address to the same kernel page in every process, like the VDSO, there is still no way to ensure it is in a known set of cache lines in EVERY processor's cache? Because the only cache/memory types are (Strong Uncacheable, Uncacheable, Write Combining, Write Through, Write Back, Write Protected), every task's code, and the kernel, would still have to read that page every N cycles on all processors, or issue cache load instructions (what is N?), to guarantee that the page is ALWAYS cached in every processor? Since there is neither an "UNMAPPED" (not RAM-backed) nor an "Always Cached" state?

I wish there were either a "reserved but unmapped" state or an "Always Cached" state for Intel cache lines - it appears there is not.

 

Jason_V_
Beginner


 

I guess the answer to my question is to use QPI / the DPDK (http://dpdk.org) for inter-core on-chip communication, rather than trying to fix memory in certain cache lines or use cache lines not backed by RAM - not possible.

But how can I find out whether my Haswell-MB (i7-4910MQ) chip has QPI or not? I don't think I can explicitly use QPI on this chip. Or does it have a northbridge? Or should I use the PCH / I2C / SPI / GPIO? I would appreciate advice on how best to achieve atomic on-chip communication of 64-bit values between a task on one core and a task on another core, where each core is any logical or physical core on the same chip, without going through external RAM (NUMA mode not enabled).
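For what it's worth, an ordinary coherent store/load pair already behaves this way on a single chip: the coherency protocol typically services the transfer core to core through the shared L3 rather than from external RAM. A minimal C11 sketch (the mailbox variable and values are invented; the alignment keeps it alone in its line to avoid false sharing):

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* One variable alone in its 64-byte cache line: no false sharing. */
static _Alignas(64) _Atomic uint64_t mailbox;

static void *producer(void *arg)
{
    (void)arg;
    for (uint64_t i = 1; i <= 3; i++)
        atomic_store_explicit(&mailbox, i, memory_order_release);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    uint64_t seen = 0;
    while (seen < 3)   /* consumer spins; the line migrates cache-to-cache */
        seen = atomic_load_explicit(&mailbox, memory_order_acquire);

    pthread_join(t, NULL);
    printf("last value seen: %llu\n", (unsigned long long)seen);
    return 0;
}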
 

jimdempseyatthecove
Honored Contributor III

The cache levels "sit" between the virtual-to-physical memory translation system (TLB) and the physical RAM. This is duplicated per core for L1, and per core or per tied pair of cores for L2, depending on the Intel architecture. The L3 (if present) is usually tied to a single die; some CPUs have multiple dies within a package, and SMP systems can have multiple packages.

When an application issues a system call, the O/S (both Windows and Linux) typically does not use a dedicated thread (or threads) to perform the system function; rather, the application's own thread transitions in privilege level to perform the call. Other system calls are initiated by hardware interrupts. Depending on the O/S design, the O/S may have one or more hardware threads dedicated for this purpose, or, more frequently, it preempts an unlucky user-process thread. Regardless of the thread used/taken, at some point in the interrupt service routine the task scheduler will get called to resume the software thread, should one be waiting.

The cache design of current IA-32 and Intel 64 processors does not have the features you seek.

a) It cannot lock specific cache lines in L1, L2, or L3 (if present). An application thread, running on a pinned hardware thread, can frequently touch the locations to make them, in essence, sticky, but all bets are off on preservation should the O/S choose this hardware thread to preempt for interrupt service.

b) Locking an L1 cache line (or making it sticky) is beneficial only for intra-core communication, not inter-core communication.

c) The cache coherency system permits deferred writes to physical RAM (movement from cache to RAM) but does not provide a "don't write to RAM" mode for updated L1 cache lines.

>>So I can access L1 cache with line size 64 on all 4 physical cores

No. (Someone can correct me if I am wrong on this.) The L1 cache of core x is not accessible by core y (it is accessible by the HT sibling on the same core). What is accessible is the core x L2 cache line paired with the core x L1 cache line, by a paired core y (should there be one) that shares the same L2 (some processors have two cores per L2; many have only one core per L2). When that is not available, what is accessible is the core x L3 cache line paired with it, by any of the cores sharing the same L3 (this may be all or some of the cores on the same die). When that is not available, on multi-socket systems, some permit L3-to-L3 transfers without going through RAM; others do not.

Your best option on Linux is to make a memory allocation (or re-attribute it after allocation) from what is called a non-paged pool. This will prevent the page from being paged out to the page file. Then, in your code, frequently _mm_prefetch the memory location from all/each core participating in the shared locations. Note that this may cause unnecessary updates (L3 -> L2 -> L1) and thus extend latencies unnecessarily.
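A minimal sketch of that pattern, assuming Linux and GCC (mlock stands in for the non-paged-pool allocation, and the loop bound and exit are illustrative; whether the line actually stays resident is not guaranteed, per the caveats above):

#include <stdint.h>
#include <sys/mman.h>
#include <xmmintrin.h>   /* _mm_prefetch */

/* One 64-byte line, alone, holding the shared values. */
static _Alignas(64) volatile uint64_t shared_line[8];

int main(void)
{
    /* Keep the backing page resident: never written to the page file. */
    if (mlock((const void *)shared_line, sizeof shared_line) != 0)
        return 1;   /* may need CAP_IPC_LOCK / ulimit -l */

    for (int i = 0; i < 1000000; i++) {
        /* Frequently re-touch the line so it tends to stay in this
           core's cache; this makes it "sticky", not locked. */
        _mm_prefetch((const char *)shared_line, _MM_HINT_T0);
        /* ... real work against shared_line goes here ... */
    }

    munlock((const void *)shared_line, sizeof shared_line);
    return 0;
}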

Jim Dempsey
