Eric_O_
Beginner
204 Views

Cilkplus port to Raspberry Pi 2B

I compiled the new release of gcc-5.1 with the Cilkplus parallel processing extensions and runtime library for the ARMv7 architecture on the Raspberry Pi 2B single-board computer.  Two changes were needed.

The first change corrects a typo in generic/cilk-abi-vla.c by changing the second-to-last line of the file from

vla_internal_heap_free(t, full_size);

to

vla_internal_heap_free(p, full_size);

The second change is ARM-specific and applies to generic/os-fence.c.  Comment out the line

COMMON_SYSDEP void __cilkrts_fence(void); ///< MFENCE instruction

as

// COMMON_SYSDEP void __cilkrts_fence(void); ///< MFENCE instruction

and then add the define

#define __cilkrts_fence() __asm__ volatile ("DSB")

right above it.  I've been testing the results and getting reasonable parallel speedup using 4 cores on a number of algorithms.  My results are posted on the Raspberry Pi forum under the topic "Programming C/C++" in the thread "Cilkplus on RPi2B."
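For anyone who wants to experiment with the fence outside the runtime, here is a minimal sketch of an architecture-guarded barrier macro. The names `demo_fence`, `demo_publish` and `demo_consume`, the `__ARM_ARCH >= 7` guard, the `"memory"` clobber and the `__sync_synchronize()` fallback are all assumptions added for illustration; none of them are part of the two-line change described above.

```c
#include <assert.h>

/* Hypothetical macro name; the runtime itself uses __cilkrts_fence(). */
#if defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
/* DSB is available on ARMv7 and later. The "memory" clobber is an
   addition here: it also keeps the compiler from moving memory
   accesses across the barrier. */
#define demo_fence() __asm__ volatile ("dsb sy" ::: "memory")
#else
/* Assumed fallback for other targets: GCC/Clang full-barrier builtin. */
#define demo_fence() __sync_synchronize()
#endif

static volatile int demo_data = 0;
static volatile int demo_ready = 0;

/* Store a value, fence, then raise the flag that announces it. */
static int demo_publish(int value)
{
    demo_data = value;
    demo_fence();          /* order the data store before the flag store */
    demo_ready = 1;
    return demo_data;
}

/* Single-threaded demo reader: the flag must already be set. */
static int demo_consume(void)
{
    assert(demo_ready);
    demo_fence();          /* order the flag load before the data load */
    return demo_data;
}
```

Calling `demo_publish(42)` followed by `demo_consume()` in one thread only shows that the macro compiles and executes on the target; it does not demonstrate ordering between cores.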

It appears that cilk_spawn, cilk_sync and cilk_for are running without errors; however, I've not optimized the stack-swapping code in generic as has been done for Intel-architecture CPUs.

Is anyone working on this?

 

Jim_S_Intel
Employee

Just fyi, the "cilk-abi-vla.c" file has to do with supporting variable-length arrays (VLAs) within a Cilk Plus (spawning) function, and not the generic stack switching done by the runtime.   If your code does not use VLAs in a spawning function (which most don't), then those functions should not be called.   Moreover, I believe VLAs require compiler support, so I think you'd have to double-check whether the compiler is generating code for VLAs or not.

Last I recall, there were no architecture-specific optimizations in the runtime for switching stacks.   But Cilk Plus runtime development is no longer my primary role, so others may or may not have more up-to-date information.   The original theory and design behind Cilk was to make the spawns cheap, at the cost of more expensive steals, since steals are supposed to be rare.   Thus, I don't think optimizing the stack switching for a particular architecture is necessarily going to provide a large payoff.

Cheers

Jim

 

Hansang_B_Intel
Employee

Hi Eric,

Thank you for sharing this information!

Could you submit your contribution to https://www.cilkplus.org/submit-cilk-contribution, so that it is included in the cilkplus source package (then, in GCC)?

Eric_O_
Beginner

Jim Sukha (Intel) wrote:
Just fyi, the "cilk-abi-vla.c" file has to do with supporting variable-length arrays (VLAs) within a Cilk Plus (spawning) function, and not the generic stack switching done by the runtime.

Thanks for clarifying.  If I understand correctly, variable length arrays defined like

int n=atoi(argv[1]);
double x[n];

are always allocated on the stack in regular C, while in Cilkplus it appears generic/cilk-abi-vla.c always allocates such arrays on the heap and the extra code in x86/cilk-abi-vla.c is designed to allocate variable length arrays on the stack if possible and only uses the heap if necessary.  Is this correct?
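To make the plain C case concrete, here is a small sketch of a function that declares a variable-length array with automatic storage. The helper name `vla_sum` is hypothetical, and the sketch deliberately contains no Cilk keywords, so it says nothing about how either cilk-abi-vla.c behaves.

```c
#include <assert.h>

/* Hypothetical helper: fills a VLA with 1..n and returns the sum. */
static double vla_sum(int n)
{
    assert(n > 0);      /* a VLA with non-positive length is undefined */

    double x[n];        /* size is only known at run time */
    for (int i = 0; i < n; i++)
        x[i] = (double)(i + 1);

    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += x[i];

    return total;       /* x is released automatically on return */
}
```

With GCC on typical targets the array `x` is carved out of the current stack frame, which is why a large `n` can overflow the stack.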

The first version of the parallel recursive FFT that I wrote actually did allocate variable-length temporary arrays in the recursively cilk_spawn'ed subroutine and appeared to work as expected on the Raspberry Pi 2B.  Note, however, that for performance and memory-use reasons the parallel recursive FFT test that I posted on the Raspberry Pi forum does not allocate variable-length arrays.  This gives me ideas about what more should be tested in the current ARMv7 build.

Eric_O_
Beginner

HANSANG B. (Intel) wrote:
Thank you for sharing this information!  Could you submit your contribution to https://www.cilkplus.org/submit-cilk-contribution, so that it is included in the cilkplus source package (then, in GCC)?

Currently my patch just changes two lines of source in libcilkrts/config/generic rather than properly creating a new architecture subdirectory such as libcilkrts/config/arm for the ARM-specific changes.  After this is polished into a proper patch, I'll send it in.

Jim_S_Intel
Employee

I did not work on that section of code, but I believe you are correct.  Allocating the array on the stack is likely faster than putting it on the heap, but it requires additional work to figure out what to do.   Note also that the tricky case occurs when the variable-length array is declared in a continuation (i.e., after a cilk_spawn, but before the cilk_sync), because the continuation may execute on a different stack.   A variable-length array that is declared at the beginning of the function might "just work" if it is pushed onto the stack before the first cilk_spawn.   But you shouldn't quote me on that, since it probably depends on implementation details that I am not familiar with.

As far as testing on non-x86 architectures goes, the places that I might expect the most potential issues would be in the work-stealing / synchronization sections of the code (e.g., the THE protocol, __cilkrts_leave_frame, etc.), since those are places where a difference in memory model might introduce bugs.   Some stress-tests on steals might be more likely to reveal some of those kinds of issues if they exist.
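As a rough idea of what such a steal stress-test could look like, here is a sketch that spawns at every level of a recursive Fibonacci and compares each run against a serial reference. The function names are hypothetical, and the `__cilk` check with empty fallback macros is an assumption so that the file also builds as a serial program on compilers without Cilk Plus; in that case no steals occur and only the arithmetic is checked.

```c
#include <assert.h>

#ifdef __cilk
#include <cilk/cilk.h>
#else
/* Serial elision (assumed fallback): the keywords expand to nothing. */
#define cilk_spawn
#define cilk_sync
#endif

/* Fine-grained recursion: every non-leaf call spawns, so idle workers
   get many opportunities to steal a continuation. */
static long stress_fib(int n)
{
    if (n < 2)
        return n;
    long a = cilk_spawn stress_fib(n - 1);
    long b = stress_fib(n - 2);
    cilk_sync;
    return a + b;
}

/* Serial reference used to validate the parallel result. */
static long serial_fib(int n)
{
    long prev = 0, cur = 1;
    for (int i = 0; i < n; i++) {
        long next = prev + cur;
        prev = cur;
        cur = next;
    }
    return prev;
}

/* Repeat the computation and count runs that disagree with the reference. */
static int stress_run(int n, int repeats)
{
    assert(n >= 0 && repeats > 0);
    long expected = serial_fib(n);
    int mismatches = 0;
    for (int r = 0; r < repeats; r++)
        if (stress_fib(n) != expected)
            mismatches++;
    return mismatches;
}
```

A nonzero return from something like `stress_run(30, 1000)` on a multi-core board would suggest a problem, though a clean run does not prove the synchronization code is correct.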

Cheers,
 
Jim
Barry_T_Intel
Employee

We worked with the Intel compiler developers to add support for Variable Length Arrays in spawning functions, so the Intel compiler knows to call __cilkrts_stack_alloc() and __cilkrts_stack_free() to allocate and delete a VLA. This allows the Cilk runtime to expand the stack for the VLA if possible, or allocate it on the heap if necessary.

I don't believe that the GCC implementation of VLAs knows anything about spawning functions, so use of VLAs in a spawning function is currently not supported in GCC. Which is why I just threw together the generic implementations of the functions - they never get called. I figured we'd flesh them out when we added VLA support in spawning functions to GCC. I guess that time is now. :o)
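The "in place if possible, heap if necessary" idea can be illustrated with a small standalone sketch. This is not the runtime's code and does not use __cilkrts_stack_alloc() or __cilkrts_stack_free(), whose signatures are not shown in this thread; the type `scratch_t`, both function names and the 32-element threshold are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SCRATCH_INLINE_DOUBLES 32   /* hypothetical threshold */

typedef struct {
    double  inline_buf[SCRATCH_INLINE_DOUBLES]; /* lives where the struct lives */
    double *ptr;                                /* current buffer, or NULL */
    int     on_heap;                            /* 1 if ptr came from malloc */
} scratch_t;

/* Use the inline buffer when the request fits, otherwise fall back to malloc. */
static double *scratch_acquire(scratch_t *s, size_t count)
{
    assert(s != NULL);
    if (count <= SCRATCH_INLINE_DOUBLES) {
        s->ptr = s->inline_buf;
        s->on_heap = 0;
    } else {
        s->ptr = malloc(count * sizeof *s->ptr);
        s->on_heap = (s->ptr != NULL);
    }
    return s->ptr;
}

/* Free only what was heap-allocated, then reset the bookkeeping. */
static void scratch_release(scratch_t *s)
{
    assert(s != NULL);
    if (s->on_heap)
        free(s->ptr);
    s->ptr = NULL;
    s->on_heap = 0;
}
```

A caller would declare a `scratch_t` as a local variable, acquire a buffer, use it, and release it before returning, which mirrors the allocate/free pairing described above without depending on any compiler support.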

    - Barry

Eric_O_
Beginner

This message is to indicate that I've just created a new patch for gcc-5.2 to support Cilk on Raspberry Pi.  The patch is now cleaner in the sense that it creates a new directory config/arm which contains the architecture specific files in a way similar to config/x86. 

http://fractal.math.unr.edu/~ejolson/patches/gcc-5.2.0-ejo.patch

Unfortunately the above patch also contains a one-line change to the cpp preprocessor to enable UTF-8 in C identifiers.  Fortunately this change for UTF-8 support appears at the end of the patch and is easy to remove.  Note also that the directory config/arm needs to be created and a few files copied before applying the patch.  Exact details on how to apply the patch and build a working compiler are provided at

https://www.raspberrypi.org/forums/viewtopic.php?p=802657

The patch has been tested and works on the ARMv6 of the original Raspberry Pi and on the ARMv7 of the new Raspberry Pi 2B.  Hopefully this is enough to get the ARM Cilkplus patch into mainline gcc for the next release.  Please let me know if anything else is required.

Hansang_B_Intel
Employee

Thank you for sharing your contribution!

It might take some time for your contribution to be part of GCC mainline, but it will happen eventually.

ejol
Beginner

This is Eric, who started this thread.  It's been some time since I last posted, and I could not log in or reactivate my old account.  I recently compiled gcc-7.1 on ARM and am testing on an 8-core SBC based on the Samsung/Nexell S5P6818.  First off, I'm happy to see that the ARM architecture is now recognized and no patching or fiddling with configuration files is necessary.  At the same time, there appears to be a cilkplus performance regression: the code runs about 40 percent slower than it did with gcc-5.2.  Note that this regression doesn't affect the non-cilkplus version of the code, which still runs at the same speed, nor does it affect gcc-7.1 cilkplus running on 64-bit Intel.  I'm posting this quick message to check whether this 40 percent slowdown of cilkplus on ARM with gcc-7.1 versus gcc-5.2 is well known and what the cause might be.  Thanks!
