Intel Community › Software Development Topics › Intel® Moderncode for Parallel Architectures
## Scalable Parallel implementation of Conjugate Gradient Linear System solver library that is NUMA-aware and cache-aware

aminer10

Beginner


12-19-2014 11:49 AM · 10 Views

Hello,

My Scalable Parallel implementation of Conjugate Gradient Linear System solver library that is NUMA-aware and cache-aware is now available. You no longer need to allocate your arrays on different NUMA nodes yourself, because I have implemented all the NUMA handling for you. This new algorithm is NUMA-aware and cache-aware, and it really scales on NUMA architectures and on multicores. If you have a NUMA architecture, just run the "test.pas" example included in the zip file and you will see that the new algorithm really scales on NUMA architectures.

Frankly, I think I would have to write something like a PhD paper to explain my new algorithm fully, but I will leave it as it is for the moment; perhaps I will do that in the near future.

This scalable parallel library is especially designed for large-scale industrial engineering problems, such as the industrial finite element problems you find in practice. The library has been ported to both FreePascal and all the Delphi XE versions; I hope you will find it useful.

My new algorithm contains two parts that dominate its cost: a vector multiplied by the transpose of a matrix, and a vector multiplied by a matrix. In my previous algorithm I parallelized only the data transfer from an L2 cache-line hit to the CPU, which costs around 10 CPU cycles for every double, together with the multiplications and additions of doubles.

That was not enough: you also have to parallelize the data transfers from main memory into the L2 cache. That is what makes an algorithm NUMA-aware and lets it really scale on NUMA architectures, and it is what I have done in my new algorithm: the memory-to-L2-cache transfers are parallelized as well, which makes the new algorithm NUMA-aware and really scalable on NUMA architectures, and cache-aware too.

You can download my Scalable Parallel implementation of Conjugate Gradient Linear System solver library that is NUMA-aware and cache-aware from:

https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear...

Thank you,

Amine Moulay Ramdane.

0 Replies

For more complete information about compiler optimizations, see our Optimization Notice.