Intel® oneAPI DPC++/C++ Compiler
Talk to fellow users of Intel® oneAPI DPC++/C++ Compiler and companion tools like Intel® oneAPI DPC++ Library, Intel® DPC++ Compatibility Tool, and Intel® Distribution for GDB*

Help me understand dimensions, eu, threads.

mhogstrom
Beginner

I work in the field of cybersecurity and have a rig at home with Arc A750 cards.
I have a background as a software developer and I am currently building my own hash cracker.
At the moment I am working on a brute forcer.

Alphabet: A-Z, a-z, symbols, and digits.

AAAAA
AAAAB
AAAAC
...

In my first version the host (CPU) created a buffer with a couple of million passwords.
I sent them over to GPU memory
and let the GPU work on them.

// Pseudo code (SYCL-style)
q.memcpy(gpu_buffer, host_buffer, count * 16);
q.memcpy(gpu_hash_to_compare, host_hash_to_compare, hash_size);
q.parallel_for(range<1>(count), [=](id<1> index) {
    const char* key = &gpu_buffer[index * 16];   // fixed-size 16-byte keys
    if (gpu_hash_to_compare == calcHash(key)) {
        // Found
        // STOP processing
    }
});


All calculations are independent of each other.
The GPU is free to distribute work as it pleases.

I have read about dimensions, and that you can have blocks (work-groups).
Those seem useful when work-items within a group depend on each other,
while the groups themselves can execute in parallel.

In my case all work-items are independent, so I don't need work-groups?

Within hash cracking there are mask attacks: ?u?d?d?u (uppercase, digit, digit, uppercase).
A11A 
A11B
A11C
..

To parallelize it I was going to feed it partial strings.
A11?
A12?
A13?
A14?


And parallelize the last position on the GPU side with a for-loop.
Horrible performance!
Now I think the poor performance is related to something about SIMD that I was missing.

After that I read more about dimensions.
I rewrote the parallel_for loop to use 2 dimensions.

The first dimension is an index into the prefixes "A11", "A12", ...
The second dimension represents the last digit or digits.
A11A A11B A11C A11D ..
A12A A12B A12C A12D ..
A13A A13B A13C A13D ..
.. 
This approach was an order of magnitude faster. 
3 Dimensions didn't make any difference.

I did some more reading on optimization.
I think I have just scratched the surface.
If I distribute work properly I will be able to speed up the program.
But that requires a bit more knowledge about the hardware.

On my Intel Arc A750:
EUs (execution units): 448
hw threads per EU: 8

One way to send a batch job is
Keys/work_items = 448 * 8 * (some multiple)
 
But how do I size the dimensions in an optimal way?
I have looked at some tables for VTune, but there is a missing step for me.


At the moment, the first dimension is the number of keys or partial keys available to process.
To simplify my own algorithm, the second dimension equals the number of different letters/digits in the last position or the last two positions.

It is a multiple of 10, 26, X, 94:
Only_digits = 10
Uppercase = 26
All_letters_and_symbols = 94



How do threads work?
On a normal CPU, threads are scheduled onto a core.
Do I need to worry about threads at all, or should I just send over a key batch that is a multiple of the number of EUs/execution units?
