Intel® oneAPI DPC++/C++ Compiler

Array of struct vs struct of arrays performance doesn't match up across similar devices

FantasticMrFox
Beginner

Classic optimization problem. Array of structs is simple; struct of arrays benefits from cache locality. I wanted to see the difference, so I created the attached program. The program counts occupied cells, chosen so that every cell has to be read. It produces the following result:

 

```

Running array of struct vs struct of arrays on <Intel(R) UHD Graphics [0x9bc4]> ...


Num cells | structure type | runtime average (ms) | runtime worstcase (ms)
----------------------------------------------------------------------------
1000 | Structs | 0.00544672 | 0.008666
1000 | Arrays | 0.00407695 | 0.0045
1000 | S Array | 0.00479932 | 0.009416
10000 | Structs | 0.0293872 | 0.037583
10000 | Arrays | 0.0259756 | 0.027333
10000 | S Array | 0.0264908 | 0.02725
100000 | Structs | 0.209335 | 0.21725
100000 | Arrays | 0.202833 | 0.21025
100000 | S Array | 0.202991 | 0.248416
1000000 | Structs | 2.0715 | 5.17675
1000000 | Arrays | 1.99674 | 2.23233
1000000 | S Array | 1.99635 | 2.02758
16777216 | Structs | 34.467 | 63.9828
16777216 | Arrays | 33.4852 | 60.468
16777216 | S Array | 33.6402 | 34.3354

```

Where:

Structs - Array of Structs (Using USM)

Arrays - Struct of Arrays (Using USM)

S Array - A single sycl::buffer holding the relevant data for comparison. 
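For readers without the attachment, here is a minimal host-side sketch of the two layouts and the counting pass. The `Cell` fields, the helper names and the fill pattern are assumptions made up for illustration; the attached program is not reproduced here and uses USM on the device rather than `std::vector`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cell: the fields are illustrative, not taken from the attachment.
struct Cell {
    float x, y, z;          // payload the counting pass never reads
    std::uint8_t occupied;  // the only field the counting pass reads
};

// Array of structs: one contiguous array of whole cells.
using CellsAoS = std::vector<Cell>;

// Struct of arrays: one contiguous array per field.
struct CellsSoA {
    std::vector<float> x, y, z;
    std::vector<std::uint8_t> occupied;
};

// AoS count: consecutive reads are sizeof(Cell) bytes apart.
std::size_t count_occupied(const CellsAoS& cells) {
    std::size_t n = 0;
    for (const Cell& c : cells) n += (c.occupied != 0);
    return n;
}

// SoA count: consecutive reads are 1 byte apart.
std::size_t count_occupied(const CellsSoA& cells) {
    std::size_t n = 0;
    for (std::uint8_t o : cells.occupied) n += (o != 0);
    return n;
}

// Test data: every third cell (index 0, 3, 6, ...) is marked occupied.
CellsAoS make_aos(std::size_t n) {
    CellsAoS cells(n);  // value-initialized, so all fields start at zero
    for (std::size_t i = 0; i < n; ++i)
        cells[i].occupied = static_cast<std::uint8_t>(i % 3 == 0);
    return cells;
}

CellsSoA make_soa(std::size_t n) {
    CellsSoA cells;
    cells.x.assign(n, 0.0f);
    cells.y.assign(n, 0.0f);
    cells.z.assign(n, 0.0f);
    cells.occupied.resize(n);
    for (std::size_t i = 0; i < n; ++i)
        cells.occupied[i] = static_cast<std::uint8_t>(i % 3 == 0);
    return cells;
}
```

Both `count_occupied` overloads return the same count for the same data; only the distance in memory between consecutive reads differs.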

 

This is on my pc with the card (from lspci):

>  9BC4, Intel® UHD Graphics Gen9 Coffee Lake

 

If you look at the table, you can see that the access times are all basically the same. 
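For context on the two runtime columns, this is roughly how an average and a worst case can be collected over repeated runs. It is only a sketch: `TimingStats` and `time_runs` are hypothetical names, and it is not the timing code from the attached program. For a SYCL kernel, the callable would also have to wait for the kernel to finish (for example with `queue.wait()`) so that the full execution is measured.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>

// Average and worst-case wall-clock time over a number of runs, in milliseconds.
struct TimingStats {
    double average_ms;
    double worst_ms;
};

// Calls fn() `runs` times and times each call with a monotonic clock.
template <typename Fn>
TimingStats time_runs(Fn&& fn, std::size_t runs) {
    using clock = std::chrono::steady_clock;
    double total_ms = 0.0;
    double worst_ms = 0.0;
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = clock::now();
        fn();  // must block until the measured work has completed
        const auto stop = clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        total_ms += ms;
        worst_ms = std::max(worst_ms, ms);
    }
    const double average_ms =
        runs > 0 ? total_ms / static_cast<double>(runs) : 0.0;
    return {average_ms, worst_ms};
}
```

A call such as `time_runs([&] { /* launch kernel and wait */ }, 100)` would yield one row of the table under these assumptions.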

 

I then ran this on a different PC with very similar hardware:

```

Running array of struct vs struct of arrays on <Intel(R) UHD Graphics 620 [0x3ea0]> ...


Num cells | structure type | runtime average (ms) | runtime worstcase (ms)
----------------------------------------------------------------------------
1000 | Structs | 0.00415342 | 0.01975
1000 | Arrays | 0.0026213 | 0.003
1000 | S Array | 0.00317628 | 0.0045
10000 | Structs | 0.0151256 | 0.016333
10000 | Arrays | 0.00401343 | 0.004416
10000 | S Array | 0.00453666 | 0.006
100000 | Structs | 0.169097 | 0.642666
100000 | Arrays | 0.0272721 | 0.02825
100000 | S Array | 0.0276183 | 0.028333
1000000 | Structs | 4.2028 | 4.79683
1000000 | Arrays | 0.168516 | 0.427083
1000000 | S Array | 0.168989 | 0.334666
16777216 | Structs | 70.515 | 80.0608
16777216 | Arrays | 4.19573 | 5.04425
16777216 | S Array | 4.22549 | 5.05592

```

Array of Structs vs Struct of Arrays.png

 

Here we can see that the array of structs is up to roughly 25x slower than the struct of arrays (1,000,000 cells: 4.20 ms vs 0.17 ms on average)! But the hardware:

> 3EA9, 3EA0 Intel® UHD Graphics 620 Gen9 Coffee Lake

 

is basically the same. From clinfo, the only differences are:

```

3c3
< Device Name Intel(R) UHD Graphics 620 [0x3ea0]
---
> Device Name Intel(R) UHD Graphics [0x9bc4]
15c15
< Max clock frequency 1150MHz
---
> Max clock frequency 1250MHz
60c60
< Global memory size 13368696832 (12.45GiB)
---
> Global memory size 53730476032 (50.04GiB)

```

 

So very similar hardware breeds extremely different results. Any ideas? 

 

 

HemanthCH_Intel
Moderator

Hi,


Thanks for posting in Intel Communities.


Could you please provide the OS details, the dpcpp version, and the commands used to compile and run the program?


Thanks & Regards,

Hemanth


FantasticMrFox
Beginner

Yes of course!

 

The compiler command is:

 

/opt/intel/oneapi/compiler/2022.1.0/linux/bin/dpcpp -fclang-abi-compat=7 -fsycl --gcc-toolchain=/usr -sycl-std=2020 -no-fma -fp-model=precise -Wall -Werror -fsycl  -O2 -g -DNDEBUG src/gpu_common/examples/array_of_structs_vs_struct_of_arrays.cpp -o profile_arr_structs_vs_structs_arr

 

Version of dpcpp is 2022.1.0, but here is the full version info:

/opt/intel/oneapi/compiler/2022.1.0/linux/bin/dpcpp -v
Intel(R) oneAPI DPC++/C++ Compiler 2022.1.0 (2022.1.0.20220316)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2022.1.0/linux/bin-llvm

 

As for OS, there are 2 scenarios:

- The first machine (my PC), where all three layouts run at the same speed, is running in a Docker container running Ubuntu 20.04 on a host OS running Ubuntu 20.04.

- The second machine, where the array of structs is slower, is running in a Docker container running Ubuntu 20.04 on a host OS running Ubuntu 16.04 with the real-time kernel patch.

 

HemanthCH_Intel
Moderator

Hi,


Could you please provide the Dockerfile and details of the two CPUs so that we can investigate your issue further?


Thanks & Regards,

Hemanth


FantasticMrFox
Beginner

Hi there, I was able to replicate this using the LLVM Docker image:

 

docker run -it --mount type=bind,source=$(pwd),target=/home/dev/ --privileged ghcr.io/intel/llvm/sycl_ubuntu2004_nightly:latest /bin/bash

 

where  `/home/dev` has the binary in it. 

 

So you should be able to just use this Docker image. For the CPUs, what information do you have in mind?

I have done 

cat /proc/cpuinfo

and attached the resulting output for both CPUs to this post.

cpu_information_fast.txt - From the machine with the fast time for SoA.

cpu_information_slow.txt - From the machine with the slow time for SoA.

HemanthCH_Intel
Moderator

Hi,


We are working on your issue internally and will get back to you soon.


Thanks & Regards,

Hemanth.


NoorjahanSk_Intel
Moderator

Hi,

 

Thank you for your patience.

 

This is expected behaviour: with SoA, performance is better due to good access locality and cache-line-aligned transactions.

It is always more performant to access cache-line-aligned data: the data port can merge accesses and limit the DataPort LSC to L3$ data traffic to only the few cache lines that cover the whole data block.
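As a rough illustration of the cache-line argument, the sketch below counts how many distinct cache lines are touched when one small field is read from each element at a given stride. The 64-byte line size, the element sizes and the function name `cache_lines_touched` are assumptions for illustration; this is simple arithmetic, not a model of the actual data port.

```cpp
#include <cstddef>

// Number of distinct cache lines touched when reading `field_size` bytes from
// each of `count` elements placed `stride` bytes apart, starting at a
// line-aligned base address. Assumes stride >= 1 and field_size >= 1.
std::size_t cache_lines_touched(std::size_t count,
                                std::size_t stride,
                                std::size_t field_size,
                                std::size_t line_size = 64) {
    std::size_t lines = 0;
    bool seen_any = false;
    std::size_t last_line = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t first = (i * stride) / line_size;
        const std::size_t last = (i * stride + field_size - 1) / line_size;
        // Offsets only grow, so a line is new exactly when its index
        // exceeds the highest index counted so far.
        for (std::size_t line = first; line <= last; ++line) {
            if (!seen_any || line > last_line) {
                ++lines;
                last_line = line;
                seen_any = true;
            }
        }
    }
    return lines;
}
```

For a hypothetical 16-byte struct with a 1-byte flag, `cache_lines_touched(1000000, 16, 1)` gives 250,000 lines for AoS, while the SoA case `cache_lines_touched(1000000, 1, 1)` gives 15,625 lines, so under these assumptions AoS pulls in 16 times as many cache lines for the same useful data.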

 

If you disable cacheability of resources, performance of every layout drops to the same low level and you observe equal results. We suspect that, for some reason, one of your machines exhibits a similar degradation.

 

Please set the environment variable below to disable cache usage.

export ForceAllResourcesUncached=1

 

Please refer to the link below for all available debug capabilities.

https://github.com/intel/compute-runtime/blob/master/shared/source/debug_settings/debug_variables_base.inl

 

Thanks & Regards,

Noorjahan.

 

NoorjahanSk_Intel
Moderator

Hi,


We haven't heard back from you. Could you please provide an update on your issue?


Thanks & Regards,

Noorjahan.


NoorjahanSk_Intel
Moderator

Hi,


I have not heard back from you, so I will close this inquiry now. If you need further assistance, please post a new question.


Thanks & Regards,

Noorjahan.

