Solved: Re: Double precision on ARC GPU?

caplanr · ‎01-25-2023

Hi,

I was thinking of purchasing an ARC A770 Limited Edition in order to use it to develop/test our Fortran codes for future use with Data Center MAX GPUs.

However, from some sites on the internet, it seems that the ARC GPUs do not have any double precision (DP) floating point units in them at all. As our codes are memory bandwidth bound, even a minimal number of DP units would be enough (for example, we can efficiently run our codes on NVIDIA GeForce cards even though they have 1/32 as many DP units as their data center counterparts).

I also have seen that apparently, the ARC GPUs can "emulate" DP depending on the compiler settings (or according to one post, depending on the name of the executable!).

Could you please clear this up?

Can I compile and run my DP Fortran codes with "do concurrent" and/or OpenMP target offload with IFX on an ARC GPU such as the A770 (even in an emulated mode)?

Thanks!

- Ron

Ron_Green · ‎01-26-2023

I did find the compiler env vars you referred to.

Our Fortran compiler uses a separate compiler, the Intel Graphics Compiler (IGC), for compiling the device code. That compiler comes from another group at Intel, our graphics group.

Here are the env vars I am told to use to support emulation:

Before compilation

export IGC_EnableDPEmulation=1

runtime: 
export SYCL_DEVICE_WHITE_LIST=""
export OverrideDefaultFP64Settings=1
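
Put together, the settings above might be used along the following lines. This is only a sketch under assumptions: the file names `saxpy.f90` and `saxpy_dp` are hypothetical, and the `-fiopenmp -fopenmp-targets=spir64` options are the usual ifx flags for OpenMP target offload rather than anything stated in this thread. The build step is skipped when ifx or the source file is missing.

```shell
#!/bin/sh
# Sketch: build and run a double precision offload code with FP64 emulation.
# SRC and EXE are hypothetical names, not taken from the thread.
SRC=${SRC:-saxpy.f90}
EXE=${EXE:-saxpy_dp}

# Before compilation: ask IGC to emulate double precision in device code.
export IGC_EnableDPEmulation=1

# Runtime settings, as quoted above.
export SYCL_DEVICE_WHITE_LIST=""
export OverrideDefaultFP64Settings=1

# Only attempt the build when ifx and the source file are actually present.
if command -v ifx >/dev/null 2>&1 && [ -f "$SRC" ]; then
  # Assumption: standard ifx OpenMP target offload flags (not from the thread).
  ifx -fiopenmp -fopenmp-targets=spir64 "$SRC" -o "$EXE" && ./"$EXE"
else
  echo "ifx or $SRC not found; environment variables set only"
fi
```

Exporting all three variables in the same shell keeps them visible to both the compile step and the run step.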

JohnNichols · ‎01-25-2023

Meanwhile, the XMX blocks are comparable to Nvidia's Tensor cores. Each XMX unit can handle either FP16/BF16 (16-bit floating point/brain floating point), INT8 (8-bit integer), or INT4/INT2 (4-bit/2-bit integer) data. These blocks are useful with deep learning workloads, including Intel's XeSS (Xe Super Sampling) upscaling algorithm. They can be used in any workload that just needs a lot of lower-precision number crunching, and each XMX block can do either 128 FP16, 256 INT8, or 512 INT4/INT2 operations per clock.

JohnNichols · ‎01-26-2023

http://www.cburch.com/books/float/

This is quite an interesting read on number representation, which relates to Fortran.

On reading the material on fixed point, I was drawn to the idea of primes as a representation system for numbers. In the table below, the top row is the primes, the cells are the counts for each prime, and the last column is the number. The columns are simple repetition, and the primes are simply the numbers represented by a single 1 in the row. I do not have to do any multiplications; I follow the pattern and read the ones.

It reminds me of the wheels on the Enigma machine. Add the prime to the prime list and move on.

Now you will all tell me why this will not work.

1	2	3	5	7	11	13	Number
0	1	0	0	0	0	0	2
0	0	1	0	0	0	0	3
0	2	0	0	0	0	0	4
0	0	0	1	0	0	0	5
0	1	1	0	0	0	0	6
0	0	0	0	1	0	0	7
0	3	0	0	0	0	0	8
0	0	2	0	0	0	0	9
0	1	0	1	0	0	0	10
0	0	0	0	0	1	0	11
0	2	1	0	0	0	0	12
0	0	0	0	0	0	1	13
0	1	0	0	1	0	0	14
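
The rows above can be reproduced mechanically. The following is a minimal sketch, assuming each cell is the exponent of that column's prime in the number's factorisation and the leading "1" column is always 0. The helper name `exponent_row` is made up for illustration, and it builds each row by trial division, which is not the pattern-following approach described in the post.

```shell
#!/bin/sh
# exponent_row NUM: print the table row for NUM (hypothetical helper).
# Columns are 1, 2, 3, 5, 7, 11, 13, then NUM itself.
# Expects NUM >= 1 whose prime factors are all <= 13.
exponent_row() {
  num=$1
  rest=$1
  row="0"                      # the "1" column is always 0
  for p in 2 3 5 7 11 13; do
    e=0
    # Count how many times the prime p divides what is left of NUM.
    while [ $(( rest % p )) -eq 0 ]; do
      rest=$(( rest / p ))
      e=$(( e + 1 ))
    done
    row="$row $e"
  done
  echo "$row $num"
}

# Print the rows for 2 through 14, matching the table.
i=2
while [ "$i" -le 14 ]; do
  exponent_row "$i"
  i=$(( i + 1 ))
done
```

In this form, multiplying two numbers amounts to adding their rows cell by cell, for example the rows for 2 and 6 add up to the row for 12.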

Ron_Green · ‎01-26-2023

Make sure your OS has a correct driver for the ARC before you buy it.

Support is best on Linux. To date, we have far more GPU offload users on Linux due to our HPC community. Linux drivers are here.

Windows drivers are here.

No macOS support for IFX or offload.

As for FP64 support - perhaps others on here can comment. There is FP32 hardware support. ARC is designed for graphics, AI, gaming.

caplanr · ‎01-26-2023

Hi,

Thanks!

I have heard that one needs Linux kernel 6 and up to have ARC work, but that should not be a problem.

Any idea of the performance hit from using emulation versus native support?

I.e. let's say my memory bound code would normally run at "X" speed with "Y" memory bandwidth in native DP, what do you think the performance loss would be running it in "DP emulation mode"?

Also, are there ANY Intel GPUs available for purchase (even integrated ones like Iris) that have native DP support yet?

Will there be a "more consumer friendly" version of the MAX GPUs (similar to NVIDIA's "Titan" line)?

Thanks again!

- Ron

Ron_Green · ‎01-26-2023

Emulation is an order of magnitude or more slower than hardware: roughly 10-100 times slower, and closer to 100x.

So if you can, stick to FP32. It's a major hit.
