Engineering Arc - 3/30/2022

Lisa_Pearce · 03-30-2022

by Lisa Pearce, Vice President and General Manager for the Visual Compute Group

Excited that launch day is here! Hope everyone is tuning in for our launch event today to check out our overview of the Intel® Arc™ graphics mobile product capabilities and key demos. Just like last time, I want to address the three questions I’ve been asked most recently that I think will interest our community:

Question #1: How does Intel’s new Matrix Engine compare to your Vector Engine in Xe HPG?

To really understand the matrix engine, it helps to understand how data flows through each of our engines. The MAC instruction (multiply-accumulate) is the basic SIMD vector instruction used in graphics and is at the heart of our vector engine. The Xe HPG Vector Engine does 8 parallel elementwise multiplications followed by 8 parallel additions (16 Ops per clock total).
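As a rough illustration of that count, here is a minimal Python sketch of an 8-lane MAC. The function name `simd_mac` and the sample values are made up for this example; it models only the arithmetic, not how the hardware schedules it.

```python
LANES = 8  # lane count taken from the description above


def simd_mac(a, b, acc):
    """Elementwise multiply-accumulate: acc[i] + a[i] * b[i] for every lane."""
    assert len(a) == len(b) == len(acc)
    return [c + x * y for x, y, c in zip(a, b, acc)]


a = [1.0] * LANES
b = [2.0] * LANES
acc = [0.5] * LANES

result = simd_mac(a, b, acc)

# One multiply plus one add per lane.
ops_per_clock = LANES * 2

print(result)         # eight lanes, each 0.5 + 1.0 * 2.0
print(ops_per_clock)  # 16
```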

DP4a is an optimization targeting AI workloads when 32-bit floating point precision is not required. It works by dividing each 32-bit input into four 8-bit chunks and then multiplying the chunks independently. Across the 8 lanes, this is a total of 32 parallel multiplications (shown by the purple squares in the diagram below). This is followed by 32 additions for the accumulation, for a total of 64 Ops per clock - a 4x improvement over the standard SIMD MAC.
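To make the chunking concrete, the sketch below emulates one lane of a DP4a-style operation in Python. It assumes signed 8-bit chunks packed lowest byte first and ignores 32-bit overflow; those choices, and the helper names `pack_int8x4`, `unpack_int8x4` and `dp4a`, are illustrative assumptions rather than a description of the actual instruction encoding.

```python
def unpack_int8x4(word):
    """Split a 32-bit word into four signed 8-bit values, lowest byte first."""
    values = []
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        values.append(byte - 256 if byte >= 128 else byte)
    return values


def pack_int8x4(values):
    """Pack four signed 8-bit values into one 32-bit word, lowest byte first."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def dp4a(a_word, b_word, acc):
    """Four 8-bit multiplies and four additions folded into one accumulator."""
    a = unpack_int8x4(a_word)
    b = unpack_int8x4(b_word)
    return acc + sum(x * y for x, y in zip(a, b))


a_word = pack_int8x4([1, 2, 3, 4])
b_word = pack_int8x4([5, 6, 7, -8])
print(dp4a(a_word, b_word, 10))  # 10 + (5 + 12 + 21 - 32) = 16

# Per lane: 4 multiplies + 4 adds = 8 ops. Across 8 lanes: 64 ops.
ops_per_clock = 8 * (4 + 4)
```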

The matrix engine accelerates further by pipelining the multiply-accumulate 4-deep. Like DP4a, each operand is sliced into 4 chunks which are multiplied and accumulated independently, 64 Ops per stage, shown again by the purple tiles. With 4 stages this yields 256 Ops per clock, a 16x increase over the traditional 32-bit SIMD MAC. This type of computational structure is sometimes called a systolic array, which can help accelerate AI applications for gamers and creators.
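The three throughput figures follow from simple arithmetic. The short Python calculation below reproduces them from the numbers given in this post; the constant names are just labels chosen for the example.

```python
LANES = 8            # parallel lanes in the vector engine
CHUNKS_PER_LANE = 4  # 8-bit chunks per 32-bit operand
PIPELINE_STAGES = 4  # depth of the matrix engine pipeline

# Each multiply is paired with one add, hence the factor of 2.
mac_ops = LANES * 2
dp4a_ops = LANES * CHUNKS_PER_LANE * 2
matrix_ops = dp4a_ops * PIPELINE_STAGES

print(mac_ops, dp4a_ops, matrix_ops)  # 16 64 256
print(dp4a_ops // mac_ops)            # 4  (DP4a vs. SIMD MAC)
print(matrix_ops // mac_ops)          # 16 (matrix engine vs. SIMD MAC)
```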

Question #2: How is Intel improving performance and compatibility for games?

As we start the journey for Intel® Arc™, we have been optimizing our software stack for a wide range of real-world workloads, especially top games and content creation workloads, to give the best possible consumer experience. Most games will run well, but we’ve found a few compatibility issues on our Intel Arc products that show up in applications developed before the products existed. We are improving our process over time and have developed a collection of techniques that can be applied solely in the driver, so they don’t require a game update when a title is no longer in development. These include identifying specific application processes and then passing flags to our compiler for efficient memory utilization, selecting which type of SIMD instruction is preferred, and even wholesale shader replacements to ensure optimal hardware performance.
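One way to picture the per-application approach is a lookup table keyed by process name. The Python sketch below is purely hypothetical: the `AppProfile` fields, the executable name and the flag string are invented for illustration and do not correspond to real Intel driver settings or compiler options.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AppProfile:
    """Hypothetical per-application tuning record."""
    compiler_flags: Tuple[str, ...] = ()
    preferred_simd_width: Optional[int] = None
    use_shader_replacements: bool = False


DEFAULT_PROFILE = AppProfile()

# Invented entries; a real driver would ship its own table.
PROFILES = {
    "oldgame.exe": AppProfile(
        compiler_flags=("--hypothetical-compact-memory",),
        preferred_simd_width=8,
        use_shader_replacements=True,
    ),
}


def profile_for(process_name: str) -> AppProfile:
    """Return the tuning profile for a process, or the default one."""
    return PROFILES.get(process_name.lower(), DEFAULT_PROFILE)


print(profile_for("OldGame.EXE").preferred_simd_width)  # 8
print(profile_for("unknown.exe") is DEFAULT_PROFILE)    # True
```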

Shaders are small programs that games execute on the GPU and are written in a specialized programming language (for example HLSL) that Intel’s compilers convert into GPU code. The Intel Arc graphics products are a new entrant to the discrete GPU landscape, and a lot of shader programs assumed certain characteristics about the underlying GPU architecture that simply don’t match well with the Xe HPG architecture.

Our approach uses shader replacement to ensure games and workloads run smoothly on Intel Arc graphics. This technique detects when specific shader programs are loaded and modifies the shaders directly – while preserving bit accurate output. By modifying the shader program, we can greatly improve compatibility and sometimes performance for older titles. Typical modifications include loop unrolling, early out on branches, and removing redundant calculations. We have significantly increased our investment in game tuning and now have shader replacements for many top titles.
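As a sketch of the idea, and not our driver's actual mechanism, the Python below detects a known shader by hashing its source and swaps in a hand-unrolled version. The shader strings, the use of SHA-256 as the key and the function names are all assumptions made for this example. The two Python functions at the end mirror the looped and unrolled forms to show that the same operations in the same order give bit-identical results.

```python
import hashlib


def shader_key(source: str) -> str:
    """Fingerprint a shader by its source text (assumed detection method)."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


# Illustrative shader fragments, not taken from any real title.
ORIGINAL = "for (int i = 0; i < 4; i++) sum += w[i] * x[i];"
UNROLLED = ("sum += w[0] * x[0]; sum += w[1] * x[1]; "
            "sum += w[2] * x[2]; sum += w[3] * x[3];")

REPLACEMENTS = {shader_key(ORIGINAL): UNROLLED}


def select_shader(source: str) -> str:
    """Return the replacement if this shader is known, else the original."""
    return REPLACEMENTS.get(shader_key(source), source)


def looped(w, x):
    total = 0.0
    for i in range(4):
        total += w[i] * x[i]
    return total


def unrolled(w, x):
    total = 0.0
    total += w[0] * x[0]
    total += w[1] * x[1]
    total += w[2] * x[2]
    total += w[3] * x[3]
    return total


w = [0.1, 0.2, 0.3, 0.4]
x = [1.5, 2.5, 3.5, 4.5]
print(select_shader(ORIGINAL) == UNROLLED)  # True
print(looped(w, x) == unrolled(w, x))       # True: same order, same bits
```

Unrolling keeps the order of the floating point operations unchanged, which is why the output stays bit-accurate; a replacement that reordered the additions could not make that guarantee.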

As part of our effort to optimize software capabilities, we tested some application-specific optimizations in popular benchmarks from UL, including 3DMark Time Spy and Port Royal, and then applied some of those optimizations to benefit real-world games. But other optimizations have not yet scaled into general game performance uplift and remain applicable only to Time Spy and Port Royal for now.

We have included these benchmark specific optimizations in the version of our driver we are releasing today on Intel.com so that we can show the full capabilities of our new Arc graphics products. For example, on Time Spy, we see an impact of approximately 15% when benchmark specific optimizations are implemented, depending on the specific Intel Arc graphics SKU. We informed UL that these benchmark optimizations will be enabled in our initial releases and, aligning with UL’s benchmark guidelines, our driver will not be a UL-approved driver for now.

By the end of April, we will add a UI option that allows users to toggle these benchmark specific optimizations on and off. This gives anyone the ability to see the top-level Intel Arc hardware potential of a fully optimized workload, as well as general benchmark performance. When the toggle is in place and the benchmark optimizations are disabled by default, the driver will be eligible for approval by UL.

We are committed to working with game developers and the software ecosystem to bring the best experience to our mutual end users. Today we have 80 of the top 100 applications fully functional, and we are growing that number every week. Of those 80, 14 titles are ray tracing enabled and fully functional on Intel Arc graphics. Since we are engaging much earlier with game developers, we expect the number of functional issues to go down quickly over time.

I’d also like to note an awesome GitHub site (https://github.com/IGCIT/Intel-GPU-Community-Issue-Tracker-IGCIT) that was developed last year and has helped us accelerate feedback from the community on top game issues, including for existing products. The feedback is always helpful! Special thanks to the IGCIT contributors.

Question #3: Why is AV1 encode as part of the new Xe Media Engine a benefit?

Our new media engine includes built-in hardware acceleration for the broadest set of codecs in the industry including HEVC, H.264/AVC, and VP9. On top of that, Intel Arc graphics is the first in the industry to support hardware accelerated AV1 decoding AND encoding. Direct hardware support for AV1 encoding provides a 50x improvement in encoding speed¹ compared to traditional software implementations!

We’ve been working with industry partners to ensure that Intel Arc graphics AV1 support is available in many popular media applications, with broader adoption expected throughout the year. AV1 hardware acceleration is currently supported in the streaming application XSplit and in creator applications including HandBrake, Adobe Premiere Pro and Blackmagic Design’s DaVinci Resolve Studio. I am looking forward to seeing what our community can do with the AV1 hardware support that will be available across the entirety of our Intel Arc A-Series family of graphics.

-Lisa

Performance Disclosure:
1. Intel® Arc™ A370M delivers 50x faster encoding with AV1 hardware acceleration compared to Intel® Core™ i7-12800H with Intel® Iris® Xe Graphics using software encode. Processor: Intel® Core™ i7-12800H, Pre Production ADL-H w/Alchemist SoC, BIOS: ADLPFWI1.R00.3091.A00.2202211056, Integrated Graphics: Intel® Iris® Xe Graphics, Integrated Graphics Driver: 30.0.101.1320, Discrete Graphics: Intel® Arc™ A370M Graphics, Discrete Graphics Driver: 30.0.101.1320, Memory: 16GB (2x8GB) DDR5 @ 4800MHz, Storage: INTEL SSDPEKKF512G7 512GB, OS: Windows 11 Version 10.0.22000.556. The AV1 workload measures the time it takes to transcode a 4K/30fps AVC @ 57.9Mbps clip to 4K/30fps AV1 @ 30Mbps High Speed format. The comparison for the 50x claim is using the Alder Lake CPU (software) to transcode the clip on a public FFMPEG build versus Alchemist (hardware) on a proof-of-concept Intel build. As measured Mar 15-16, 2022.

Notices and Disclaimers:
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See configuration disclosure for details.