Breaking the Latency Barrier for Real-Time Neural Machine Translation

Authors: Sidharth N. Kashyap, Manos Farsarakis, Krzysztof Olinski, Nikhil Deshpande

At a Glance

  • In recent third-party benchmarking at WMT21, one core of a 3rd Gen Intel® Xeon® Scalable processor broke the 10 ms latency threshold for real-time neural machine translation (NMT).
  • A single CPU core achieved better quality in 8.9 ms than an A100 GPU achieved in 13.9 ms, making it roughly 1.5x faster (13.9 / 8.9 ≈ 1.56).
  • Organizations can deploy real-time, multilanguage services on flexible, cost-effective Intel Xeon Scalable processor-based infrastructure.

Computers have been used for language translation for more than 50 years. More recently, neural machine translation (NMT) has transformed the work of translating text between languages. Applying deep learning to text translation, NMT produces results that are generally faster and more accurate than older statistical and rule-based methods [1]. 

The market for machine translation is expanding, and NMT is responsible for a significant portion of the growth [2]. NMT is the technology behind services such as Google Translate and Microsoft Translator, each of which translates text between more than 100 languages and dialects [3] [4].

But NMT solutions for real-time translation are latency-sensitive, particularly when NMT is used for delivering interactive experiences. The industry consensus is that real-time systems should achieve good quality with latency of less than 10 ms [5]. Google has stated that 7 ms is an optimal latency target for image- and video-based uses [6].

In recent benchmarks, members of the NMT community beat the 10 ms threshold on 3rd Gen Intel Xeon Scalable processors, achieving Pareto-optimal translation quality with latency as low as 8.9 ms. Their work signals that CPU-based real-time NMT is now a practical reality, opening the door to real-time, multilingual scenarios in retail, healthcare, and other industries.

About the Benchmarks

The Conference on Machine Translation (WMT), formerly known as the Workshop on Machine Translation, organized a competition on efficient machine translation, continuing an event held for the past four years [7]. This year’s conference, WMT21, is co-located with the Empirical Methods in Natural Language Processing conference.

The WMT21 efficiency task required participants to translate 1 million lines of English to German. Participants provided their own code and models, and tests were run on standardized input files and hardware. The hardware options for latency studies were:

  • A CPU option: one core of a dual-socket 3rd Gen Intel Xeon Gold 6354 server running on Oracle Cloud BM.Optimized3.36
  • A GPU option: one NVIDIA A100 GPU running on Oracle Cloud BM.GPU4.8

Team Edinburgh and Huawei Translation Service Center (TSC) submitted optimized models to the latency track and exploited the 3rd Gen Intel Xeon Scalable processor’s hardware acceleration features. Optimizations made extensive use of Intel Advanced Vector Extensions 512 (Intel AVX-512), including Vector Neural Network Instructions (VNNI).

Optimizations also took advantage of the Ice Lake (ICX) platform’s memory subsystem, which offers large capacity and high bandwidth. The matrix multiplications were optimized using customized integer matrix multiplication libraries and Intel oneDNN.
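
To give a feel for what integer matrix multiplication means here, the following is a minimal Python sketch of 8-bit quantized matrix multiplication. It is an illustration under stated assumptions, not the teams' implementation: the source does not describe their quantization scheme, and the helper names (`scale_for`, `quantize`, `quantized_matmul`) and the symmetric per-matrix scaling are hypothetical choices. Values are mapped to small integers, multiplied and accumulated as integers, and rescaled to floating point only at the end. Multiplying 8-bit values and accumulating into 32-bit sums is the kind of operation VNNI is designed to accelerate.

```python
from typing import List

Matrix = List[List[float]]


def scale_for(matrix: Matrix) -> float:
    """Pick a scale so the largest magnitude maps to 127 (assumed scheme)."""
    largest = max(abs(v) for row in matrix for v in row)
    return largest / 127 if largest > 0 else 1.0


def quantize(values: List[float], scale: float) -> List[int]:
    """Round each value to an integer step of `scale`, clamped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def quantized_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Approximate a @ b using integer multiply-accumulate."""
    scale_a, scale_b = scale_for(a), scale_for(b)
    a_q = [quantize(row, scale_a) for row in a]
    # Quantize b column by column so each output is a row-column dot product.
    b_cols_q = [quantize(list(col), scale_b) for col in zip(*b)]
    result = []
    for row in a_q:
        out_row = []
        for col in b_cols_q:
            # Integer accumulation; hardware would keep this in a 32-bit register.
            acc = sum(x * y for x, y in zip(row, col))
            # One floating-point rescale per output element.
            out_row.append(acc * scale_a * scale_b)
        result.append(out_row)
    return result


a = [[0.5, -1.0], [0.25, 0.75]]
b = [[1.0, 0.5], [-0.5, 0.25]]
approx = quantized_matmul(a, b)
print(approx)  # close to the exact product [[1.0, 0.0], [-0.125, 0.3125]]
```

The result differs slightly from the exact product because of rounding during quantization. Trading that small error for faster integer arithmetic is the general idea behind this class of optimization.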

The submissions from Team Edinburgh used MarianNMT, an efficient open-source engine that powers Microsoft translation products. Marian originated at the University of Edinburgh and Adam Mickiewicz University. Microsoft and the NMT community currently maintain the software, and it is widely used in business and government.

Reflecting the factors that make up a successful translation engine, WMT21 measured submissions on latency, quality, throughput, and size (both the model size on disk and memory consumption). Our focus is on the latency results since they highlight the most exciting breakthrough for real-time translation. Full results are available in the shared task findings [7].

Figure 1 maps the latency and quality results for both WMT21 hardware options. The benchmarks measured latency as the average time to translate a single sentence and flush the buffer.
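
The latency definition above can be read as a simple timing loop. The sketch below is a minimal illustration in Python, not the WMT21 test harness, whose details are not given here. The function name `mean_latency_ms` is hypothetical, and `str.upper` stands in for a real translation engine. Each sentence is translated on its own, the output is written and flushed, and the elapsed time is averaged over all sentences.

```python
import io
import time
from typing import Callable, Iterable, TextIO


def mean_latency_ms(
    translate: Callable[[str], str],
    sentences: Iterable[str],
    out: TextIO,
) -> float:
    """Average per-sentence time, in milliseconds, to translate and flush."""
    total_seconds = 0.0
    count = 0
    for sentence in sentences:
        start = time.perf_counter()
        out.write(translate(sentence) + "\n")
        out.flush()  # the flush is part of the measured time
        total_seconds += time.perf_counter() - start
        count += 1
    if count == 0:
        raise ValueError("no sentences to translate")
    return 1000.0 * total_seconds / count


# Stand-in "translator" for demonstration only.
buffer = io.StringIO()
latency = mean_latency_ms(str.upper, ["hello", "world"], buffer)
print(f"mean latency: {latency:.4f} ms")
```

Processing one sentence at a time, rather than batching, is what distinguishes a latency measurement from a throughput measurement.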

The results below report quality tests using the COMET 1.0.rc2 metric for evaluating machine translation quality. In addition, Microsoft has conducted a focused human evaluation of the translation quality.

Figure 1 shows the results in relation to a stair-step line representing the Pareto frontier—the optimal combination of speed and quality. The happy-face icon shows the intended direction for an ideal system.

Figure 1. WMT21 latency and quality results.
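
To make the idea of a Pareto frontier concrete, here is a minimal Python sketch that picks out the non-dominated systems from a set of (latency, quality) pairs. The function name `pareto_frontier` is hypothetical, and the data points are invented for illustration; they are not WMT21 results. A system is on the frontier if no other system is both at least as fast and at least as good.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (latency in ms, quality score)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points not dominated by another point.

    Lower latency is better and higher quality is better.
    """
    # Sort fastest first; among equal latencies, best quality first.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    frontier: List[Point] = []
    best_quality = float("-inf")
    for latency, quality in ordered:
        # A slower system earns a place only by beating every faster one on quality.
        if quality > best_quality:
            frontier.append((latency, quality))
            best_quality = quality
    return frontier


# Invented example systems, not benchmark data.
systems = [(4.0, 0.30), (6.0, 0.45), (7.0, 0.40), (10.0, 0.50), (12.0, 0.48)]
print(pareto_frontier(systems))
```

In this invented data set, the systems at 7.0 ms and 12.0 ms fall off the frontier because a faster system already achieves higher quality. Plotted with latency on one axis and quality on the other, the surviving points trace the kind of stair-step line described for Figure 1.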

Real-World Implications

The latency breakthroughs reported at WMT21 give organizations greater flexibility to deploy real-time, multilanguage applications and services that satisfy user expectations for responsiveness and interactivity. Organizations can deploy these applications with high performance on Intel Xeon Scalable processor-based infrastructure, maintaining a consistent environment for inferencing workloads and other enterprise computing.

We expect these latency breakthroughs to lead to a range of new applications that combine real-time translation with other AI and business capabilities. Retailers can reduce friction and improve customer satisfaction by finding more powerful ways to interact with customers around the world. Healthcare organizations can create AI-enhanced interfaces that save time for patients and providers while mitigating language barriers.

Intel continues to advance each generation of Intel Xeon Scalable processors with capabilities designed to accelerate performance. We look forward to further improvements in NMT latency, as well as in throughput, quality, and other metrics, in the months and years to come. We’re sharing the WMT21 models on the oneContainer portal to make them available to developers who want to build on our work.

Read Kenneth Heafield’s full report on the WMT21 benchmarks: Findings of the WMT 2021 Shared Task on Efficient Translation.

Access the WMT21 models on the Intel oneContainer portal.

Explore MarianNMT.

Learn about AI on 3rd Gen Intel Xeon Scalable processors.


[1] P. Koehn, Neural Machine Translation. Cambridge: Cambridge University Press, 2020.

[2] Mordor Intelligence, Machine Translation Market–Growth, Trends, Covid-19 Impact, and Forecasts (2021-2026).

[3] Nick Statt, Google Translate Supports New Languages for the First Time in Four Years, Including Uygur, February 26, 2020.

[4] John Roach, Azure AI Empowers Organizations to Serve Users in More Than 100 Languages, October 11, 2021.

[5] Sid Sharma, What is Conversational AI? NVIDIA Blog, February 25, 2021.

[6] Norman P. Jouppi, Cliff Young, et al., In-Datacenter Performance of a Tensor Processing Unit, 44th International Symposium on Computer Architecture, June 26, 2017.

[7] Findings of the WMT 2021 Shared Task on Efficient Translation.


Performance varies by use, configuration, and other factors.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
Intel doesn’t control or audit third-party data. Consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.