
Intel Labs Works with Hugging Face to Deploy Tools for Enhanced LLM Efficiency


These blog posts were written by Moshe Wasserblat, Oren Pereg, Jonathan Mamou, Daniel Korat, and Moshe Berchansky, with university and industry partners Nadav Timor, Joao Gante, Lewis Tunstall, and Roy Schwartz, as part of their joint research efforts with Intel Labs.

Highlights:

  • Large Language Models are revolutionizing AI applications; however, slow inference speeds remain a significant challenge. Intel researchers, along with industry and university partners, are actively working to address this issue and improve the efficiency of LLMs.
  • In a series of blog posts, Intel researchers introduce several novel methods, including one that accelerates text generation by up to 2.7 times, one that extends assisted generation to work with a small language model from any model family, and one that enables any small “draft” model to accelerate any LLM, regardless of vocabulary differences.

Large Language Models (LLMs) are revolutionizing AI applications, powering everything from chatbots to code generation. Despite their capabilities, slow inference speeds remain a significant challenge. Speculative Decoding (SD) presents a promising solution by accelerating text generation through multi-token prediction. However, conventional SD approaches typically require the assistant and target models to use the same vocabulary. Since many LLMs lack smaller, lightweight counterparts to act as assistant models, this requirement limits the flexibility and broader adoption of SD techniques.
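
For readers who want to see how assisted generation looks in practice, below is a minimal sketch using the Hugging Face Transformers `generate()` API, which runs speculative decoding when an `assistant_model` is supplied. The checkpoint names are illustrative placeholders; any target/assistant pair that shares a vocabulary can be substituted.

```python
# Minimal sketch of assisted generation (speculative decoding) in Transformers.
# Checkpoint names are illustrative; any pair sharing a tokenizer should work.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_ckpt = "meta-llama/Llama-3.1-8B-Instruct"   # large target model (example)
draft_ckpt = "meta-llama/Llama-3.2-1B-Instruct"    # small assistant from the same family (example)

tokenizer = AutoTokenizer.from_pretrained(target_ckpt)
target = AutoModelForCausalLM.from_pretrained(target_ckpt, device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained(draft_ckpt, device_map="auto")

inputs = tokenizer("Speculative decoding works by", return_tensors="pt").to(target.device)

# The assistant drafts several tokens per step; the target verifies them in a
# single forward pass, so the output matches what the target alone would produce.
outputs = target.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```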

In a series of blog posts, Intel researchers, along with industry and university partners, introduce several innovative methods to enhance the efficiency of LLMs: one that accelerates text generation by up to 2.7 times, one that extends assisted generation to work with a small language model from any model family, and one that enables any small “draft” model to accelerate any LLM, regardless of vocabulary differences. Learn more about these breakthrough technologies in the summaries below.

Faster Assisted Generation with Dynamic Speculation

This blog post introduces dynamic speculative decoding, a novel method developed by Intel Labs and Hugging Face that accelerates text generation by up to 2.7 times, depending on the task. Dynamic speculation has been integrated into release 4.45.0 of the Hugging Face Transformers library and now serves as the default operation mode for assisted decoding.
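
Because dynamic speculation is the default, the basic `generate()` call from the earlier sketch already benefits from it. The snippet below is a hedged illustration of tuning the stopping criterion; the `assistant_confidence_threshold` knob reflects recent Transformers releases and its exact name or default may differ across versions.

```python
# Continuing from the earlier sketch (reuses `target`, `assistant`, `inputs`,
# and `tokenizer`). Since Transformers 4.45.0, assisted decoding adjusts the
# number of drafted tokens on the fly by default; the threshold below is the
# tuning knob exposed in recent releases (name/default may vary by version).
outputs = target.generate(
    **inputs,
    assistant_model=assistant,
    assistant_confidence_threshold=0.4,  # assistant stops drafting once its confidence drops below this
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```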

Universal Assisted Generation: Faster Decoding with Any Assistant Model

Many LLMs lack a smaller version to use for assisted generation. To mitigate this issue, Intel Labs, in collaboration with our friends at Hugging Face, has developed Universal Assisted Generation (UAG): a method that extends assisted generation to work with a small language model from any model family. As a result, it is now possible to accelerate inference from any decoder or Mixture of Experts model by 1.5 to 2.0 times with almost zero overhead.
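
Below is a minimal sketch of how Universal Assisted Generation can be invoked through the Transformers `generate()` API: because the assistant comes from a different model family, its tokenizer is passed alongside the target's. The checkpoint names are illustrative placeholders, not the models used in the original experiments.

```python
# Minimal sketch of Universal Assisted Generation with a cross-family assistant.
# Checkpoint names are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_ckpt = "google/gemma-2-9b-it"        # target model (example)
draft_ckpt = "Qwen/Qwen2.5-0.5B-Instruct"   # assistant from a different model family (example)

target_tok = AutoTokenizer.from_pretrained(target_ckpt)
assistant_tok = AutoTokenizer.from_pretrained(draft_ckpt)
target = AutoModelForCausalLM.from_pretrained(target_ckpt, device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained(draft_ckpt, device_map="auto")

inputs = target_tok("Universal assisted generation lets", return_tensors="pt").to(target.device)

# Passing both tokenizers lets the library translate between the two vocabularies,
# while the target model still verifies every drafted token.
outputs = target.generate(
    **inputs,
    assistant_model=assistant,
    tokenizer=target_tok,
    assistant_tokenizer=assistant_tok,
    max_new_tokens=64,
)
print(target_tok.decode(outputs[0], skip_special_tokens=True))
```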

Speeding Up LLM Decoding with Advanced Universal Assisted Generation Techniques

This blog post introduces UAG-TLI, an extension of UAG that supports probabilistic decoding (sampling) and allows the use of any small LM, delivering even larger speedups. Experiments with state-of-the-art LLMs demonstrate speedups of up to 2.5 times. The UAG-TLI method is now integrated into Transformers release 4.50.0 as part of Assisted Generation (AG), making advanced AG more accessible.
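
Since UAG-TLI brings sampling into assisted generation, the same `generate()` interface shown in the UAG sketch can, in principle, be combined with standard sampling flags. The snippet below is a hedged illustration under that assumption; the sampling options are ordinary `generate()` parameters, and the temperature value is purely illustrative.

```python
# Continuing from the UAG sketch above (reuses `target`, `assistant`,
# `target_tok`, `assistant_tok`, and `inputs`). With UAG-TLI in Transformers
# 4.50.0+, sampling-based (probabilistic) decoding can also be assisted.
outputs = target.generate(
    **inputs,
    assistant_model=assistant,
    tokenizer=target_tok,
    assistant_tokenizer=assistant_tok,
    do_sample=True,       # probabilistic decoding rather than greedy
    temperature=0.7,      # illustrative sampling temperature
    max_new_tokens=64,
)
print(target_tok.decode(outputs[0], skip_special_tokens=True))
```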

Intel and Weizmann Institute Speed AI with Speculative Decoding Advance

This work introduces a significant advancement in speculative decoding: a new technique that enables any small “draft” model to accelerate any LLM, regardless of vocabulary differences. This innovation opens the door to flexible LLM deployment, enabling developers to pair any small draft model with any large model to optimize inference speed and cost across platforms. The method delivers performance gains of up to 2.8 times faster inference without compromising output quality; the algorithms are already integrated into the Hugging Face Transformers open-source library.

About the Author
Mr. Moshe Wasserblat is currently the Natural Language Processing (NLP) and Deep Learning (DL) research group manager in Intel’s AI Product group. In his former role, he was with NICE Systems for more than 17 years, where he founded NICE’s Speech Analytics Research Team. His interests are in the fields of Speech Processing and Natural Language Processing (NLP). He was the co-founder and coordinator of the EXCITEMENT FP7 ICT program and has served as organizer and manager of several initiatives, including many Israeli Chief Scientist programs. He has filed more than 60 patents in the field of language technology and has several publications in international conferences and journals. His areas of expertise include Speech Recognition, Conversational Natural Language Processing, Emotion Detection, Speaker Separation, Speaker Recognition, Information Extraction, Data Mining, and Machine Learning.