Listening to the World: A Breakthrough in AI
An Intel Engineer Helps to Catalog World Languages
By Keith Achorn, AI Framework Engineer, Software and Advanced Technology Group (SATG)
Very often, machine learning and artificial intelligence seem like lofty technologies run on remote, complex supercomputers to solve obscure problems. But I recently had an opportunity to work with an amazing group of engineers and researchers on an impactful project that brought AI home.
Starting in 2019, a working group formed under the auspices of MLCommons to enhance and democratize speech recognition technologies by creating speech datasets that are large-scale, diverse, and openly licensed. This project has so far yielded two cutting-edge datasets spanning 50 languages from around the world. Members of the group come from Intel, Harvard, Alibaba, Oracle, Landing AI, the University of Michigan, Google, Baidu, and others.
Two whitepapers describing these spoken language datasets will be presented Dec. 7 at the annual Conference on Neural Information Processing Systems (NeurIPS). The first paper is called The People’s Speech and the second is the Multilingual Spoken Words Corpus (MSWC). The People’s Speech targets “automatic speech recognition” tasks, while MSWC involves “keyword spotting.” When training machine learning models, larger datasets generally produce more accurate results, and these datasets are each among the largest public collections of speech data in their respective classes.
How does this affect people’s everyday lives? By training on these datasets, a computer or other device can “hear” spoken words and take an appropriate action, such as responding to a user’s query or producing an automatic transcript. And in today’s diverse, international, multilingual work environment, the ability to do this accurately will become increasingly important.
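To make the “hear a word, take an action” idea concrete, here is a toy sketch of how a keyword-spotting model’s output might drive device actions. The model itself is stubbed out; in practice it would be a small neural network trained on a corpus such as MSWC. The keyword names and the `classify` stub are illustrative assumptions, not part of either dataset.

```python
# Map recognized keywords to device actions. The keywords here are
# invented examples, not entries from MSWC.
ACTIONS = {
    "lights": "toggle the lights",
    "weather": "read the forecast",
    "stop": "halt playback",
}

def classify(audio_clip):
    """Stand-in for a trained keyword-spotting model.

    A real model would map a short audio clip to a keyword label (or a
    background/"unknown" class). Here we simply echo a label embedded in
    the fake clip for demonstration purposes.
    """
    return audio_clip.get("label", "unknown")

def handle(audio_clip):
    """Dispatch a recognized keyword to the matching device action."""
    keyword = classify(audio_clip)
    return ACTIONS.get(keyword, "no action")

print(handle({"label": "lights"}))  # -> toggle the lights
print(handle({"label": "hello"}))   # -> no action
```

The dictionary lookup with a default keeps unrecognized speech from triggering anything, which is the same role a background or “unknown” class plays in a real keyword-spotting model.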
Both projects utilize “diverse speech,” which means they better represent a natural environment, complete with background noise, informal speech patterns, a mixture of recording equipment, and different acoustic environments. This stands apart from highly controlled content such as audiobooks, which are more “sanitized.” Training on diverse speech has been correlated with better accuracy in real-world use.
The People’s Speech project includes tens of thousands of hours of supervised conversational audio. It is now among the world’s largest English speech recognition datasets licensed for academic and commercial usage, and is free to download.
MSWC is an audio speech dataset with over 300,000 keywords in 50 languages that can be used to train smart devices. The dataset spans languages spoken by over 5 billion people, advancing the research and development of voice applications for a wide global audience.
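A keyword corpus like MSWC consists of short clips of single spoken words, which can be derived from longer transcribed recordings. The sketch below illustrates the general idea: each word with a known start and end time is cut into a fixed-length window centered on the word. The alignment times are invented for illustration; MSWC’s actual extraction pipeline is described in the NeurIPS paper.

```python
CLIP_SECONDS = 1.0  # fixed clip length, as in short keyword clips

def word_clips(alignments, recording_seconds):
    """Turn (word, start, end) alignments into fixed-length clip windows.

    Each window is centered on the word and clamped to the bounds of the
    source recording.
    """
    clips = []
    for word, start, end in alignments:
        center = (start + end) / 2.0
        clip_start = max(0.0, center - CLIP_SECONDS / 2.0)
        clip_end = min(recording_seconds, clip_start + CLIP_SECONDS)
        clips.append((word, round(clip_start, 2), round(clip_end, 2)))
    return clips

# Invented alignments for a 5-second recording.
alignments = [("hello", 0.30, 0.70), ("world", 1.00, 1.40)]
print(word_clips(alignments, recording_seconds=5.0))
```

Centering the window on the word, rather than starting the clip at the word onset, keeps the keyword surrounded by a little natural context on both sides, which is closer to how a device would hear it in practice.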
It’s appropriate that the researchers who developed these datasets were an international group spanning multiple continents. For years, we met weekly via conference call, with each member bringing a particular expertise to the project.
Both datasets are widely available to researchers and developers under extremely open licensing terms, including for commercial use. Licensing is an under-recognized constraint; restrictive terms leave many promising datasets limited in their usability and scale.
Both datasets will be maintained for longevity by MLCommons -- a consortium of global technology providers, academics, and researchers -- of which Intel is a founding member.
In the world of language AI, this project has been a leap forward. It also opens many possibilities for the future. I look forward to working with some of my colleagues and taking it to the next step.
Keith Achorn is an AI Framework Engineer in Intel’s Software and Advanced Technology Group (SATG)