Santiago Miret is an AI research scientist at Intel Labs, where he focuses on developing artificial intelligence solutions and exploring the intersection of AI and the physical sciences.
Highlights:
- Intel and Mila collaborate on MatSci-NLP, the first broad benchmark for assessing how well language models understand materials science language and perform useful tasks for materials scientists.
- The open-source benchmark could enable faster discovery, synthesis, and deployment of new materials into a wide variety of applications, including clean energy, sustainable manufacturing, and devices.
- MatSci-NLP’s unified task-schema method is the first of its kind for materials science and can easily be applied to other fields.
Working toward the goal of creating capable language models for materials science using advanced artificial intelligence (AI), Intel and the Mila - Quebec AI Institute have created MatSci-NLP, the first broad benchmark for assessing how well language models understand materials science language and perform useful tasks for materials scientists. Recently published at the prestigious Association for Computational Linguistics (ACL) 2023 conference by Intel Labs and Mila’s Bang Liu group, MatSci-NLP contains textual data spanning a wide range of materials, such as glasses, inorganic materials, and superconductors. The open-source benchmark uses natural language processing (NLP) tasks to assess a language model’s ability to understand materials science language and perform tasks that enable faster discovery, synthesis, and deployment of new materials into a wide variety of applications, including clean energy, sustainable manufacturing, and devices. MatSci-NLP presents a tremendous opportunity to build NLP tools and create language models like ChatGPT that have a useful understanding of materials science.
The discovery of new materials is critical for addressing social and environmental challenges, such as climate change and sustainable semiconductor manufacturing. Materials science, which studies the behavior, properties, and applications of new and existing materials, is a highly interdisciplinary field that intersects with physics, chemistry, and biology as well as many engineering fields. The vast diversity of materials systems — such as metals, semiconductors, biomaterials, and organic molecules — and their interactions make the process of discovering and creating innovative materials both challenging and interesting.
Building the MatSci-NLP Benchmark
While a vast amount of materials science knowledge has been recorded in textual form, such as scientific journals, patents, and technical reports, access to these texts is often restricted by copyright protections and the difficulty of extracting data from PDF formats. Given these challenges, MatSci-NLP was created by unifying publicly available, high-quality, smaller-scale datasets into a single benchmark for fine-tuning and evaluating NLP models for materials science applications. MatSci-NLP consists of seven NLP tasks, including conventional tasks such as named entity recognition and relation classification, as well as tasks specific to materials science, such as synthesis action retrieval for constructing materials synthesis procedures. The benchmark spans a wide range of materials categories, including fuel cells, glasses, inorganic materials, superconductors, and synthesis procedures related to various kinds of materials.
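To make this structure concrete, the sketch below shows how annotations from very different tasks can be carried in one lightweight record type. The field names and label strings are hypothetical stand-ins for illustration, not the actual MatSci-NLP data format.

```python
# Illustrative sketch only: the field names and label sets below are
# hypothetical stand-ins, not the actual MatSci-NLP data format.
from dataclasses import dataclass

@dataclass
class MatSciNLPExample:
    task: str   # e.g., "named_entity_recognition", "synthesis_action_retrieval"
    text: str   # raw materials science text
    label: str  # task-specific annotation, serialized as text

# A named entity recognition example: tag materials mentions in text.
ner_example = MatSciNLPExample(
    task="named_entity_recognition",
    text="LiFePO4 cathodes were sintered at 700 C under argon.",
    label="LiFePO4 -> MATERIAL",
)

# A synthesis action retrieval example: map a sentence from a synthesis
# procedure to the action it describes.
sar_example = MatSciNLPExample(
    task="synthesis_action_retrieval",
    text="The precursor powders were ball-milled for 12 hours.",
    label="MIXING",
)

for ex in (ner_example, sar_example):
    print(f"[{ex.task}] {ex.text} => {ex.label}")
```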
Analysis of BERT-Based Language Models
We used our newly created benchmark to conduct a wide-ranging analysis of the performance of scientific language models on MatSci-NLP. In our analysis, we mainly focused on comparing language models that were pre-trained on different kinds of scientific text. Prior research hypothesized that pre-training language models on scientific text, including different kinds of materials science journals, would imbue them with more domain-specific knowledge than pre-training on general language.
Figure 1: MatSci-NLP analysis of different BERT-based language models on individual tasks and overall performance across the benchmark. Dark orange marks the best performance, and light orange marks outperformance of the general-language baseline. Most scientific language models, including ones trained on adjacent scientific domains like biology, outperform vanilla BERT on MatSci-NLP.
To control for the effects of different architectures and model sizes, we performed the analysis on language models that use the BERT transformer architecture, which revolutionized NLP upon its release in 2018. The main difference between the models was the text corpus used for pre-training, which allowed us to isolate how the type of pre-training text affects language model performance. As our results in Figure 1 show, high-quality, domain-specific text data generally improves a model’s understanding of materials science text, as reflected in better performance on MatSci-NLP. This trend, however, does not hold universally across all language models, suggesting that data quality plays a major role in determining model capabilities.
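The sketch below illustrates this controlled setup: the encoder architecture and classification head stay fixed while only the pre-trained checkpoint changes. This is a minimal illustration using the Hugging Face transformers library, not the training code from the paper; the checkpoint names are public hub identifiers, and the label count is arbitrary.

```python
# Sketch of the controlled comparison: identical architecture and
# fine-tuning setup, only the pre-training checkpoint changes.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINTS = [
    "bert-base-uncased",                 # general-language baseline
    "allenai/scibert_scivocab_uncased",  # scientific papers
    "m3rg-iitd/matscibert",              # materials science text
]

def build_model(checkpoint: str, num_labels: int):
    """Attach an identical classification head to each pre-trained encoder."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return tokenizer, model

# Same downstream task for every checkpoint; only the pre-training
# corpus differs, so performance gaps reflect the text each model saw.
for ckpt in CHECKPOINTS:
    tokenizer, model = build_model(ckpt, num_labels=5)  # 5 labels: illustrative
    print(ckpt, sum(p.numel() for p in model.parameters()), "parameters")
```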
Unified Task Schema for Multi-Task Language Modeling
Figure 2: Unified text-to-schema method for MatSci-NLP text understanding applied across seven tasks. The language model consists of a domain-specific encoder, which can be exchanged in a modular manner, and a transformer decoder pre-trained on general language.
In addition to our analysis of different language models, we investigated the effect of different data formats on model performance on MatSci-NLP. We observed that the data format had a significant effect on performance for all the models analyzed. As a result, we developed a new schema that unifies all tasks in MatSci-NLP under a single text-to-schema format, as sketched below. Our unified task-schema method is the first of its kind for materials science and can easily be applied to other fields where practitioners need a single language model to perform multiple tasks effectively.
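The core idea is that every task, whatever its native annotation format, is serialized into one shared input/output convention so a single model can be fine-tuned on all tasks jointly. The minimal sketch below conveys that idea; the prompt wording and schema fields are illustrative assumptions, not the paper’s verbatim format.

```python
# Simplified illustration of the text-to-schema idea: every task is
# rendered into one shared schema-annotated input string so a single
# model can handle all of them. Wording and fields are illustrative.

def to_unified_schema(task: str, text: str, answer_choices: list[str]) -> str:
    """Render any MatSci-NLP task as one schema-annotated input string."""
    choices = ", ".join(answer_choices)
    return (
        f"task: {task}\n"
        f"schema: {{'answer': one of [{choices}]}}\n"
        f"text: {text}"
    )

# Two very different tasks collapse into the same format:
print(to_unified_schema(
    "relation_classification",
    "The bandgap of GaN is 3.4 eV.",
    ["property_of", "synthesized_from", "none"],
))
print(to_unified_schema(
    "synthesis_action_retrieval",
    "The mixture was calcined at 900 C for 6 hours.",
    ["HEATING", "MIXING", "COOLING"],
))
```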
Figure 3: Analysis of different text input formats on MatSci-NLP across different language models. Our proposed task-schema method improves performance for all models and generally yields the best results across all tasks in MatSci-NLP.
The release of MatSci-NLP is a first step toward building capable language models for materials science that can help scientists accelerate the development of new materials systems for a more sustainable world. Going forward, we plan to expand the range of tasks and data in the benchmark and take advantage of the more powerful language models that AI researchers have recently released.