HoneyBee: Intel Labs and Mila Collaborate on State-of-the-Art Language Model for Materials Science

Santiago_Miret · 12-11-2023

Santiago Miret is an AI research scientist at Intel Labs, where he focuses on developing artificial intelligence solutions and exploring the intersection of AI and the physical sciences.

Highlights:

Intel Labs and Mila collaborate on HoneyBee, a large language model specialized to materials science.
The team uses MatSci-Instruct, an instruction-based process for trustworthy data curation in materials science, to fine-tune HoneyBee.
HoneyBee is the first open-source, billion-parameter-scale language model specialized to materials science, and it achieves state-of-the-art performance on the open-source MatSci-NLP benchmark.

Intel and the Mila - Quebec AI Institute continue to develop novel AI tools for materials discovery to address challenges such as climate change and sustainable semiconductor manufacturing. Building on that research, Intel Labs and the Bang Liu group at Mila have collaborated on HoneyBee, a state-of-the-art large language model (LLM) specialized to materials science that is now available on Hugging Face. HoneyBee was recently accepted as a Findings poster presentation at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2023) and as a spotlight at the AI for Accelerated Materials Discovery (AI4Mat) Workshop at the Conference on Neural Information Processing Systems (NeurIPS 2023).

As described in the MatSci-NLP paper and blog from our Intel Labs and Mila collaboration, materials science is a complex interdisciplinary field that seeks to understand the interaction of matter to effectively design, fabricate, and analyze new materials systems. The vast amount of research literature and textual information contained in diverse documents creates an opportunity to design specialized scientific LLMs that can understand domain-specific scientific language as well as specialized text, such as chemical and mathematical formulas. To that end, we developed HoneyBee, the first open-source, billion-parameter-scale LLM specialized to materials science, which has achieved state-of-the-art performance on our open-source MatSci-NLP benchmark.

Trustworthy Training Data Generation Using MatSci-Instruct

One particular challenge in developing LLMs for materials science is the lack of high-quality annotated scientific textual data. This challenge is compounded by the fact that much scientific knowledge is expressed in domain-specific language that has a precise meaning in a given scientific context. Because high-quality data is so important, a trustworthy process is required to compile training and evaluation data for scientific LLMs. While annotation by experts is the most desirable option, it is infeasible to perform at scale. To address the challenge of creating high-quality textual data, we propose MatSci-Instruct, a trustworthy instruction-data generation process that can be used to generate fine-tuning data for LLMs in scientific domains, specifically materials science. MatSci-Instruct builds upon two main insights:

We can mitigate bias and introduce further robustness by evaluating generated fine-tuning data using multiple, independent LLMs, thereby creating trustworthiness for both the generated data and the resulting LLM itself.
Very large LLMs have shown emergent abilities in domains on which they were not initially trained, and can be further refined for specific domains using instruction-based fine-tuning.

Progressive Fine-Tuning of Materials Science Language Models

Figure 1. MatSci-Instruct generates instruction-based data using independent LLMs for greater robustness. The data is then used to train HoneyBee, a specialized materials science LLM. The process of data generation and fine-tuning is repeated iteratively, leading to progressive improvement of HoneyBee’s performance.

Figure 1 shows the primary workflow for domain-specific materials data generation using MatSci-Instruct, which is then used to train HoneyBee, a materials science LLM. The process follows three primary steps:

Generation: The Instructor (ChatGPT) generates materials science text data, which provides the basis for LLM fine-tuning data.
Verification: The data generated by the Instructor is checked by an independent Verifier LLM (Claude), which filters out low-quality data using predetermined criteria.
Model fine-tuning and evaluation: The verified data is used to train HoneyBee language models, which are then evaluated by an additional independent LLM, the Evaluator (GPT-4).
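
As a rough illustration, the three steps can be sketched as a loop. This is a minimal sketch and not the code used to train HoneyBee: the names `Sample`, `run_round`, and `progressive_finetune`, the numeric verifier score, and the `min_score` threshold are all assumptions made for this example, and the four callables stand in for calls to the Instructor, the Verifier, the fine-tuning job, and the Evaluator.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    """One instruction/response pair (hypothetical structure)."""
    instruction: str
    response: str


def run_round(
    generate: Callable[[int], List[Sample]],   # stands in for the Instructor
    verify: Callable[[Sample], float],         # stands in for the Verifier
    finetune: Callable[[List[Sample]], None],  # stands in for model fine-tuning
    evaluate: Callable[[], float],             # stands in for the Evaluator
    n_samples: int,
    min_score: float,
) -> Tuple[float, int]:
    """Run one generation -> verification -> fine-tuning/evaluation cycle."""
    candidates = generate(n_samples)
    # Keep only samples whose verifier score meets the (assumed) threshold.
    kept = [s for s in candidates if verify(s) >= min_score]
    finetune(kept)
    return evaluate(), len(kept)


def progressive_finetune(
    generate, verify, finetune, evaluate,
    rounds: int, n_samples: int = 4, min_score: float = 0.5,
) -> List[Tuple[float, int]]:
    """Repeat the cycle and record (evaluation score, samples kept) per round."""
    return [
        run_round(generate, verify, finetune, evaluate, n_samples, min_score)
        for _ in range(rounds)
    ]


# Toy stand-ins so the sketch runs without any model or API access.
trained: List[Sample] = []
history = progressive_finetune(
    generate=lambda n: [Sample(f"Q{i}", "A" * i) for i in range(n)],
    verify=lambda s: 1.0 if s.response else 0.0,  # reject empty responses
    finetune=trained.extend,
    evaluate=lambda: float(len(trained)),         # placeholder "score"
    rounds=2,
)
print(history)
```

Passing the four roles as separate callables mirrors the point that the Instructor, Verifier, and Evaluator are independent models, so any one of them could be swapped without touching the loop.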

Figure 2. Materials science topics covered by MatSci-Instruct to train HoneyBee.

The three steps above are iteratively repeated to progressively improve the performance of HoneyBee language models with each additional cycle. Both the quality of the generated materials science text data and the quality of the HoneyBee LLMs improve with each refinement. As shown in Figure 2, the MatSci-Instruct generated data spans a diverse set of relevant materials science topics, which is necessary to effectively train LLMs on complex scientific domains.

HoneyBee Language Models

Figure 3. Correlation between scores of the Verifier LLM and expert evaluation shows generally good agreement.

To better understand the effectiveness of MatSci-Instruct and the performance of HoneyBee, our paper outlines various experiments. We first study how well the results from the Verifier and Evaluator models correlate with evaluations from human experts. As shown in Figure 3, the relatively high correlation indicates good agreement between the human experts and the LLMs. This suggests that the LLMs used in the MatSci-Instruct process can be used to generate trustworthy fine-tuning data.
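
For readers who want to run a similar agreement check on their own data, the sketch below computes two common correlation coefficients between LLM-assigned scores and expert scores. The post does not state which coefficient the paper reports, so both Pearson and Spearman are shown as assumptions, and the score lists are made-up placeholders rather than results from the paper.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: strength of the linear relationship."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder scores for illustration only (not data from the paper).
verifier_scores = [4.0, 3.5, 2.0, 4.5, 1.0]
expert_scores = [4.5, 3.0, 2.5, 5.0, 1.5]
print(pearson(verifier_scores, expert_scores))
print(spearman(verifier_scores, expert_scores))
```

Spearman depends only on how the two raters order the samples, which can make it a reasonable choice when the LLM and the experts use their scoring scales differently.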

Figure 4. Progressive fine-tuning of HoneyBee shows consistent improvement in model performance.

Next, we study the performance of HoneyBee models as they undergo progressive fine-tuning. Figure 4 shows two relevant findings:

Both HoneyBee-7b and HoneyBee-13b, where the suffix denotes the number of parameters in the LLM (7 billion and 13 billion), show progressive improvement with each fine-tuning iteration. This provides evidence to support the efficacy of the iterative process.
In some cases, highlighted in light yellow, HoneyBee-13b is able to exceed the performance of the original Instructor (ChatGPT). This behavior has also been observed in other studies of instruction fine-tuned LLMs, further indicating the value of MatSci-Instruct.

Figure 5. Low-resource fine-tuning and zero-shot evaluation results for various HoneyBee models on MatSci-NLP tasks. Macro-F1 (top) and micro-F1 (bottom) scores are highlighted in dark yellow for the best, yellow for the second-best, and light yellow for the third-best performing LLM.

Finally, we study the performance of HoneyBee language models on the MatSci-NLP benchmark (see Figure 5). We follow the same procedure described in the MatSci-NLP paper and find that HoneyBee outperforms all LLMs in the original MatSci-NLP analysis. In the zero-shot setting, where LLMs are evaluated on the benchmark data without any additional training, HoneyBee outperforms all LLMs except for GPT-4, which was the Evaluator in MatSci-Instruct. Nevertheless, HoneyBee-13b achieves performance competitive with GPT-4 while having significantly fewer parameters (up to 10x fewer). This speaks to the high degree of specialization achieved through HoneyBee, making it a state-of-the-art language model for materials science.
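
Because Figure 5 reports both macro-F1 and micro-F1, it may help to see how the two differ. The sketch below uses the standard definitions for a single-label classification task; it is not the MatSci-NLP evaluation code, and the function name `f1_scores` and the toy labels are assumptions for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def f1_scores(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Tuple[float, float]:
    """Return (macro-F1, micro-F1) for single-label classification."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_: int, fp_: int, fn_: int) -> float:
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    labels = set(y_true) | set(y_pred)
    # Macro: average the per-class F1, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts first, so every example counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy labels: the rare class "ceramic" is always missed.
y_true = ["metal", "metal", "metal", "ceramic"]
y_pred = ["metal", "metal", "metal", "metal"]
macro, micro = f1_scores(y_true, y_pred)
print(f"macro-F1={macro:.3f} micro-F1={micro:.3f}")
```

In this toy case micro-F1 stays at 0.75 while macro-F1 drops to about 0.43, because macro averaging penalizes the missed rare class as heavily as the common one. Reporting both scores shows whether a model does well only on frequent classes.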