Santiago Miret is an AI research scientist at Intel Labs, where he focuses on developing artificial intelligence solutions and exploring the intersection of AI and the physical sciences.
Highlights:
- Intel Labs recently released the Open MatSci ML Toolkit version 1.0 on August 31, making training of advanced AI models on materials data more accessible for materials evaluation and discovery.
- Open MatSci ML Toolkit 1.0 adds over 1.5 million solid-state materials data points sourced from a diverse set of open datasets, including the Materials Project, OQMD, NOMAD, and the Carolina Materials Database, alongside continued support for the Open Catalyst dataset.
- The toolkit also provides a generative pipeline for creating new crystal structures, and support for two major graph neural network frameworks (DGL and PyG).
Intel Labs recently released the Open MatSci ML Toolkit version 1.0 on August 31, making training of advanced artificial intelligence (AI) models on materials data more accessible for materials evaluation and discovery. Complex societal challenges, such as climate change, sustainable agriculture, and compute accessibility, rely heavily on the discovery of new materials systems. The vast design space of materials and the complex physical and chemical relationships they exhibit in applications require advanced AI tools to evaluate and discover new materials more efficiently.
An earlier 2022 pre-release of the toolkit, published in the Transactions on Machine Learning Research journal, emphasized a streamlined approach for model training on the Open Catalyst dataset, enabling different scales of compute resources across different hardware platforms, including CPUs, GPUs, and AI accelerators. Available under an open-source MIT license on GitHub, Open MatSci ML Toolkit 1.0 and its associated paper bring general quality-of-life improvements along with new features, including:
- More than 1.5 million solid-state materials data points sourced from a diverse set of open datasets, including the Materials Project, OQMD, NOMAD, and the Carolina Materials Database with pre-packaged datasets available for download.
- A generative pipeline for creating new crystal structures based on state-of-the-art diffusion models.
- Support for DGL and PyG, two major graph neural network frameworks.
The new version of the toolkit retains support for the Open Catalyst dataset and all the scaling and usability advantages of the original pre-release, enabling the open-source and research communities to build more generalist materials science models.
Solid-State Materials Datasets
The solid-state materials data in the Open MatSci ML Toolkit were sourced from open-source datasets of simulated materials properties through Density Functional Theory (DFT). Based on quantum mechanics, DFT is considered the gold standard for computational materials modeling, but simulation can often be costly and slow to converge to reliable results. To overcome these obstacles, advanced AI models based on geometric deep learning trained on DFT-based materials data have the potential to speed up evaluation of materials behavior and properties by 10x to 100x.
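As a concrete, greatly simplified illustration of why such surrogate models are viable: any model that replaces DFT must respect the physical symmetries of a structure, namely that rotating the atoms or relabeling them does not change the energy. The sketch below builds a rotation- and permutation-invariant fingerprint from interatomic distances. Real geometric deep learning models learn far richer representations; the function name here is illustrative and not part of the toolkit.

```python
import numpy as np

def pair_distance_descriptor(positions, n_bins=16, r_max=6.0):
    """Histogram of pairwise interatomic distances: a simple
    rotation- and permutation-invariant fingerprint of a structure.
    positions: (N, 3) Cartesian atomic coordinates in angstroms."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Keep each pair once (upper triangle, excluding the diagonal).
    pair_dists = dists[np.triu_indices(len(positions), k=1)]
    hist, _ = np.histogram(pair_dists, bins=n_bins, range=(0.0, r_max))
    return hist / max(len(pair_dists), 1)
```

Because only distances enter the histogram, rigidly rotating the structure or reordering the atoms leaves the descriptor unchanged, which is exactly the invariance a DFT surrogate needs.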
Table 1: Materials data from various open-source datasets included in the Open MatSci ML Toolkit 1.0.
The Open MatSci ML Toolkit has curated over 1.5 million data points from ground-state DFT calculations, as shown in Table 1 under Energy Prediction Tasks. This is in addition to the Open Catalyst dataset, which was already included in the initial pre-release in 2022. Energy prediction is one of the most common prediction tasks in both molecular and solid-state crystal-structure modeling, and it is included in most relevant databases. Since energy is a critical property of a materials system, it can be used to understand many aspects of materials behavior. Energy has also inspired energy-based learning methods in machine learning, which have led to breakthroughs in generative AI influenced by physical concepts. The widespread availability of energy labels in datasets such as Open Catalyst allows researchers to combine multiple datasets in a multi-data setting to train AI models that generalize to a broader set of use cases.
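One practical obstacle when combining datasets is that different databases compute total energies with different DFT settings, so the raw labels sit on incompatible scales. A common remedy, sketched below with illustrative names rather than the toolkit's actual recipe, is to fit per-element reference energies by least squares and subtract them, leaving formation-like energies that are more comparable across sources.

```python
import numpy as np

def fit_element_references(compositions, energies):
    """Fit per-element reference energies mu by least squares so that
    E_corrected = E - sum_i n_i * mu_i places energies from datasets
    with different DFT settings on a more comparable scale.
    compositions: (S, n_elements) element counts per structure.
    energies:     (S,) total energies."""
    X = np.asarray(compositions, dtype=float)
    y = np.asarray(energies, dtype=float)
    mu, *_ = np.linalg.lstsq(X, y, rcond=None)
    return mu

def correct_energies(compositions, energies, mu):
    """Subtract the fitted elemental contributions."""
    return np.asarray(energies, float) - np.asarray(compositions, float) @ mu
```

For a toy two-element system where structures are simple element sums, the fit recovers the elemental energies exactly and the corrected labels collapse to zero.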
Toward Generalist Materials Models
The new datasets and generative pipeline provide a greater diversity of tasks that will enable researchers to train more generalist AI models for materials science. These generalist models are important in understanding the behavior of new materials needed for a diversity of applications to address today’s societal challenges. Materials needed to mitigate climate change, for example, will often exhibit different behavior from materials needed to provide next-generation computers, which are then different from drug-like compounds. Generalist AI models for materials evaluation and materials design provide the first step in substantially accelerating materials discovery to meet current and emerging needs.
Table 2. Benchmark results on Materials Project multi-task learning. Shown are the best-performing results (dark blue) and the single-task baseline (gray); multi-task runs that outperform the single-task baseline are highlighted in light blue. Multi-task learning generally outperforms single-task learning on regression tasks, with only a small performance difference between additive losses and PCGrad, a state-of-the-art multi-task learning method.
In our research study using MatSci ML, models are trained in multi-task settings (one model predicting multiple targets) and multi-data settings (one model predicting the same task across multiple datasets). As shown in Tables 2 and 3, training on multiple related tasks tends to improve model performance, illustrating the value of the additional information. This holds both for vanilla multi-task training and for PCGrad, which minimizes conflicts between tasks.
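PCGrad's core rule is simple to state: when two task gradients conflict (negative cosine similarity), project one onto the normal plane of the other before combining them, so the update does not pull directly against either task. A minimal NumPy sketch of this gradient surgery follows; the original method visits the other tasks in random order, which is replaced here by a fixed order for clarity.

```python
import numpy as np

def pcgrad(grads):
    """Combine per-task gradients with PCGrad-style gradient surgery.
    grads: list of 1-D NumPy arrays, one flattened gradient per task.
    Returns the summed gradient after projecting away conflicts."""
    projected = []
    for i, g in enumerate(grads):
        g = np.asarray(grads[i], dtype=float).copy()
        for j, h in enumerate(grads):
            if j == i:
                continue
            h = np.asarray(h, dtype=float)
            dot = g @ h
            if dot < 0:  # conflict: remove the component of g along h
                g -= dot / (h @ h) * h
        projected.append(g)
    return np.sum(projected, axis=0)
```

With non-conflicting gradients the rule reduces to plain additive combination, which matches the observation in Table 2 that PCGrad and additive losses often perform similarly.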
Table 3. Benchmark results on energy-and-forces multi-dataset learning. Shown are the best-performing results (dark blue) and the single-task baseline (gray); multi-data runs that outperform the single-task baseline are highlighted in light blue. Multi-data training outperforms the single-task baseline in some cases for both models. A dash indicates the setting is not applicable.
We observe similar effects in the multi-data settings where additional data from closely related tasks generally leads to increased model performance for both energy and force prediction.
Generative Materials Pipeline
In addition to quickly evaluating the properties of materials with advanced AI models, generating new materials is essential to addressing today's complex challenges. Because a material is a complicated arrangement of atoms, described by atomic types, atomic positions, and the geometric arrangement of the crystal lattice, a generative materials pipeline requires targeted features not found in many modern generative AI methods.
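Concretely, a periodic crystal is captured by three coupled pieces of data that a generative pipeline must produce jointly: the lattice vectors, the fractional coordinates of atoms within the unit cell, and the atomic species. A minimal container for that representation is sketched below; it is illustrative and not the toolkit's actual data structure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Crystal:
    """Minimal periodic-crystal representation."""
    lattice: np.ndarray         # (3, 3) lattice vectors in angstroms
    frac_coords: np.ndarray     # (N, 3) fractional positions in [0, 1)
    atomic_numbers: np.ndarray  # (N,) element identities

    def cart_coords(self):
        # Fractional coordinates times the lattice give Cartesian positions.
        return self.frac_coords @ self.lattice
```

The coupling is what makes generation hard: changing the lattice moves every atom's Cartesian position, and swapping a species changes which geometries are physically plausible.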
Table 4. Generation quality metrics of CDVAE matching the quality of the original implementation in Xie et al. with a new subsample of Materials Project (mp25).
Nonetheless, we can build upon the success of modern generative AI techniques, such as diffusion models in the image and language domains, to create the foundation of a generative materials pipeline. To do so, we integrated Crystal Diffusion Variational Autoencoder (CDVAE) in the Open MatSci ML Toolkit. In our experiments, we can extend the training dataset of the original CDVAE paper and closely match the reported results from the original paper as shown in Table 4.
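The diffusion machinery underlying CDVAE-style models follows the standard Gaussian formulation: a forward process gradually noises the data, and a learned model reverses it to generate new samples. The closed-form forward step is sketched below as a generic diffusion illustration; CDVAE's actual parameterization differs, since it must also handle lattices and composition.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) for a Gaussian diffusion process:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    where alpha_bar_t is the cumulative product of (1 - beta_s).
    Returns (x_t, eps); eps is the noise a model learns to predict."""
    alpha_bar = np.cumprod(1.0 - np.asarray(betas, float))[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps
```

At small t the sample stays close to the clean data; at large t it approaches pure noise, which is where generation begins when the process is run in reverse.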
Discovering, modeling, evaluating, and understanding solid-state materials will continue to play a significant role in the complex technological challenges of the future, and subsequent versions of the Open MatSci ML Toolkit will continue to enable practitioners to leverage the most advanced AI methods.
Open MatSci ML Toolkit 1.0 is available under an open-source MIT license on GitHub.