
Data Science at Scale with Modin

MaryT_Intel
Community Manager

Intel Distribution of Modin in the Intel® oneAPI AI Analytics Toolkit Enables Scalable Data Analytics

Last updated: 2021-03-15

 

AI and data science are advancing rapidly, which lets us handle ever more data and perform ever more complex analyses. At the same time, these advances are shifting the focus from value extraction to systems engineering. This trend pressures data scientists to act as data systems or cloud systems engineers and to deal with infrastructure issues instead of focusing on the core of data science: generating insights. One cause of this shift is the absence of optimized data science and machine learning infrastructure for data scientists who are not necessarily software engineers.

Data scientists are creatures of habit. They like the tools they are used to in the Python data stack, such as pandas, Scikit-learn, NumPy, and PyTorch. However, these tools are often unsuited to parallel processing or terabytes of data. The Intel® oneAPI AI Analytics Toolkit (AI Kit) aims to solve data scientists' most critical and central problem: how to make their familiar software stack and APIs scalable.


Today, we will talk about one of the main components of the AI Kit: the Intel Distribution of Modin. Modin is a performant library that is compatible with the pandas API. To accelerate a pandas workload, you only need to change a single line of code: write import modin.pandas as pd instead of import pandas as pd. Modin has three distinguishing characteristics that we are going to cover in a series of blogs:

  1. Parallelized pandas for high performance. As of v0.9, Modin supports 94% of the pandas API and is integrated with the Python ecosystem (e.g., NumPy, XGBoost, Scikit-learn).
     
  2. The ability to run pandas workloads on different backends. Out of the box, Intel Distribution of Modin supports the OmniSci DB engine, a high-performance framework for end-to-end analytics that has been optimized for current and future Intel hardware, including GPUs.
     
  3. On-demand, practically infinite scalability to the cloud, right from your Jupyter notebook.

As an added bonus, Modin also has a rich frontend supporting SQL, a spreadsheet API, and Jupyter notebooks.
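
The one-line import swap described above can also be written so that a script still runs where Modin is not installed. The following is a minimal sketch, not something Modin provides: the backend variable, the count_by_cab_type helper, and the sample values are hypothetical and only illustrate that the rest of the code is unchanged whichever library is imported.

```python
# Prefer Modin when it is installed; otherwise fall back to stock pandas.
# The "backend" name is a hypothetical label used only for illustration.
try:
    import modin.pandas as pd
    backend = "modin"
except ImportError:
    try:
        import pandas as pd
        backend = "pandas"
    except ImportError:
        pd = None
        backend = None


def count_by_cab_type(cab_types):
    """Count rows per cab type using whichever pandas-compatible module loaded."""
    df = pd.DataFrame({"cab_type": cab_types})
    # The same pandas API call works with either import.
    return df.groupby("cab_type").size().to_dict()


if pd is not None:
    # Hypothetical sample values, not taken from the NYC Taxi data set.
    print(backend, count_by_cab_type(["yellow", "green", "yellow"]))
```

Because only the import line differs, the helper does not need to know which library is in use.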
 

Installation

The easiest way to get Modin is with Conda, either as part of the AI Analytics Toolkit from Intel's Anaconda channel or stand-alone from the stock conda-forge channel.


Installing Modin from the AI Kit

The AI Analytics Toolkit provides a consolidated package of Intel's latest deep learning and machine learning optimizations in one place, with seamless interoperability and high performance. The toolkit includes Intel-optimized versions of machine learning frameworks and Python libraries, along with Modin, to streamline end-to-end data science and AI workflows on Intel architectures.


Intel Distribution of Modin can be installed from the AI Analytics Toolkit with the Conda package manager:

conda create -n aikit-modin intel-aikit-modin -c intel -c conda-forge
conda activate aikit-modin


Installing Modin from the Stock conda-forge Channel

Alternatively, Modin can be installed from the conda-forge channel. If you use this method, you'll need to install OmniSci separately. (OmniSci is included if you install using the AI Kit.)

conda create -n stock-modin modin -c conda-forge


Modin Scalability

To showcase Modin’s scalability, perhaps its most important but least known capability, we’ll use the well-known NYC Taxi example. The NYC Taxi benchmark consists of four workloads. We’ll use the first: a group-by query of the trips_data.csv data in a Modin dataframe.

import modin.pandas as pd
df = pd.read_csv('~/trips_data.csv')
df.groupby("cab_type").size()


This is equivalent to the SQL statement:

SELECT cab_type, count(*) FROM trips GROUP BY cab_type;
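
To make the equivalence concrete without needing the full data set, the sketch below runs the SQL statement above against an in-memory SQLite table and compares it with a plain Python group-and-count. It uses only the Python standard library. The five sample rows and both helper function names are hypothetical stand-ins, not part of the NYC Taxi benchmark or of Modin.

```python
import sqlite3
from collections import Counter

# Hypothetical sample rows standing in for the cab_type column of trips_data.csv.
trips = ["yellow", "green", "yellow", "yellow", "green"]


def count_by_cab_type_python(cab_types):
    """Group and count in plain Python, mirroring groupby("cab_type").size()."""
    return dict(Counter(cab_types))


def count_by_cab_type_sql(cab_types):
    """Run the article's SQL statement against an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE trips (cab_type TEXT)")
        conn.executemany(
            "INSERT INTO trips VALUES (?)", [(c,) for c in cab_types]
        )
        rows = conn.execute(
            "SELECT cab_type, count(*) FROM trips GROUP BY cab_type"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)


python_counts = count_by_cab_type_python(trips)
sql_counts = count_by_cab_type_sql(trips)
print(python_counts == sql_counts)
```

Both approaches produce one count per distinct cab_type value, which is also what the Modin group-by returns.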


The Modin query is executed locally on your laptop, but what if more compute power is needed and the 1.5B records don’t fit into the local storage? In this case, Modin provides experimental remote cluster capabilities. For example:

import modin.pandas as pd
from modin.experimental.cloud import cluster

# Everything inside the with block runs on the remote cluster.
with cluster.create("aws", "aws_credentials"):
    df = pd.read_csv('s3://taxi_data/trips_data.csv')
    df.groupby("cab_type").size()


The with statement creates a remote execution context in the cloud, Amazon Web Services in this case, with credentials provided by the user in aws_credentials.json. Modin automatically connects to Amazon Web Services, spawns a cluster for distributed computation, provisions the Modin environment, and then remotely executes all the Modin statements within the with clause. From the user's perspective, this all appears to be happening locally.
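
For readers less familiar with how a with statement scopes setup around a block of code, here is a generic Python context-manager sketch. It is not Modin's implementation and makes no network calls: the remote_context function, the events list, and the step names are all hypothetical, and the shutdown step at the end is an assumption about typical cleanup rather than something stated above.

```python
from contextlib import contextmanager

# Records the order of operations so the flow is visible.
events = []


@contextmanager
def remote_context(provider, credentials_path):
    """Hypothetical stand-in for a cluster context; performs no real work."""
    # Setup runs before the body of the with block.
    events.append(f"connect:{provider}:{credentials_path}")
    events.append("spawn_cluster")
    events.append("provision_environment")
    try:
        yield
    finally:
        # Assumed cleanup step; runs even if the body raises an exception.
        events.append("shutdown_cluster")


with remote_context("aws", "aws_credentials"):
    # Statements here run between setup and cleanup.
    events.append("run_query")

print(events)
```

The body of the with block reads like ordinary local code, while the setup and cleanup stay hidden inside the context manager.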

 

Notices and Disclaimers

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available options. Learn more at www.Intel.com/PerformanceIndex.

Intel technologies may require enabled hardware, software, or service activation. No product or component can be absolutely secure.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
 

About the Author
Mary is the Community Manager for this site. In her spare time, she likes to bike and to do college and career coaching for high school students.