
Python* Data Science at Scale

Jack_Erickson
Employee

Data scientists and AI developers need to be able to explore and experiment with extremely large datasets as they converge on novel solutions for deployment in production applications. Exploration and experimentation require a lot of iteration, which is only feasible with fast turnaround times. While model training performance is an important piece of the puzzle, the entire end-to-end process must be addressed. Loading, exploring, cleaning, and adding features to large datasets can often be so time-consuming that it limits exploration and experimentation. And responsiveness during inference is often crucial once a model is deployed.

Many solutions for large-scale AI development require installing new packages and rewriting code to use their APIs. For instance, data scientists and AI developers often use pandas to load data for machine learning applications. But once a dataset grows beyond roughly 100 MB, loading and cleaning it slows down noticeably because pandas runs on a single core. At that point, they must switch to a different data loading and pre-processing stack, such as Apache Spark*, which requires learning the Spark API and overhauling existing code to integrate it. This is usually an inopportune time to make such changes, and it is not a good use of data scientists' and AI developers' skills.

Intel has been working to improve the performance of popular Python* libraries while maintaining the usability of Python, by implementing the key underlying algorithms in native code using oneAPI performance libraries. This delivers concurrency at multiple levels, such as vectorization, multi-threading, and multi-processing, and does so with minimal impact on existing code.
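One example of this "minimal impact on existing code" approach is Modin, which Intel distributes as a drop-in replacement for pandas: it exposes the same API but parallelizes operations across all available cores. The sketch below is illustrative, not from the original article; the try/except lets the same script run with stock pandas when Modin is not installed, and the tiny in-memory DataFrame stands in for a large CSV load.

```python
# Illustrative sketch: Modin mirrors the pandas API but runs in parallel.
# If Modin is absent, stock pandas runs the identical code on one core.
try:
    import modin.pandas as pd  # drop-in replacement for pandas
except ImportError:
    import pandas as pd  # same API, single-core execution

# Small stand-in for a large load such as pd.read_csv("trips.csv")
df = pd.DataFrame({
    "passenger_count": [1, 2, 1, 4, 2, 1],
    "fare_amount": [7.0, 12.0, 6.0, 30.5, 14.0, 8.0],
})

# A typical pre-processing step: aggregate fares by passenger count
mean_fares = df.groupby("passenger_count")["fare_amount"].mean()
print(mean_fares.loc[1])  # mean fare for single-passenger trips: 7.0
```

Because the API is unchanged, the usual pandas workflow (read, clean, group, feature-engineer) needs only the one-line import swap to scale across cores.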

See how easy it is to accelerate your end-to-end workflow with these technologies, as Rachel Oberman of Intel demonstrates using the full New York City taxi fare data set.

 

To get started, you don't need to interact with the oneAPI libraries directly: just download the Intel-optimized versions of your Python libraries or the Intel® AI Analytics Toolkit (AI Kit). If you use Anaconda*, many of these libraries are available in the default conda channel, while others are in the Intel channel. To download components for direct installation, visit AI Tools, Libraries, and Framework Optimizations.
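To show what "minimal impact on existing code" looks like once the optimized packages are installed, here is a hedged sketch using the Intel Extension for Scikit-learn (PyPI package `scikit-learn-intelex`), which patches scikit-learn so that supported estimators run on oneDAL-accelerated kernels. The example is mine, not from the article; it falls back to stock scikit-learn if the extension is not installed.

```python
# Sketch: patch scikit-learn with Intel's oneDAL-backed implementations.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()  # must be called before importing sklearn estimators
except ImportError:
    pass  # stock scikit-learn: same results, slower on large data

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 8))  # synthetic stand-in for real features

# The estimator API is unchanged; only the backend differs.
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(model.cluster_centers_.shape)  # (4, 8)
```

The two patching lines are the only addition to an existing scikit-learn script, which is the pattern the article describes: acceleration without rewriting code against a new API.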

About the Author
Technical marketing manager for Intel AI/ML products and solutions. Before Intel, I spent 7.5 years at MathWorks in technical marketing for the HDL product line, and 20 years at Cadence Design Systems in various technical and marketing roles for synthesis, simulation, and other verification technologies.