
Intel® Optimization for XGBoost on Ray with RayDP Delivers Better Performance than Apache Spark

KariFredheim, Employee

Co-Authors: 

  • Carson Wang, Intel AI Software Engineering Manager
  • Zhi Lin, Intel AI Frameworks Engineer

XGBoost is a popular and robust gradient-boosting algorithm for training machine-learning models. Training a model often involves a fair amount of data processing, so running distributed XGBoost on Apache Spark has been common practice, leveraging Spark’s powerful data processing engine for tasks like feature engineering. We recently discovered that this approach may not provide optimal performance.

In this article, we compare XGBoost on Apache Spark with XGBoost on Ray and observe the following results in favor of Ray:

  • 1.9x ~ 12.9x faster data load
  • 1.9x ~ 6.3x faster DMatrix conversion
  • 1.1x ~ 1.7x faster training, and
  • An overall 1.2x ~ 3.9x end-to-end speedup.

Why Spark?

Apache Spark is a popular data processing framework that can quickly perform processing tasks on large datasets, providing good performance and fault tolerance. Because XGBoost integrates with Spark MLlib, the two combine their advantages: Spark handles the distributed data processing, and an XGBoost training instance runs on each Spark executor.
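For illustration, a minimal XGBoost-on-Spark training job using the PySpark estimator that ships with XGBoost 1.7+ might look like the sketch below; the HDFS path and column names are hypothetical, not our benchmark code.

# Minimal sketch of XGBoost on Spark; paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier

spark = SparkSession.builder.appName("xgb-on-spark").getOrCreate()

# Load and feature-engineer the data with Spark.
df = spark.read.parquet("hdfs:///data/train.parquet")
assembler = VectorAssembler(
    inputCols=[c for c in df.columns if c != "label"], outputCol="features"
)
train_df = assembler.transform(df)

# One XGBoost training worker runs inside each of the num_workers Spark tasks.
clf = SparkXGBClassifier(
    features_col="features", label_col="label",
    num_workers=8, max_depth=8, n_estimators=100,
)
model = clf.fit(train_df)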

So, Who's Ray?

Ray is an emerging unified framework that provides simple APIs for scaling applications such as machine learning, along with libraries for popular machine-learning frameworks, including XGBoost on Ray for distributed XGBoost training. For our data processing needs, we leveraged another library, RayDP, which provides simple APIs to run Spark on Ray and exchange data with other AI libraries on Ray. With Ray, you can easily build an end-to-end pipeline, from data ingestion to model serving, in a single script.

What is Limiting XGBoost on Spark?

Typically, a Spark application consists of a series of operations on a dataset, which are compiled into tasks that run on executor processes. Users can configure the number of executor processes as well as the number of cores and the amount of memory for each executor. By default, one task takes up one core of an executor.

However, for XGBoost on Spark we also need to configure spark.task.cpus, the number of cores assigned to each task, so that the single XGBoost training task on each executor can leverage multi-threading. Because this setting applies to every task, it also limits the parallelism available in the other stages, such as data loading and DMatrix conversion, which hurts overall end-to-end performance.
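For example, a configuration like the following (illustrative values, not our exact test settings) gives the single XGBoost training task on each executor all 32 of that executor's cores, but it also means only one task per executor can run during data loading and DMatrix conversion:

# Illustrative Spark settings for XGBoost on Spark (example values only).
# With executor cores = spark.task.cpus = 32, each executor runs exactly
# one concurrent task, so data loading parallelism equals the executor count.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("xgb-on-spark")
    .config("spark.executor.instances", "8")
    .config("spark.executor.cores", "32")
    .config("spark.executor.memory", "64g")
    .config("spark.task.cpus", "32")   # sized for XGBoost's training threads
    .getOrCreate()
)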

Picture1.png

Picture2.png

Now We Have a Conundrum

We must find an optimal configuration balancing the tradeoff between insufficient parallelism in the data loading stage and too high parallelism in the training stage. For example, if we have two executors on each node, then only two tasks are loading data in parallel, while we could have as many tasks as there are cores on the node. On the flip side, increasing the number of executors to equal the number of cores will make the training stage less efficient due to high network overhead.

What's the Solution?

Running both Spark and XGBoost on Ray solves this problem. Ray’s flexibility allows the data loading stage and the model training stage to be configured independently. During data loading, RayDP loads data from HDFS and converts the resulting data frame to a Ray Dataset, which is then consumed in the training stage.
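A minimal sketch of this loading stage (with a hypothetical HDFS path and cluster sizing) could look like the following:

# Data loading stage: Spark on Ray via RayDP, then hand off to a Ray Dataset.
import ray
import raydp

ray.init(address="auto")

# Size the Spark-on-Ray cluster for data loading only.
spark = raydp.init_spark(
    app_name="data-loading",
    num_executors=16,
    executor_cores=4,
    executor_memory="8GB",
)

# Load from HDFS and do any feature engineering with Spark as usual.
df = spark.read.parquet("hdfs:///data/train.parquet")

# Convert the Spark DataFrame into a Ray Dataset for the training stage.
train_ds = ray.data.from_spark(df)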

Picture3.png

 

After this, the RayDP (Spark) cluster is shut down to free up its resources, and new worker actors are started to run XGBoost on Ray. The previously loaded data is distributed to these workers based on locality, so the overhead of using different workers in the two stages is minimized.
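Continuing the sketch above (parameter values are illustrative), the hand-off and training stage might look like this:

# Training stage: release the Spark-on-Ray resources, then train with XGBoost on Ray.
import raydp
from xgboost_ray import RayDMatrix, RayParams, train

# Shut down the RayDP (Spark) cluster so its CPUs can be reused by the training actors.
raydp.stop_spark()

# Wrap the Ray Dataset produced by the loading stage; shards are assigned
# to training actors based on data locality.
dtrain = RayDMatrix(train_ds, label="label")

evals_result = {}
booster = train(
    {"objective": "binary:logistic", "eval_metric": "logloss", "tree_method": "hist"},
    dtrain,
    num_boost_round=100,
    evals_result=evals_result,
    ray_params=RayParams(num_actors=4, cpus_per_actor=32),
)
booster.save_model("model.json")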

Picture4.png

Because the Spark executors can now be configured purely for data loading, loading performs well regardless of the number of executors we use. The resulting speedup for different numbers of executors is illustrated below.

Picture5.png

We took this one step further by implementing an optimization for XGBoost on Ray that enables multi-threading when converting data to DMatrix, the internal data format required by XGBoost. In previous versions of XGBoost on Ray, this conversion took much longer to finish, just as it does in XGBoost on Spark.
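The knob behind this is the thread count used when building the DMatrix; the snippet below only illustrates that effect with plain XGBoost and is not the actual patch to XGBoost on Ray:

# Illustration only: XGBoost's DMatrix construction can be parallelized.
import numpy as np
import xgboost as xgb

X = np.random.rand(1_000_000, 50).astype(np.float32)
y = np.random.randint(0, 2, size=X.shape[0])

dtrain_serial = xgb.DMatrix(X, label=y, nthread=1)     # single-threaded conversion
dtrain_parallel = xgb.DMatrix(X, label=y, nthread=32)  # multi-threaded conversion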

 

This chart illustrates the value this optimization creates.

Picture6.png

If you’re using machines with multiple NUMA nodes, it is easy to make Ray NUMA-aware: when starting Ray on each machine, start as many raylets as there are NUMA nodes and bind each one to a NUMA node. By doing so, Ray’s object store is divided among the NUMA nodes, which extends Ray’s locality-based scheduling to NUMA-aware scheduling. Training performance can improve because each actor is now assigned data shards from its own NUMA node.
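A sketch of such a launch script for a worker machine with two NUMA nodes is shown below; the head address and core counts are assumptions, and in practice each raylet may also need its own ports and object-store settings:

# Start one Ray worker "node" (raylet) per NUMA node, pinned with numactl.
import subprocess

HEAD_ADDRESS = "10.0.0.1:6379"   # hypothetical Ray head node
NUM_NUMA_NODES = 2
CORES_PER_NUMA_NODE = 32

for numa_node in range(NUM_NUMA_NODES):
    subprocess.run(
        [
            "numactl",
            f"--cpunodebind={numa_node}",  # pin the raylet's CPUs to this NUMA node
            f"--membind={numa_node}",      # allocate its memory (object store) locally
            "ray", "start",
            f"--address={HEAD_ADDRESS}",
            f"--num-cpus={CORES_PER_NUMA_NODE}",
        ],
        check=True,
    )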

 

We can’t do this on Apache Spark without using third-party libraries. The following chart compares the performance of XGBoost on Ray with NUMA optimization against XGBoost on Spark without NUMA optimization.

Picture7.png

Finally, we calculated the end-to-end time by summing the three stages above and compared the results. Note that we added some overhead time for XGBoost on Ray, because it takes extra time to start the new workers for training and to fetch data shards for each actor. Even so, it significantly outperforms XGBoost on Spark.

Picture8.png

 

Conclusion

The data is clear: RayDP plus XGBoost on Ray demonstrates much better performance than XGBoost on Spark in all configurations. This holds true even without NUMA optimization for Ray.

Picture9.png

 

You may notice that the end-to-end time is longer when we use 64 executors/actors, because network overhead slows down training. For XGBoost on Spark, 32 executors yielded the best end-to-end performance; for XGBoost on Ray, 4 actors did. Looking at training time alone, a smaller number of workers is also better for Spark, but the huge benefit of higher parallelism in the data loading stage outweighs it.

In XGBoost on Spark, data loading dominates and creates a large variance in end-to-end time across configurations, while Ray’s end-to-end time remains stable. This means that getting optimal performance on Spark requires careful configuration tuning, and even then, Ray is still the clear winner.

As a bonus, XGBoost on Ray provides some other advantages as well: model serving with Ray Serve; elastic training, which allows training to continue with fewer actors if some actors fail; and integration with Ray Tune, which makes hyperparameter tuning very easy. For details, please check https://www.anyscale.com/blog/distributed-xgboost-training-with-ray.
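For instance, elastic training is just a few settings on RayParams (values below are illustrative):

# Illustrative RayParams for elastic training with XGBoost on Ray.
from xgboost_ray import RayParams

ray_params = RayParams(
    num_actors=4,
    cpus_per_actor=32,
    elastic_training=True,   # keep training if some actors die,
    max_failed_actors=1,     # tolerating up to this many lost actors
    max_actor_restarts=2,    # and restarting failed actors up to this many times
)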

System Configurations

4 nodes, each configured as follows:

Test Date: Test by Intel as of 06/2023
Manufacturer: Inspur
CPU Model: Intel® Xeon® Platinum 8358 CPU @ 2.60GHz
# of Nodes: 4
CPU per Node: 32 cores/socket, 2 socket, 2 threads/core
Memory: 512GB (16x32GB DDR4 3200 MT/s [3200 MT/s])
Storage: 1x 223.6G INTEL SSDSCKKB24, 6x 931.5G INTEL SSDPE2KX010T8
Network: 2x Ethernet Controller 10G X550T, 2x Ethernet Controller XXV710 for 25GbE SFP28
XGBoost: 1.7.4
Apache Spark: 3.1.3
Apache Hadoop: 3.2.4
Ray: 2.4.0
RayDP: 1.6.0b20230614.dev0
xgboost_ray: 0.1.16

 

Notices & Disclaimers

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Intel disclaims all express and implied warranties, including, without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.