There’s a Better Way to Do Large-Scale Graph Analytics
Benchmarking isn’t my favorite topic, but I do have a passing interest in graph analytics benchmarking. I’ll occasionally dissect benchmarks that I think are inaccurate or misleading.
And I’ll also dissect benchmarks that only tell part of the story. I was half-listening to Jensen Huang’s NVIDIA GTC 2020 Keynote from May 14, 2020 when one of his performance claims caught my attention. At about the 19:30 mark of Part 6, the presentation turns to large-scale graph analytics and claims that a rack of DGX A100 systems can compute PageRank (PR) on a 128-billion-edge web graph at 688 billion edges per second. I don’t have access to a DGX A100 rack, but if we assume for the sake of argument that this is true, is it really necessary to spend $800,000 to compute PR on this graph? I don’t think so, but let’s dissect this benchmark so you can decide.
Some of my Intel and academic colleagues published PR results for a 128-billion-edge web graph: Single machine graph analytics on massive datasets using Intel Optane DC persistent memory. This article was published in April 2020, but it was originally posted on arXiv.org back in April 2019 — a year before the GTC keynote. They achieved 75 billion edges per second using a single 2U server with two 2nd-generation 2.2 GHz Intel Xeon Scalable processors, 384GB DDR4 memory, and 6TB Intel Optane DC persistent memory. Note that the server in this study contained 6TB of Optane memory, but only used 2TB of it: “When we evaluated explicit Huge Pages allocation, we reserved 2TB and 360GB for Optane PMM and DRAM experiments, respectively.”
This was sufficient to load the graph. There isn’t enough source information to confirm that the web crawl graphs are identical, but I’m pretty sure they’re the same or at least very similar.
If we compare the system characteristics for the respective PR computations, it becomes clear that performance alone doesn’t tell the whole story. The table below shows that while the 2U Xeon-based server achieves only 10.9% of the DGX A100 rack performance, it does so at 6.3% of the price, 3.3% of the power, and 8.3% of the space. Put another way, the Xeon-based system has 1.7x higher performance per dollar, 3.3x higher performance per watt, and 1.3x higher performance per rack unit.
| System | Performance (billion edges/second) | Cost | Power | Size (Rack Units) |
|---|---|---|---|---|
| NVIDIA DGX A100 (a) | 688 | $800,000 | 22.4 kW | 24U |
| Intel Xeon + Optane (b)(c) | 75 | ~$50,000 | 750 W | 2U |

a NVIDIA GTC 2020 Keynote (Part 6, near the 19:30 mark). The PR case study used four DGX A100s at roughly $200,000, 5.6kW, and 6U each.
b Gill et al. (2020). Single machine graph analytics on massive datasets using Intel Optane DC persistent memory, Proceedings of the VLDB Endowment, 13(8), 1304-1318.
c A 2U dual-Xeon server with 384GB DDR4, 2TB Optane, and a 750W power supply was sourced from a leading system vendor on June 29, 2020.
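The comparison above is simple arithmetic, and it’s easy to check. Here’s a quick sketch using the figures from the table and footnotes; note that the Xeon server price of ~$50,000 is my assumption, consistent with the 6.3%-of-price figure, not a published number:

```python
# Back-of-the-envelope comparison of the two PageRank systems.
# DGX figures from the GTC keynote (4 systems at ~$200,000 / 5.6kW / 6U each);
# Xeon figures from Gill et al. (2020). The Xeon price (~$50,000) is an
# assumption consistent with the 6.3% figure in the text, not a quoted price.

dgx = {"perf": 688, "cost": 4 * 200_000, "power_w": 4 * 5600, "units": 4 * 6}
xeon = {"perf": 75, "cost": 50_000, "power_w": 750, "units": 2}

# Xeon as a fraction of DGX, per raw metric
perf_share = xeon["perf"] / dgx["perf"]          # ~10.9%
power_share = xeon["power_w"] / dgx["power_w"]   # ~3.3%
space_share = xeon["units"] / dgx["units"]       # ~8.3%

# Normalized efficiency ratios (Xeon relative to DGX)
per_dollar = (xeon["perf"] / xeon["cost"]) / (dgx["perf"] / dgx["cost"])      # ~1.7x
per_watt = (xeon["perf"] / xeon["power_w"]) / (dgx["perf"] / dgx["power_w"])  # ~3.3x
per_unit = (xeon["perf"] / xeon["units"]) / (dgx["perf"] / dgx["units"])      # ~1.3x

print(f"{perf_share:.1%} of perf, {per_dollar:.1f}x perf/$, "
      f"{per_watt:.1f}x perf/W, {per_unit:.1f}x perf/RU")
```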
As if this weren’t enough, PR is just one of many graph analytics algorithms. It’s also easy to parallelize, which is why it was chosen to showcase the DGX A100. Other graph analytics algorithms are not as easy to parallelize efficiently and don’t scale to the level of PR, which is why they weren’t chosen. I doubt that scalable parallel implementations of other graph algorithms are even available for the DGX A100. In contrast, the year-old article cited above reports results on the same 128-billion-edge web graph for PR and many other common algorithms: breadth-first search, betweenness centrality, connected components, k-core decomposition, and single-source shortest paths.
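For readers unfamiliar with the algorithm being benchmarked, here is a minimal, single-threaded power-iteration sketch of what PR computes. This is nothing like the optimized, parallel implementations in either system above, but it shows why PR parallelizes so well: each iteration is an independent sweep over the edges.

```python
from collections import defaultdict

def pagerank(edges, d=0.85, iters=50):
    """Plain power-iteration PageRank over an edge list of (src, dst) pairs."""
    nodes = {n for e in edges for n in e}
    out = defaultdict(list)
    for s, t in edges:
        out[s].append(t)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - d) / n for v in nodes}
        for v in nodes:
            targets = out[v]
            if targets:
                # Spread v's rank along its out-edges.
                share = d * rank[v] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: distribute its rank evenly to all nodes.
                for t in nodes:
                    nxt[t] += d * rank[v] / n
        rank = nxt
    return rank

# Tiny example: a->b, a->c, b->c, c->a. Ranks sum to 1.
ranks = pagerank([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])
```

The per-edge work in the inner loop is why PR throughput is quoted in edges per second, and why the edge sweep maps cleanly onto both GPUs and multicore CPUs.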
If PR performance is all that matters to you, and you have $800,000 to spare, 22.4 kW of power capacity, and plenty of cooling, then by all means use a rack of DGX A100s. Otherwise, a single Xeon-based server with sufficient memory is a much better choice for large-scale, general-purpose graph analytics.