The Odyssey of One-line Optimization Code Landing in Chromium

Wanming_Lin · 12-16-2021

Authors: Wanming Lin, Hong Zheng

Web user experience is critical for cloud service providers. According to Google research, as page load time goes from 1s to 10s, the probability of a mobile site visitor bouncing increases by 123%. Conversely, when mobile site load times decreased by just 0.1s, conversion rates went up by 8.4% for retail and 10.1% for travel.

JavaScript is an event-driven programming language. In particular, the JavaScript timer interface is one of the key factors that affect page load time, since websites often need to trigger a callback after a specific delay to improve user experience. Its "setTimeout()" API helps with this kind of task scheduling by postponing the execution of a function until the specified time has elapsed.
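As a minimal illustration of this scheduling behavior (the variable names are ours and purely illustrative), the callback passed to "setTimeout()" runs only after the currently executing script has finished, even when the requested delay is 0:

```javascript
// Records the order in which the two pieces of code run.
const order = [];

// Ask the runtime to call the function after a 0ms delay.
// The callback is queued as a task; it never runs synchronously.
setTimeout(() => {
  order.push('timeout callback');
  console.log(order.join(' -> ')); // sync code -> timeout callback
}, 0);

// This line runs first, because the current script must finish
// before any timer callback can be dispatched.
order.push('sync code');
```

How long the callback actually waits after the script finishes depends on the browser's clamping rules, which is what the rest of this article is about.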

We are from the Web Optimization team, which focuses on power and performance optimization for the Web platform, especially Chromium, with the aim of delivering the best user experience on IA (Intel Architecture). Chromium is an open-source browser project and the foundation of Google Chrome, Microsoft Edge, and many other browsers. Chromium has a fairly strict code review process. Optimization patches in particular must not only guarantee functional correctness but also bring reasonable power or performance benefits.

This article tells the tortuous story of landing a single line of optimization code in Chromium: removing the 1ms clamping in "setTimeout(..., 0)". The clamping is not web-compatible, and developers always pay a 1ms penalty for it. Landing such a simple change in Chromium, however, was not as easy as we imagined. Let's look at what we experienced along the way.

Optimization Analysis

To optimize JavaScript timer performance and improve user experience, we investigated the implementation in Chromium and found two kinds of time clamping in its "setTimeout()" API:

4ms clamping: Chrome enforces a minimum timeout of 4ms once nested calls to "setTimeout()" have been scheduled 5 times.
1ms clamping: Chrome clamps "setTimeout(..., 0)" up to 1ms outside the above scenario.
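
The two rules can be sketched as a small pure function. This is a simplified model of the description above, not Chromium's actual implementation: the function name, the constants, and the exact boundary of the nesting threshold are our assumptions for illustration.

```javascript
// Simplified model of the two clamping rules described above.
// Illustration only; this is not Chromium's actual code.
const NESTING_THRESHOLD = 5;   // assumed from "scheduled 5 times"
const NESTED_MIN_MS = 4;       // 4ms clamping for deeply nested timers
const LEGACY_ZERO_MIN_MS = 1;  // 1ms clamping applied to setTimeout(..., 0)

/**
 * @param {number} requestedMs   delay passed to setTimeout()
 * @param {number} nestingLevel  how many nested setTimeout() calls led here
 * @param {boolean} legacyClamp  true models the old 1ms clamping
 * @returns {number} the delay the timer would effectively get in this model
 */
function effectiveDelay(requestedMs, nestingLevel, legacyClamp) {
  const requested = Math.max(0, requestedMs);
  if (nestingLevel >= NESTING_THRESHOLD) {
    return Math.max(requested, NESTED_MIN_MS);
  }
  if (legacyClamp) {
    return Math.max(requested, LEGACY_ZERO_MIN_MS);
  }
  return requested;
}

console.log(effectiveDelay(0, 0, true));   // 1
console.log(effectiveDelay(0, 0, false));  // 0
console.log(effectiveDelay(0, 5, false));  // 4
console.log(effectiveDelay(10, 5, true));  // 10
```

In this model, the optimization discussed in this article corresponds to passing false for legacyClamp; the 4ms rule is left untouched.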

We used Speedometer2 to measure the impact of these two kinds of clamping. It is a popular industry browser benchmark that measures the responsiveness of Web applications, and it is also an important KPI on Intel platforms. We found that disabling the 4ms clamping improves the overall Speedometer2 score by ~4% on Win 10. However, this clamping is specified by the HTML5 spec and has been consistent across browsers released in 2010 and onward, so we can do nothing about it at the current stage. The 1ms clamping, on the other hand, isn't specified by any spec, and removing it improves the overall Speedometer2 score by ~1.5% on Win 10. Furthermore, the fix is so "easy": it only requires adding one line of code.

Challenges

When we tried to land the optimization, Chromium upstream told us that the 1ms clamping is a legacy issue. Historically, browsers clamped "setTimeout(..., 0)" up to a few milliseconds to prevent excessive CPU usage, because many developers used it as a way to run an async function as soon as possible. Since 2014, Google has attempted to remove this 1ms clamping several times (see 402694 - setTimeout(,0) rounds to 1), but every attempt failed for 3 main reasons:

Removing the 1ms clamping causes lots of flaky and failing tests in Chromium.
Even if all these failing tests were fixed, the CL removing the 1ms clamping might still be reverted because of additional, unexpected test failures as well as performance regressions reported by Pinpoint.
Google might still worry about how this change would impact real-world websites.

Given these challenges, how could we convince Google to consider this optimization again? We had two justifications: removing the clamping benefits performance, and the clamping itself does not conform to any standard. So we took on the challenge.

Fixing flaky/failing tests

Before removing the 1ms clamping, we first had to fix all those flaky/failing tests.

A lot of existing content uses "setTimeout(..., 0)" incorrectly and misbehaves once the clamping is removed. The main reason is that such code relies on "setTimeout(..., 0)" to run async functions, and without the clamp those functions now execute too "early", which causes race conditions. This can be fixed pragmatically: use "setTimeout(..., 1)" instead of "setTimeout(..., 0)", since that is effectively what the code was already doing. The reasons for the small number of remaining flaky/failing tests vary, and we had to spend considerable effort diagnosing them one by one.
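
The race and the pragmatic fix can be sketched with a tiny virtual-time model. Everything here is hypothetical and for illustration only: runScenario, the 0.5ms completion time of the "other work", and the task queue are our assumptions, not code from any Chromium test.

```javascript
// Tiny virtual-time model: tasks run in order of their due time,
// and ties keep their scheduling order. Not a real event loop.
function runScenario(checkDelayMs, legacyClamp) {
  const queue = [];
  let seq = 0;
  const schedule = (delayMs, fn) => {
    queue.push({ due: delayMs, seq: seq++, fn });
  };

  // Models setTimeout(): with the legacy clamp, every delay is at least 1ms.
  const setTimeoutModel = (fn, delayMs) =>
    schedule(legacyClamp ? Math.max(delayMs, 1) : delayMs, fn);

  let ready = false;
  let observed = null;

  // The code under test silently assumes this work has finished before
  // its check runs. Here it completes 0.5ms after the script starts.
  schedule(0.5, () => { ready = true; });

  // The check that was originally written as setTimeout(check, 0).
  setTimeoutModel(() => { observed = ready; }, checkDelayMs);

  queue.sort((a, b) => a.due - b.due || a.seq - b.seq);
  for (const task of queue) task.fn();
  return observed;
}

console.log(runScenario(0, true));   // true:  old behavior, check ran at 1ms
console.log(runScenario(0, false));  // false: clamp removed, check ran too "early"
console.log(runScenario(1, false));  // true:  explicit 1ms restores the old timing
```

Switching to an explicit 1ms keeps the old timing without depending on the clamp; it does not remove the underlying timing assumption in the code.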

In total, we captured hundreds of flaky/failing tests from Chromium's trybots. They covered 3 test types (Browser Tests, Web Tests, and gtest) and were spread across various test binaries. The tests came from different components and failed on various platforms, which made local compilation and debugging more difficult and time-consuming.

Fortunately, things went smoothly: we fixed all the flaky/failing tests and landed the fixes in Chromium with great support from Google. It seemed we could finally land the optimization.

Launching features process

However, when we made our first attempt to remove the 1ms clamping, reviewers from Google raised concerns about how this change might affect web developers, given that it had already affected so many Chromium tests. We needed to do more testing and inform web developers of the change through Chromium's Launch features process, using the feature type "Web developer facing change to existing code." This type generally results in a public service announcement (PSA): "This is a web-developer-facing change to existing code without API changes, but you may see side effects."

We followed the process and created a ticket on ChromeStatus, which is used for tracking feature status. When we proceeded to the "Prepare to Ship" stage, it generated a "Web-Facing Change PSA" mail to the blink-dev community, containing a summary of the code change and the expected milestone, to seek feedback from web developers and API owners. During this period, people expressed worries that the change seemed pretty high-risk. They therefore required an intent-to-ship thread to invite broader review of feature compatibility, risk of change, design issues, and so on.

The discussion then moved to the intent-to-ship thread, where we received more feedback from the community. At the beginning, some reviewers were concerned that the change would bring risks: it might change task ordering and affect existing websites that abuse "setTimeout(..., 0)". They were particularly concerned about the motivation for this optimization and its benefits, as well as compatibility with other browsers.

We put a lot of effort into collecting persuasive information to convince them to accept this change: for example, it brings performance benefits, and it is web-compatible since there is no 1ms clamping in Firefox. Finally, they agreed to flag this on Chrome Beta for one release. That meant landing our change in Chrome Beta to collect user feedback and bug reports during that period, then reverting it before Chrome Stable. If no critical bug or performance regression showed up during the testing, they would accept our optimization.

Testing on Chrome Beta

Then another problem arose. The first CL removing the 1ms clamping was reverted because additional, unexpected tests failed on multiple builders after landing. Chromium runs regular testing with a larger scope than the trybots we see on a CL. We therefore had to fix these additional failures. After several rounds of reverting and relanding, we finally cleared all the noisy flaky tests and landed the change in Chrome 91 Beta.

During testing on Chrome 91 Beta, we received some bug reports, but only two were real issues. One was a Pinpoint regression. Pinpoint is a performance test bot in Chromium that runs regular performance tests and reports regressions automatically. The other, filed by end users, was a page scrolling regression on Wikipedia pages. We took time to investigate both and eventually identified the first as a bug in the test itself and the second as a Chrome bug. We reported our findings to the Chrome team by filing bugs, and they confirmed them. So, could we start celebrating the final ship?

Further discussions

When we delivered our test report to the intent-to-ship thread, the feedback was not very positive. Reviewers agreed that sites shouldn't depend on these relative timings and that doing so is technically a site bug. But if such dependence is widespread enough, it is still a large enough problem to block shipping this change in behavior. From a process point of view, they suggested putting the intent on hold until we thought it was ready to try again with stronger justification.

Just as the process reached an impasse, an engineer from the Google V8 team brought a ray of hope with his finding that, according to Pinpoint, removing the 1ms clamping brought a 5-6% improvement on M1 and 3% on Win 10. Although we were skeptical about this data, at least our optimization attracted more attention and support. At the same time, we pushed forward the fixes for the bugs we had reported, and the community then agreed to experiment on Chrome Canary/Dev/Beta again.

Testing on Chrome Beta again

A new cycle began. We ran a new round of testing on Chrome 95 Beta, and fortunately this time we didn't find any outstanding bugs. Although the community still worried about the potential risk of this change, they agreed to move things forward with a Finch trial to control the risk and increase confidence.

Finch trial

Chrome rolls out new features dynamically using the Finch trial platform, a set of server-controlled flags that allow Google to change Chrome's behavior without shipping a new version. For this optimization, that means putting the change behind a feature flag and turning it on via Finch for the Canary, Dev, and Beta channels. The trial then runs for several releases, gathering additional feedback on bugs to build confidence that the change is safe to ship.
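
The idea of gating the change behind a server-controlled flag can be sketched as follows. This is a conceptual model only: the flag name RemoveOneMsClamp, the shape of trialConfig, and both helper functions are hypothetical and do not reflect Chromium's real feature-flag API or Finch's configuration format.

```javascript
// Hypothetical stand-in for a server-delivered trial configuration.
// Neither the flag name nor this structure is Chromium's real format.
const trialConfig = {
  RemoveOneMsClamp: { enabledChannels: ['canary', 'dev', 'beta'] },
};

// Returns true when the named feature is switched on for the given channel.
function isFeatureEnabled(config, featureName, channel) {
  const feature = config[featureName];
  return Boolean(feature) && feature.enabledChannels.includes(channel);
}

// The optimization stays behind the flag: in this model the minimum delay
// for setTimeout(..., 0) only drops from 1ms to 0ms when the flag is on.
function minimumTimeoutMs(config, channel) {
  return isFeatureEnabled(config, 'RemoveOneMsClamp', channel) ? 0 : 1;
}

console.log(minimumTimeoutMs(trialConfig, 'beta'));    // 0
console.log(minimumTimeoutMs(trialConfig, 'stable'));  // 1
```

Because the configuration lives on the server in this model, removing a channel from enabledChannels would restore the old behavior for that channel without a new release.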

At the time of writing, the Finch trial on Chrome 97 Canary/Dev/Beta was running smoothly.

To be continued

We have been working on landing this one-line optimization for more than a year. In total, we have landed 24 CLs and fixed over one hundred flaky/failing tests. We greatly appreciate the warm-hearted Google engineers for their great support throughout our journey. In short, optimization is really difficult but fully rewarding, so keep up the great work!