L20n/Firefox/Performance

Summary

At the moment we see a number of talos regressions compared to mozilla-central on treeherder:

  • sessionrestore - ~5.1% hit (44ms)
  • sessionrestore(e10s) - ~6.3% hit (47ms)
  • tpaint - ~14% hit (40ms)
  • tpaint(e10s) - ~13% hit (36ms)
  • ts_paint - ~1.5% hit (14ms)
  • ts_paint(e10s) - ~1.7% hit (15ms)

The breakdown of tpaint when run to a given stage, as measured on treeherder (window-e10s), is as follows:

An alternative, experimental view of the same data is also available. The runs are:

  • mozilla-central
  • m-c, plus the l20n.js script, but no execution
  • and then two variants of the same runs, one without and one with caching the Localization object:
    • create/get the Localization object, MessageContext, initial IO if necessary
    • the above and get the list of translatable elements
    • the above and inserting the localized messages into the DOM


Current (slowed-down) profile from Cleopatra.

Here's the analysis of a (full-speed) time allocation:

  • 5ms parsing
  • 3ms getting resource links
  • 3ms creating Localization object, create permutation of Bundles, starting IO
  • 5ms getTranslatables (querySelectorAll cost)
  • 5ms apply translations, inserting into DOM
  • 4ms formatting to strings
  • 2ms create chrome resource bundle
  • 6ms .... connecting the dots

Potential performance improvements

Based on the experiments we did, there are a couple of shortcomings in the platform. When one of those shortcomings is optimized, it could open up further optimizations in the l20n.js code.

Optimize DOM pathways we use extensively

smaug is profiling our code trying to find ways to reduce the cost of Element reflections and applying translations. https://searchfox.org/mozilla-central/rev/d03ad8843e3bf2e856126bc53b0475c595e5183b/dom/base/Element.cpp#545-548 is a good starting point.

Stylo

Stylo may significantly reduce the cost of Node reflection, which is currently, according to :smaug, slowed down by CSS checking for XBL bindings.

JIT optimizations

There are two parts of the l20n bootstrap that are JS-heavy: FTL parsing and Entry formatting. bholley would like to add JIT code paths for the patterns that are dominant in our JS code.

Caching

We're currently only caching I/O. Once l20n.js timing is no longer dominated by DOM/CSS, we should consider the impact of caching

  • MessageContext or Localization objects
  • formatted entries

Caching formatted entries would be significantly more engineering work than caching MessageContext or Localization objects.
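
As a minimal sketch of what caching Localization objects could look like (the cache key scheme and the createLocalization() factory are assumptions for illustration, not the actual l20n.js API):

 // Memoize Localization objects by the list of resources they cover.
 const localizationCache = new Map();

 function getCachedLocalization(resourceIds, createLocalization) {
   const key = resourceIds.join('|');
   if (!localizationCache.has(key)) {
     localizationCache.set(key, createLocalization(resourceIds));
   }
   return localizationCache.get(key);
 }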

Move DOM operations earlier

Currently, we get the best tpaint times by executing DOM operations once document.readyState === 'complete'. That means we wait a long time between when the DOM is complete and when we operate on it, potentially causing reflows. Unfortunately, attempting to start DOM operations between the end of parsing and the start of layout results in worse tpaint times. The likely cause, according to :smaug, is that operating on the DOM before layout may require costly CSS lookups for XBL bindings. It's possible that Stylo will solve some of that.
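
For reference, a minimal sketch of the readyState-based trigger described above (translateDocument is a placeholder for the actual entry point):

 // Defer DOM operations until layout has finished.
 function onDocumentComplete(callback) {
   if (document.readyState === 'complete') {
     callback();
   } else {
     document.addEventListener('readystatechange', function listener() {
       if (document.readyState === 'complete') {
         document.removeEventListener('readystatechange', listener);
         callback();
       }
     });
   }
 }

 onDocumentComplete(translateDocument); // placeholder entry point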

Another possibility we're experimenting with is triggering localization after layout but before reflow.

Once the DOM/CSS interaction is fast, we can investigate the potential impact.

Research Avenues

Caching parsed FTLs

One idea we have is to switch from caching the strings that IO returns to caching whole parsed Entries objects. This should shave the entire parser cost off the tpaint test results. When tested, Zibi was able to confirm the hypothesis: tpaint without e10s went down from 13% to 10%, and with e10s from 8% to ~6.5%. We decided not to land this yet, because the optimization mainly helps with the scenario that tpaint tests (opening a new window 20 times), while what we want to fix is the regression in opening a new window the first time (ts_paint), and we're concerned that too much caching may impact our ability to understand where we pay the cost. Our current plan is to keep the patch and land it once ts_paint is no longer regressing and only tpaint is.
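
A sketch of the idea, assuming hypothetical fetchResource() and parseEntries() stand-ins for the actual IO and parser entry points:

 // Cache parsed entries instead of source strings, so the parser cost
 // is paid only on the first load of each resource.
 const parsedCache = new Map();

 async function loadParsedResource(url) {
   if (!parsedCache.has(url)) {
     const source = await fetchResource(url); // previously this string was cached
     parsedCache.set(url, parseEntries(source)); // now we cache parsed entries
   }
   return parsedCache.get(url);
 }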

Promises are fast if they're few

Staś talked to Till about Promises. Till said Promises are generally fast unless 1) you create a lot of them and 2) javascript.options.asyncstack is true (which it is on Nightly). In our case we're creating under 10 Promises total, so there shouldn't be any perf impact. Till advised against the formatEntity scenario (where each translation has its own Promise), warning about the perf hit of creating a lot of Promises. This is in line with what we saw in our testing.
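
The difference between the two scenarios, sketched with an illustrative ctx.format() call (not the exact API):

 // formatEntities: a single Promise for the whole batch of translations.
 function formatEntities(ctx, ids) {
   return Promise.resolve(ids.map(id => ctx.format(id)));
 }

 // formatEntity: one Promise per translation - the pattern Till advised against.
 function formatEntity(ctx, id) {
   return Promise.resolve(ctx.format(id));
 }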

javascript.options.asyncstack is currently only true on Nightly, but Till wants it off everywhere. If I can prove that it affects the performance of the formatEntity scenario, that would help Till's cause.

I showed him the profile and he noticed that our code actually uses Promise.jsm and not his implementation of the DOM Promise. It's likely that another script in browser.xul imports Promise.jsm and overrides the DOM Promise.

NodeList iteration costs 4ms

In the numbers above we see that the iteration over the NodeList returned by qSA takes 4ms. We also know from the previous tests below that subsequent iterations take less than 1ms each. I wanted to test whether the 4ms is just the normal cost of the first iteration and whether the result is then cached.

(a) Array.from(document.querySelectorAll('[data-l10n-id]'));
(b) Array.from(document.querySelectorAll('[data-l10n-id]'));

This iterates over 114 nodes with data-l10n-id. (a) takes 4.5ms and (b) takes 0.25ms. I then used setTimeout to run the same experiment 10 seconds after the launch of the browser. The numbers were the same. I then tried 'touching' part of those 114 nodes with another qSA. 70 of these nodes have a command attribute, so I ran the following:

(a) Array.from(document.querySelectorAll('[data-l10n-id][command]'));
(b) Array.from(document.querySelectorAll('[data-l10n-id]'));

Now (a) takes 3.5ms and (b) takes 2ms! It looks like the first NodeList is cached too:

(a) Array.from(document.querySelectorAll('[data-l10n-id][command]'));
(b) Array.from(document.querySelectorAll('[data-l10n-id][command]'));
(c) Array.from(document.querySelectorAll('[data-l10n-id]'));

The results: (a): 2.05 ms, (b): 0.18 ms, (c): 2.40 ms. Finally I wanted to verify that the cost is indeed in the iteration and not in the querySelectorAll itself or in the creation of the iterator:

(a) for (const node of document.querySelectorAll('[data-l10n-id]')) break;
(b) Array.from(document.querySelectorAll('[data-l10n-id]'));

The results: (a): 0.26 ms, (b): 6.67 ms. It's the iteration that takes almost all of the time.
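
The timings above can be reproduced with a small performance.now() harness along these lines (a sketch, not the exact test code):

 function time(label, fn) {
   const start = performance.now();
   fn();
   console.log(label, (performance.now() - start).toFixed(2), 'ms');
 }

 time('(a)', () => Array.from(document.querySelectorAll('[data-l10n-id]')));
 time('(b)', () => Array.from(document.querySelectorAll('[data-l10n-id]')));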

Using MutationObserver instead of qSA at startup

wchen suggested that instead of using qSA to capture links and localizable nodes, we should use a MutationObserver. I shared our testing results, which indicate that MutationObserver is slow, and we discussed potential reasons and what we can do to speed it up.

The conclusion is a three-tier approach:

  • Filter out non-data-l10n-id elements on C++ side so that they never fire onMutations
  • Gather all data-l10n-id elements fed from the HTML/XUL parser and deliver a single MutationRecord for all of them just before layout
  • Dive into childNodes of MutationRecords and only report to us the ones that have data-l10n-id so that we don't have to querySelector into childNode DOMFragments

wchen said he'll give us a POC patch for (1), and he hopes it'll fix the majority of the performance hit we see. If that doesn't solve it, he'll help us profile the DOM operations further, and we can then consider adding (2) and (3), though those may take a bit more time to hack.

Item (3) ended up not being needed: the way MutationObserver works in XUL, it always returns single elements, not DOMFragments. I tested (1) with :wchen's patch and it did not improve performance. Further analysis (see below) indicates that neither would (2).
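
For reference, a minimal sketch of the observer setup that tier (1) aims to optimize; the web API can filter attribute mutations with attributeFilter, but added nodes still have to be filtered in JS, which is what the C++-side filtering was meant to avoid (translateElement is a placeholder):

 const observer = new MutationObserver(records => {
   for (const record of records) {
     for (const node of record.addedNodes) {
       // Skip text nodes and elements without the attribute.
       if (node.nodeType === Node.ELEMENT_NODE &&
           node.hasAttribute('data-l10n-id')) {
         translateElement(node); // placeholder entry point
       }
     }
   }
 });

 observer.observe(document.documentElement, {
   childList: true,
   subtree: true,
   attributes: true,
   attributeFilter: ['data-l10n-id'],
 });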

DOM optimizations - end of path

After testing the previous approach and some other ideas, :wchen profiled our work and came to the conclusion that there's basically nothing more we can do to reduce the performance penalty of querying the DOM or applying translations to it. Quoting :wchen: "I profiled the test case and it looks like the main reason for slowness is in creating a js reflector of elements with XBL bindings. We need to install the XBL binding on each element when we create the js object (your test cases is doing this when Array.from touches the elements). I can't think of any optimizations to work around it, you're going to have to pay that cost somewhere if you want to touch the object in js because we set up the js prototype during binding installation." If we want to touch elements without creating the js object, we'd have to, quoting :wchen, "do it in c++ :(".

JIT optimizations - end of path

I worked with :shu to profile our code with the JIT Coach, and it seems that our code JITs pretty well. There are a few things we could optimize further by asking the JIT team to inline certain functions we use a lot (String.prototype.slice, String.prototype.indexOf, etc.), but it doesn't seem like that would have any significant impact on our perf. One other interesting aspect is that in several places our code runs enough times to get JITted, but not enough times to benefit from the JITting, so we end up with the short end of the stick; nothing we can do about it. My current position is that I'll file those bugs and we'll get a bit more inlining, but I don't expect a significant enough return on investment to move the talos numbers this way.

Previous Research

At the end of Sprint 5 we merged mozilla-central into larch and re-ran the Talos tests. The results show the following tests regressing in performance:

  • sessionrestore - ~8% hit
  • tpaint - ~20% hit
  • ts_paint - ~8% hit

Hypothesis: IO Is Slow

Are we IO-bound? We start fetching resources almost 60ms after l20n.js starts; we could start earlier. We also seem to be blocked on the IO front: executed at the point where we currently run it, the results come back around 150ms, while under no load the IO takes only 3-4ms. It may be because we're async or within Promises. Let's try to make it sync.

Testing Strategies
  • hardcode ftl strings in l20n-chrome-xul
  • execute a sync IO call for all resources
  • wrap it in a Promise but preserve sync (both sketched after this list)
  • try to execute it at different positions
  • see why NetUtil is slower than XHR
  • ask for API extension to get a better priority for l10n files in the ZipReader queue
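
A sketch of the sync IO variant from the list above (the resource URL and the Promise wrapping are illustrative):

 function loadSync(url) {
   const xhr = new XMLHttpRequest();
   xhr.open('GET', url, false); // third argument false makes the request synchronous
   xhr.send();
   return xhr.responseText;
 }

 // Wrap in a Promise to keep the async API shape while the IO itself stays sync.
 const source = Promise.resolve(loadSync(resourceUrl));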

It doesn't really matter when exactly the IO starts. In (A) all of L20n's initialization is run after the documentReady() promise resolves. In (B) the IO starts as soon as possible (as soon as we can query for <link rel="localization">).

 (A) IO starts: 100 ms, IO ends: 140 ms,
 (B) IO starts:  40 ms, IO ends: 140 ms.

In (B) we still need to guard the initial translation of the document with documentReady(). The reason for this is that we can't rely on the unguarded MutationObserver in XUL documents: bug 1294197.
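
The (B) setup, sketched with placeholder fetchResource(), documentReady() and translateDocument() helpers:

 // Kick off IO as soon as the <link rel="localization"> elements are queryable...
 const links = document.querySelectorAll('link[rel="localization"]');
 const io = Promise.all(
   Array.from(links, link => fetchResource(link.getAttribute('href')))
 );

 // ...but guard the initial translation until the document is ready (bug 1294197).
 Promise.all([io, documentReady()]).then(translateDocument);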

Hypothesis: NodeList Iteration Is Very Slow

Are we DOM-bound? Iterating over the [data-l10n-id] NodeList is slow: taking 25-30 ms on a 114-element-long NodeList (on 3rd+ launch). It doesn't matter if the iteration is done via a regular for loop, for…of or Array.from. See getTranslatables.

Testing Strategies
  • Use getElementsByClassName
  • Iterate twice, second time right after the first one

I tested the performance of querySelectorAll during startup:

First time
  • querySelectorAll: 0.09 ms
  • NodeList.length: 0.06 ms
  • Array.from: 26.61 ms
Second time
  • querySelectorAll: 0.12 ms
  • NodeList.length: 0.07 ms
  • Array.from: 0.22 ms

It doesn't matter if we're using querySelectorAll or getElementsByClassName (with a special class added for this purpose). Hypothesis: we're the first code that requests these DOM nodes, and that first access is slow. Subsequent iterations are very fast.

After fixing bug 1298974 the NodeList iteration isn't a bottleneck anymore. It turned out that XUL documents have different readystates than HTML, and l20n's logic was being run too early, when the DOM wasn't ready yet. With the fix, the first iteration takes less than 5 ms. This is still much more than subsequent calls to querySelectorAll(); see Lazy-to-actual XULElements conversion for more discussion.

Hypothesis: Formatting Translations Is Slow

Are we CPU-bound? formatWithFallback is taking 20-25 ms. It seems that much of that is being spent handling errors (concatenating them with previously reported ones).

Testing Strategies
  • Hardcode formatted translations in l20n-chrome-xul
  • Throw errors and collect them in a try…catch.

I found four major bottlenecks. I filed bug 1296618 to fix the first three.

  • destructuring assignment is slow in hotter loops (sketched after this list),
  • formatting traits bails out of the simple-string optimization because the entire trait object is passed to format() instead of its val,
  • formatting entities with null values and no default traits is slow because we check each trait for being the default one deep in the generator-enabled code of MessageContext; we could make this check much earlier,
  • returning an empty errors array from each format() call: bug 1301198.
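
The destructuring item, sketched (entries and process are placeholders; this is the flavor of rewrite bug 1296618 covers, not its actual patch):

 // Destructuring in a hot loop:
 for (const [key, value] of entries) {
   process(key, value);
 }

 // The same loop with index access, avoiding the per-iteration destructuring:
 for (let i = 0; i < entries.length; i++) {
   process(entries[i][0], entries[i][1]);
 }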

Hypothesis: Async is slow

Do we yield to other code in the event loop queue when we call then()? Unlikely. In the scenario where we trigger document translation on DOMContentLoaded, the IO has finished much earlier and the l10n.interactive promise is resolved. then() callbacks are executed in microtasks on the same turn of the event loop without yielding to other code in the queue.
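
A quick illustration of the microtask ordering this relies on:

 setTimeout(() => console.log('task: next turn of the event loop'), 0);
 Promise.resolve().then(() => console.log('microtask: same turn'));
 console.log('sync');
 // Logs: sync, then the microtask, then the task.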

Testing Strategies
  • Promise.resolve().then(() => performance.mark('…')) instead of translateRoot()

Hypothesis: Lazy-to-actual XULElements conversion

Are we working against XUL layout optimizations? XUL elements are first created as XULPrototypeElements, which carry just enough information to be useful for layout. When they're accessed from JS code they are converted to actual XULElements. The hypothesis is that there are at least two things when creating a new browser.xul window that trigger this conversion before tpaint: our code and something else, let's call it X. The first piece of code to make the call pays the lazy-to-actual tax. If it's us (because we start very early, e.g. when we trigger on MozBeforeLayout), we take a lot of time (see NodeList Iteration) but still finish pretty early, then X runs faster, and tpaint happens in 270 ms (on stas's machine). If it's not us (e.g. when we trigger on DOMContentLoaded), then X takes more time; we run later but finish faster (we take 20 ms to translate everything), and... tpaint happens in 270 ms. For tpaint it only matters that the tax has been paid. Below are the rough numbers in milliseconds:

 Trigger                 l20n-start  io-start  io-end  trigger/translateRoot-start  formatting-start (async)  formatting-end  translateRoot-end  tpaint
 ------------------------------------------------------------------------------------------------------------------------------------------------------
 DOMContentLoaded                 8        10      35                          160                       165             175                180     270
 MozBeforeLayout                  8        10      35                           40                        65             135                140     270

Hypothesis: DOM translation is slow

This is similar to Formatting Translations, but larger in scope. Are we CPU-bound during document translation? translateRoot from start to finish takes 20 ms on stas's machine, and it's very consistent (in the DOMContentLoaded scenario). This lines up with the following tpaint numbers:

                       DOMContentLoaded  translateRoot-end  tpaint
 -----------------------------------------------------------------
 central                            160                        250
 larch                              160                180     270

I drilled down on these 20 ms in Cleopatra:

  1. node iteration: 5ms,
  2. juggling nodes into a map keyed by Localization: 3ms,
  3. formatting: 8ms,
  4. applyTranslations: 4ms.

Testing Strategies
  • replace formatEntities with formatEntity; H4 shows async is not a problem and we could shave off a few milliseconds from #2 above,
    • passing an array of many promises to Promise.all turns out to be costly: 5-7 ms.
    • if we'd want to apply translations individually, we'd need to disconnect/reconnect the observer for each node and that seems to cost ~3ms.
    • so even with groupElementsByLocalization removed, we end up with +10ms.
  • replace waiting for a trigger (DOMContentLoaded, MozBeforeLayout) with a MutationObserver; if we always take 20ms maybe we can take it earlier; if it's too early however we might run into the problem described in H5.
    • because of how XUL documents are created the MutationObserver creates a lot of MutationRecords (1000+), one for each node; see bug 1294197; the perf hit is huge and hard to measure but probably around 200ms.
  • try microoptimizations:
    • errors.push(...moreErrors) might be expensive,
    • for-of loops in hot paths might be expensive,
    • detect null values in the parser.
    • I tested all three together and was able to get down from 20ms to 16ms on my machine. Try also shows small perf wins, but they're too small to reason about reliably (-8ms on tpaint looks pretty accurate though, assuming try has slower machines).
  • if node iteration is fast (H2) maybe we could make one qSA for each known Localization rather than make just one and then group nodes by l10n?

Hypothesis: Returning errors from MessageContext is slow

I noticed on Cleopatra that entitiesFromContext takes ~6ms but only spends 1ms in format. Are we paying for transforming the data structure we get from the parser into a { value, attrs } object that the XUL bindings consume? It looks like this is MessageContext.format's fault after all: creating an empty errors array costs us 2-3 ms. If I instead pass an array in as an argument to append errors to, the whole thing takes 10-11 ms (13-14 ms before). I filed bug 1301198.
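
The change from bug 1301198, sketched (the exact format() signature is illustrative): instead of allocating a fresh errors array inside every format() call, the caller owns a single array that errors are appended to.

 // Before: each call allocates and returns a new errors array.
 // const [value, errors] = ctx.format(message, args);

 // After (sketch): one caller-owned array, appended to in place.
 const errors = [];
 for (const message of messages) {
   const value = ctx.format(message, args, errors);
   applyTranslation(value); // placeholder
 }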

Hypothesis: I/O cache impacts tpaint

Because tpaint launches 20 windows, 19 of them use cached I/O. We disabled the cache to see how it impacts tpaint results. The results indicate that the I/O cache has no impact on tpaint - disabling it doesn't change the results.

Hypothesis: Caching parsed entries instead of strings will impact tpaint

On larch we cache the strings we load (see Hypothesis: I/O cache impacts tpaint), but we parse the string every time. Our measurements indicate that parsing costs around 5ms, but we also have to copy a large string from the L10nRegistry JSM to the bindings and back to the IntlMessageContext JSM, where we parse it, and then carry the result back to the bindings.

The results indicate that moving the parsing into L10nRegistry and caching the parsed result shaves off ~9ms on tpaint and ~6ms on tpaint(e10s), which suggests we shave off more than just parsing time. The trade-offs: we store more in memory (JSON objects instead of strings), and the win mostly affects the tpaint test (opening the same window 20 times) rather than real-life user experience (launching the UI once).