L20n/Firefox/Performance

Summary

At the Reykjavik workweek, we drilled down on the following numbers on gandalf's laptop:

  • 5ms parsing
  • 4ms getting resource links
  • 4ms creating the Localization object, creating the permutation of Bundles, starting IO
  • 4ms getTranslatables (querySelectorAll cost)
  • 4ms apply translations, inserting into DOM
  • 3ms formatting to strings
  • 2ms create chrome resource bundle
  • 6ms connecting the dots

At the end of Sprint 5 we merged mozilla-central into larch and re-ran Talos tests. The results show the following tests regressing in performance:

  • sessionrestore - ~8% hit
  • tpaint - ~20% hit
  • ts_paint - ~8% hit

Hypothesis 1: IO Is Slow

Are we IO-bound? We start fetching resources almost 60ms after l20n.js starts; we could start earlier. We also seem to be blocked on the IO front: the results come back around 150ms when the IO is executed at the point where we currently do it, while under no load the same IO takes only 3-4ms. It may be because we're async or inside Promises. Let's try to make it sync.

Testing Strategies
  • hardcode ftl strings in l20n-chrome-xul
  • execute a sync IO call for all resources
  • wrap it in a Promise but preserve sync (see the sketch after this list)
  • try to execute it at different positions
  • see why NetUtil is slower than XHR
  • ask for API extension to get a better priority for l10n files in the ZipReader queue
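
A minimal sketch of the "wrap it in a Promise but preserve sync" strategy, assuming a synchronous XHR is usable from the chrome context; loadResourceSync and fetchResource are hypothetical names, not actual l20n.js functions:

 // Hypothetical sketch: synchronous XHR wrapped in a Promise, so callers
 // keep their async interface while the IO itself no longer yields.
 function loadResourceSync(url) {
   const xhr = new XMLHttpRequest();
   xhr.open('GET', url, false);  // third argument false = synchronous
   xhr.send(null);
   // chrome:// and file:// loads report status 0 on success
   if (xhr.status !== 200 && xhr.status !== 0) {
     throw new Error('Failed to load ' + url);
   }
   return xhr.responseText;
 }

 function fetchResource(url) {
   // The executor runs synchronously; a throw becomes a rejection.
   return new Promise(resolve => resolve(loadResourceSync(url)));
 }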

It doesn't really matter when exactly the IO starts. In (A) all of L20n's initialization is run after the documentReady() promise resolves. In (B) the IO starts as soon as possible (as soon as we can query for <link rel="localization">).

 (A) IO starts: 100 ms, IO ends: 140 ms,
 (B) IO starts:  40 ms, IO ends: 140 ms.

In (B) we still need to guard the initial translation of the document with documentReady(). The reason for this is that we can't rely on the unguarded MutationObserver in XUL documents: bug 1294197.
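
One possible shape of such a documentReady() guard (a sketch, not necessarily the actual l20n.js implementation); note that, as Hypothesis 2 below discovered, XUL documents report different ready states than HTML:

 // Sketch: resolve once the document has left the 'loading' state.
 function documentReady() {
   if (document.readyState !== 'loading') {
     return Promise.resolve();
   }
   return new Promise(resolve => {
     document.addEventListener('readystatechange', function onChange() {
       document.removeEventListener('readystatechange', onChange);
       resolve();
     });
   });
 }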

Hypothesis 2: NodeList Iteration Is Slow

Are we DOM-bound? Iterating over the [data-l10n-id] NodeList is slow, taking 25-30 ms on a 114-element NodeList (on the 3rd+ launch). It doesn't matter if the iteration is done via a regular for loop, for…of or Array.from. See getTranslatables, sketched below.
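
For reference, getTranslatables is roughly shaped like this (a sketch reconstructed from the description above, not the verbatim source):

 // Sketch: collect all translatable elements under a root. The cost was
 // dominated by the first materialization of the NodeList, not by the
 // choice of loop construct.
 function getTranslatables(root) {
   const nodes = Array.from(root.querySelectorAll('[data-l10n-id]'));
   if (typeof root.hasAttribute === 'function' &&
       root.hasAttribute('data-l10n-id')) {
     nodes.push(root);
   }
   return nodes;
 }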

Testing Strategies
  • Use getElementsByClassName
  • Iterate twice, second time right after the first one

I tested the performance of querySelectorAll during startup:

First time
  • querySelectorAll: 0.09 ms
  • NodeList.length: 0.06 ms
  • Array.from: 26.61 ms
Second time
  • querySelectorAll: 0.12 ms
  • NodeList.length: 0.07 ms
  • Array.from: 0.22 ms
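
The numbers above were presumably gathered by bracketing each step with performance.now(); a sketch of that measurement:

 // Sketch: time each step of materializing the translatable nodes.
 const t0 = performance.now();
 const list = document.querySelectorAll('[data-l10n-id]');
 const t1 = performance.now();    // querySelectorAll cost
 const count = list.length;
 const t2 = performance.now();    // NodeList.length cost
 const nodes = Array.from(list);
 const t3 = performance.now();    // Array.from cost: the 26 ms outlier
 console.log(t1 - t0, t2 - t1, t3 - t2, count, nodes.length);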

It doesn't matter whether we're using querySelectorAll or getElementsByClassName (with a special class for this purpose). Hypothesis: we're the first code that requests these DOM nodes, and that first request is slow. Subsequent iterations are very fast.

After fixing bug 1298974 the NodeList iteration isn't a bottleneck anymore. It turned out that XUL documents have different ready states than HTML and l20n's logic was being run too early, when the DOM wasn't ready yet. With the fix, the first iteration takes less than 5 ms.

Hypothesis 3: Formatting Translations Is Slow

Are we CPU-bound? formatWithFallback is taking 20-25 ms. Much of that seems to be spent handling errors (concatenating them with previously reported ones).

Testing Strategies
  • Hardcode formatted translations in l20n-chrome-xul
  • Throw errors and collect them in a try…catch (sketched below).
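
A sketch of that second strategy; the ctx.format call that throws on failure is an assumption, and the real formatWithFallback signature may differ:

 // Sketch: collect errors in one array via try…catch instead of
 // concatenating a fresh errors array on every call.
 function formatWithFallback(ctx, entities) {
   const errors = [];
   const translations = [];
   for (const entity of entities) {
     try {
       translations.push(ctx.format(entity));
     } catch (err) {
       errors.push(err);              // collect, don't concatenate
       translations.push(entity.id);  // fall back to the identifier
     }
   }
   return [translations, errors];
 }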

I found three major bottlenecks:

  • destructuring assignment is slow in hotter loops,
  • formatting traits bails out of the simple-string optimization because the entire trait object is passed to format() instead of its val,
  • formatting entities with null values and no default traits is slow because we check each trait for being the default one deep in the generator-enabled code of MessageContext; we could make this check much earlier.

I filed bug 1296618 to fix these.
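
To illustrate the first bottleneck (the loop shape here is an assumption, not the actual l20n.js code): destructuring allocates on every iteration, while plain index access does not.

 // Sketch: destructuring in a hot loop vs. plain index access.
 const entries = [['brand-name', {}], ['welcome', {}]];
 const process = (id, entity) => { /* format id/entity here */ };

 // Slower in hot paths at the time:
 for (const [id, entity] of entries) {
   process(id, entity);
 }

 // Faster:
 for (let i = 0; i < entries.length; i++) {
   process(entries[i][0], entries[i][1]);
 }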

Hypothesis 4: Async Is Slow

Do we yield to other code in the event loop queue when we call then()? Unlikely. In the scenario where we trigger document translation on DOMContentLoaded, the IO has finished much earlier and the l10n.interactive promise is resolved. then() callbacks are executed in microtasks on the same turn of the event loop without yielding to other code in the queue.

Testing Strategies
  • Promise.resolve().then(() => performance.mark('…')) instead of translateRoot()
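
A standalone demonstration of this microtask behavior (not l20n code):

 // then() callbacks run as microtasks on the same event-loop turn,
 // before timers and other queued tasks.
 performance.mark('sync-start');
 Promise.resolve().then(() => performance.mark('microtask'));
 setTimeout(() => performance.mark('task'), 0);
 performance.mark('sync-end');
 // Resulting order: sync-start, sync-end, microtask, task.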

Hypothesis 5: Lazy-to-Actual XULElements Conversion

Are we working against XUL layout optimizations? XUL elements are first created as XULPrototypeElements, which carry just enough information to be useful for layout. When they're accessed from JS code they are converted to actual XULElements. The hypothesis is that at least two things trigger this conversion before tpaint when a new browser.xul window is created: our code and something else, let's call it X. The first piece of code to make the call pays the lazy-to-actual tax. If it's us (because we start very early, e.g. when we trigger on MozBeforeLayout), we take a lot of time (see Hypothesis 2!) but still finish pretty early; then X runs faster and tpaint happens at 270 ms (on stas's machine). If it's not us (e.g. when we trigger on DOMContentLoaded), then X takes more time; we run later but finish faster (we take 20 ms to translate everything), and... tpaint happens at 270 ms. For tpaint it only matters whether the tax has been paid. Below are the rough numbers in milliseconds:

 Trigger                 l20n-start  io-start  io-end  trigger/translateRoot-start  formatting-start (async)  formatting-end  translateRoot-end  tpaint
 ------------------------------------------------------------------------------------------------------------------------------------------------------
 DOMContentLoaded                 8        10      35                          160                       165             175                180     270
 MozBeforeLayout                  8        10      35                           40                        65             135                140     270
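
The two trigger points compared above, expressed as event listeners (a sketch; translateRoot is stubbed here to stand for l20n's document translation entry point):

 // translateRoot stands for l20n's entry point for translating a root;
 // stubbed so the sketch is self-contained.
 function translateRoot(root) { /* format and apply translations */ }

 // The late trigger:
 document.addEventListener('DOMContentLoaded', () => translateRoot(document));
 // ...or the much earlier, XUL-specific one:
 window.addEventListener('MozBeforeLayout', () => translateRoot(document));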

Hypothesis 6: DOM Translation Is Slow

This is similar to H3, but larger in scope. Are we CPU-bound during document translation? translateRoot takes 20 ms from start to finish on stas's machine, very consistently (in the DOMContentLoaded scenario). This lines up with the following tpaint numbers:

                        DOMContentLoaded  translateRoot-end  tpaint
 -----------------------------------------------------------------
 central                            160                        250
 larch                              160                180     270

I drilled down on these 20 ms in Cleopatra:

  1. node iteration: 5ms,
  2. juggling nodes into a map keyed by Localization: 3ms,
  3. formatting: 8ms,
  4. applyTranslations: 4ms.

Testing Strategies
  • replace formatEntities with formatEntity (see the sketch after this list); H4 shows async is not a problem and we could shave off a few milliseconds from #2 above,
    • passing an array of many promises to Promise.all turns out to be costly: 5-7 ms.
    • if we'd want to apply translations individually, we'd need to disconnect/reconnect the observer for each node and that seems to cost ~3ms.
    • so even with groupElementsByLocalization removed, we end up with +10ms.
  • replace waiting for a trigger (DOMContentLoaded, MozBeforeLayout) with a MutationObserver; if we always take 20ms maybe we can take it earlier; if it's too early however we might run into the problem described in H5.
    • because of how XUL documents are created the MutationObserver creates a lot of MutationRecords (1000+), one for each node; see bug 1294197; the perf hit is huge and hard to measure but probably around 200ms.
  • try microoptimizations:
    • errors.push(...moreErrors) might be expensive,
    • for-of loops in hot paths might be expensive,
    • detect null values in the parser.
    • I tested all three together and was able to get down from 20ms to 16ms on my machine. try also shows small perf wins, but they're too small to reason about reliably (-8ms on tpaint looks pretty accurate, though, assuming try has slower machines)
  • if node iteration is fast (H2) maybe we could run one querySelectorAll per known Localization, rather than a single one followed by grouping nodes by Localization?
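
A sketch of the first strategy's two shapes; only the names formatEntities, formatEntity and l10n.interactive come from this page, the bodies and the bundle.format call are assumptions:

 // Current shape (sketch): one Promise for the whole batch of keys.
 function formatEntities(l10n, keys) {
   return l10n.interactive.then(bundle => keys.map(key => bundle.format(key)));
 }

 // Proposed shape (sketch): one Promise per entity. It avoids grouping,
 // but Promise.all over many promises (5-7 ms) plus per-node observer
 // disconnect/reconnect (~3 ms) eat the savings, netting about +10 ms.
 function formatEntity(l10n, key) {
   return l10n.interactive.then(bundle => bundle.format(key));
 }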

Hypothesis 7: Returning Errors from MessageContext Is Slow

I noticed in Cleopatra that entitiesFromContext takes ~6ms but only spends 1ms in format. Are we paying for transforming the data structure we get from the parser into a { value, attrs } object that the XUL bindings consume? It looks like this is MessageContext.format's fault after all: creating an empty errors array is costing us 2-3 ms. If I instead pass in an array argument for errors to be appended to, the whole thing takes 10-11 ms (13-14 ms before). I filed bug 1301198.
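
A minimal sketch of that fix, assuming MessageContext.format accepts an errors array to append to (the exact signature is an assumption):

 // Sketch: allocate the errors array once and share it across calls,
 // instead of letting format() create a fresh one per message.
 function formatAllMessages(ctx, messages, args) {
   const errors = [];  // allocated once
   const results = messages.map(msg => ctx.format(msg, args, errors));
   return [results, errors];
 }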