Necko/MobileCache/MicroBenchmarks

=== Telemetry vs microbenchmarks ===


There has been some discussion about using telemetry instead of xpcshell-based microbenchmarks. Current thinking is that they are complementary: telemetry captures real-life browsing patterns on real-life platforms and environments, whereas microbenchmarks run artificial browsing patterns on a controlled platform in a lab-environment. Some points:


* It is impractical to experiment with code-changes using telemetry to measure the effect - a benchmark in the lab is much more practical for this.
* Using telemetry it is very difficult (if not impossible) to get the context of your measurements. For example, suppose you measure the time to evict a cache-entry from disk-cache; for such a measurement to make sense you also need to know the number of entries in the cache and the total size of the cache. This context-information is hard to get using telemetry alone, whereas a microbenchmark can record it explicitly (see the sketch after this list).
* On the other hand, telemetry mirrors the experience of real users and is the (only?) way to ensure that improvements are valid in real life.
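
The context-point above is one place where a microbenchmark has a clear advantage: it controls its own environment and can therefore report the relevant context next to each number. The sketch below is a minimal, hypothetical xpcshell-style illustration - populateCache() and evictOneEntry() are placeholders for whatever cache-operations the real benchmark exercises, not actual Necko APIs.

<pre>
// Minimal sketch (hypothetical helpers, not real Necko APIs): time an operation
// together with the context that makes the number meaningful.
function runEvictionBenchmark(numEntries, totalSizeKB) {
  populateCache(numEntries, totalSizeKB);   // placeholder: fill the cache to a known state

  let start = Date.now();
  evictOneEntry();                          // placeholder: the operation we want to time
  let elapsed = Date.now() - start;

  // In the lab we know the context, so we can report it alongside the timing.
  dump("evict-one-entry: " + elapsed + " ms" +
       " (entries=" + numEntries + ", totalSize=" + totalSizeKB + " kB)\n");
}
</pre>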
 
Two very different factors determine cache-performance:

; Performance of the platform (efficiency of CPU, memory- and disk-speed etc.): We deal with this by implementing efficient datastructures, various file-schemes, multi-threading etc. Measuring this is the focus of microbenchmarks.
; Browsing patterns (how often users revisit sites, how deep into sites they browse etc.): We deal with this by scaling caches and implementing smart replacement-algorithms. This factor is not taken into account by microbenchmarks - only telemetry handles it.

Hence, current thinking is that telemetry and microbenchmarks are complementary mechanisms suitable for different purposes. Microbenchmarks will be used as described in the section above, and telemetry will be used for the following:

==== Identify general areas of interest ====
Telemetry will provide lots of data, and an important job is to read and analyze it. We expect to see unexpected patterns in this data, and such patterns should trigger the creation of microbenchmarks to investigate specific areas.


==== Tune parameters to make microbenchmarks more realistic ====
Conditions in the lab are bound to be different from those seen by real users. However, in the lab we control most parameters, and by knowing realistic values for these parameters we can tune our lab-setup to similar conditions, making our lab-testing more realistic. We can use telemetry to find such values. Examples include (see the sketch after the list):


* Bandwidth: What is the typical bandwidth seen by a user? In the lab we normally load resources from the local host...
* Latency/RTT: How long does it typically take from sending a request to the server until the response starts to appear?
* Cache-sizes: What is the typical cache-size out there?
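
As a rough sketch of how such telemetry-derived values could be plugged into the lab-setup (assumptions: the numeric values below are placeholders for whatever telemetry reports, the test-server is an httpd.js HttpServer as normally used in xpcshell tests, and delaying responses is just one simple way to emulate RTT - bandwidth-shaping would need more than this; browser.cache.disk.capacity is the real disk-cache capacity pref, in kB):

<pre>
// Sketch only: apply telemetry-derived values to the lab environment.
Components.utils.import("resource://gre/modules/Services.jsm");

// Placeholder values - to be replaced by numbers obtained from telemetry.
const TYPICAL_CACHE_CAPACITY_KB = 250 * 1024;  // e.g. a typical disk-cache size in the field
const TYPICAL_RTT_MS = 120;                    // e.g. a typical request/response round-trip time

// Disk-cache capacity, in kB.
Services.prefs.setIntPref("browser.cache.disk.capacity", TYPICAL_CACHE_CAPACITY_KB);

// Emulate latency against the local test-server (httpd.js) by delaying each response.
function registerDelayedHandler(server, path, body) {
  server.registerPathHandler(path, function(request, response) {
    response.processAsync();
    do_timeout(TYPICAL_RTT_MS, function() {
      response.write(body);
      response.finish();
    });
  });
}
</pre>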


==== Real-life verification of results from the lab ====
Telemetry monitors the real-life efficiency of the cache, hence by monitoring the right values we can ensure that improvements we see in the lab also benefit real users. Exactly which measurements we need is not entirely clear yet (work is in progress to determine this).

The major benefit of this is to have telemetry give us real-life verification '''after''' using synthetic, isolated and focused benchmarks in the lab. I.e. we can use synthetic test-patterns in the lab to identify and qualify code-changes, and after landing these changes we should be able to see predictable effects via telemetry. If we measure performance differently in microbenchmarks and telemetry we may quickly end up "comparing apples and oranges", confusing ourselves.

An important point is that in order to verify results like this we should measure the same (or at least very similar) values in microbenchmarks and in telemetry. We should, of course, also measure other values from telemetry in order to cross-check our results, but to verify lab-improvements in real life we should align measurements. The rationale is as follows: say we want to improve characteristic A. We make a benchmark which measures A, fix the code, see a clear improvement of A in the lab, and push the code-fix. Now suppose telemetry does not measure A but rather some other characteristic B, and B does not improve (or even degrades). Is that because A and B are unrelated, or because A actually did not improve in real life? We want to first verify that A actually improved in real life; then we can discuss why B did not improve, and then decide whether the code-change is worth keeping. Such results will also increase our understanding of how the caching works in general.

Thus, the suggested strategy is to first introduce a telemetry-probe to gather baseline-data for some characteristic we want to improve, then introduce a code-fix which improves that characteristic in the lab, and finally monitor telemetry to verify the improvement seen in the lab. In the last step, the particular characteristic should be monitored as well as other (more general) performance-probes for cross-checking.

One way to look at this alignment: network-performance on a given platform is the product of two factors - the browsing pattern (i.e. which urls are loaded in which sequence) and what exactly is measured. Microbenchmarks and telemetry are inherently different with respect to the first factor, but we can align them with respect to the second by using the same code in telemetry and microbenchmarks to capture data and by interpreting this data in the same way.

Below is a pro/con list for using telemetry-code vs JS time-functions to capture data for microbenchmarks - feel free to add and comment.

{| border="1" cellpadding="5" cellspacing="0" align="center"
|+'''Measure cache-performance using JS vs using telemetry-code'''
|-
!|
!| Time-function in JavaScript
!| Telemetry timestamps compiled into the browser
|-
!| Pros
||
* Add and experiment with measurements without recompiling
||
* Avoid JS and JS-thread jitter
* Better granularity than JS-functions
* Verify changes using results from real users - results are identical to those used in the lab (i.e. we avoid comparing apples and oranges)
|-
!| Cons
||
* Synthetic tests only, no way to '''directly''' verify effects in real life
||
* Increased code-size (<i>also in releases??</i>)
* Must have an idea up-front about timing-profile (histograms)
* For timings which vary a lot we may get a large histogram and thus use a lot of data
|-
|}
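
To make the two columns above concrete, here is a hedged sketch of what each approach could look like inside an xpcshell-style benchmark. The histogram ID used below is made up for illustration; a real probe with a suitable name and bucket layout would have to be defined in the build first (which is exactly the "must have an idea up-front about the timing-profile" con from the table).

<pre>
Components.utils.import("resource://gre/modules/Services.jsm");

// Approach 1: time-function in JavaScript - easy to add and tweak without recompiling,
// but subject to JS and JS-thread jitter and limited granularity.
function timeWithJS(operation) {
  let start = Date.now();
  operation();
  return Date.now() - start;   // elapsed milliseconds
}

// Approach 2: telemetry timestamps compiled into the browser - the benchmark only reads
// what the compiled-in code accumulated while the operation ran, i.e. the same kind of
// numbers that real users report. "HTTP_CACHE_HYPOTHETICAL_PROBE_MS" is a made-up ID.
function readTelemetryProbe(operation) {
  let histogram = Services.telemetry.getHistogramById("HTTP_CACHE_HYPOTHETICAL_PROBE_MS");
  histogram.clear();                  // start from a clean slate for this run
  operation();
  return histogram.snapshot().sum;    // total accumulated time for the run
}
</pre>

Whichever approach is chosen, the point from the previous subsection stands: if the numbers in the lab come from the same accumulation code as the numbers from telemetry, comparing them later is straightforward.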


== The benchmarks ==