Necko/MobileCache/MicroBenchmarks


This Is Work In Progress - Information Is Incomplete

This page describes the cache-related microbenchmarks. Contact BjarneG (bherland@mozilla.com) for further information and/or the benchmarks themselves.

Purpose

The primary purpose of the microbenchmarks is to serve as tools for programmers to evaluate the effect of code-changes on cache-performance: you run the benchmarks, record the results, apply your code-changes, then run the benchmarks again. The goal is that comparing the two results should give a reasonable idea of how your changes affect cache-performance.

Implications of the above include

  • we need to be able to reproduce results rather consistently, i.e. we cannot have (too) different results from different runs
    • we might have to process raw results somehow; perhaps use some statistical properties
    • at the very least we must quickly be able to see variance/deviation in a dataset (see the sketch after this list)
  • what we measure (e.g. which units we use) is not really important - we just need something comparable
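
As a minimal sketch of the statistical processing implied above (a hypothetical helper, not part of any existing benchmark), the per-run timings can be reduced to summary statistics that make two runs directly comparable:

 // Hypothetical helper: reduce an array of per-load timings to the summary
 // statistics (min/max/mean/stddev) used when comparing two benchmark runs.
 function summarize(samples) {
   const n = samples.length;
   const mean = samples.reduce((acc, x) => acc + x, 0) / n;
   const variance = samples.reduce((acc, x) => acc + (x - mean) * (x - mean), 0) / n;
   return {
     n,
     min: Math.min(...samples),
     max: Math.max(...samples),
     mean,
     stddev: Math.sqrt(variance),
   };
 }

 // Example: if the difference between two runs is well within one standard
 // deviation, it is probably noise rather than a real change.
 const before = summarize([27.3, 28.1, 27.8, 29.0]);
 const after = summarize([25.9, 26.4, 26.1, 27.2]);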

A secondary purpose of these microbenchmarks is to uncover performance-problems on new platforms. For example, enabling the disk-cache is rumored to hurt performance on mobile devices, based on running TP4 on one or two devices with and without disk cache enabled. With a reasonable set of benchmarks we can pinpoint more precisely what the problem is, which devices it affects, and do something about it.

Implications of the above include

  • we must somehow be able to identify a bad test-result (i.e. we cannot just report reproducible, abstract numbers - the results must somehow be understandable)
  • we need many measurements
    • search time vs cache sizes vs full/empty cache
    • read-time vs cache-type vs large/small data
    • write-time vs cache-type vs large/small data (writing is off main-thread, but will probably still impact performance on devices?)
    • creating, clearing, adjusting size, dooming entries
  • with many and detailed measurements we most likely also need a way to depict results graphically

Telemetry vs microbenchmarks

There has been some discussion about using telemetry instead of xpcshell-based microbenchmarks. Current thinking is that they are complementary: telemetry reflects real-life browsing patterns on real-life platforms and environments, whereas microbenchmarks are artificial browsing patterns run on a machine in a lab environment. It is impractical to experiment with code-changes and use telemetry to measure the effect - a benchmark in the lab is much more practical for this. On the other hand, telemetry is the (only?) way to ensure that improvements are also valid in real life. Moreover, with telemetry it is very difficult (if not impossible) to get the context of your measurements. For example, suppose you measure the time to evict a cache-entry from the disk-cache; for this measurement to make sense you also need to know the number of entries in the cache and the total size of the cache. This context-information is hard to get using telemetry alone.

We plan to use telemetry as described below.

Identify areas of interest

Telemetry will provide lots of data, and a really important job is to read and analyze it. We expect telemetry to reveal unexpected patterns, and such patterns should trigger microbenchmarks to investigate the specific areas involved.

Tune parameters to make microbenchmarks more realistic

Out of the box, conditions in the lab are bound to be different from what users see in real life. However, in the lab we can control a number of parameters; by knowing realistic values for these parameters we can run tests in the lab under conditions similar to those users experience, making our tests more realistic (a pref-based sketch follows the list below).

Examples of this include

  • Bandwidth: What's the typical bandwidth(s) seen by a user? In the lab we normally load resources from the local host...
  • Latency/RTT: How long does it typically take from sending a request to the server before the response starts to appear?
  • Cache-sizes: What's the typical cache-size out there?
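
Cache-size is the easiest of these to pin down directly; bandwidth and RTT would have to be shaped outside the browser. A sketch, assuming an xpcshell environment where Services.jsm is available (the capacity values are placeholders for whatever telemetry tells us is realistic):

 // Sketch: pin the cache configuration to "typical user" values before a
 // benchmark run. The capacities below are placeholders, not recommendations.
 Components.utils.import("resource://gre/modules/Services.jsm");

 Services.prefs.setBoolPref("browser.cache.disk.enable", true);
 Services.prefs.setBoolPref("browser.cache.memory.enable", true);
 Services.prefs.setIntPref("browser.cache.disk.capacity", 256 * 1024);  // KB
 Services.prefs.setIntPref("browser.cache.memory.capacity", 32 * 1024); // KB

 // Bandwidth and latency cannot be set via prefs; they have to be emulated
 // with network shaping between the test machine and the server.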


Real-life verification of results from the lab

One way to define network-performance on a given platform is as the combination of two factors: the browsing pattern (i.e. which URLs are loaded in which sequence), and what exactly is measured. As discussed above, microbenchmarks and telemetry are inherently different with respect to the browsing pattern, but we can do our best to align telemetry and microbenchmarks with respect to the second factor. Put differently: we should try to use the same code in telemetry and in the microbenchmarks to capture data, and we should ensure we interpret this data in the same way.

The major benefit of this is to have telemetry give us real-life verification after using synthetic, isolated and focused benchmarks in the lab. I.e. we can use synthetic test-patterns in the lab to identify and qualify code-changes, then after landing code-changes we should be able to see predictable effects of these changes via telemetry. If we measure performance differently in microbenchmarks and telemetry we may quickly end up "comparing apples and oranges", confusing ourselves.

Below is a pro/con list for using telemetry-code vs JS time-functions to capture data for microbenchmarks - feel free to add and comment.

Measure cache-performance using JS vs using telemetry-code

Time-function in JavaScript
  Pros
    • Add and experiment with measurements without recompiling
  Cons
    • Synthetic tests only, no way to directly verify effects in real life

Telemetry timestamps compiled into the browser
  Pros
    • Avoid JS and JS-thread jitter
    • Better granularity than JS-functions
    • Verify changes using results from real users - results are identical to those used in the lab (i.e. we avoid comparing apples and oranges)
  Cons
    • Increased code-size (also in releases??)
    • Must have an idea up-front about the timing-profile (histograms)
    • For timings which vary a lot we may get a large histogram and thus use a lot of data

The benchmarks

Each benchmark below is described and explained, and we also show output from selected platforms. Note that the results are provided as examples only; the benchmarks are meant for evaluating the effect of code-changes, i.e. you run the benchmarks without and with your change and compare the results to evaluate the effect of your code.

test_stresstest_cache_with_timing.js

This test loads a number of different resources of uniform size. The resources are loaded sequentially in a tight loop, and the test syncs with the cache-io thread after finishing the loop. The loop is executed for each of five different cache-configurations/setups:

  1. no cache enabled; load all URLs from the server, writing nothing to the cache (cache cleared before this step)
  2. memory-cache enabled; load all URLs from the server and write them to the cache (cache cleared before this step)
  3. memory-cache enabled; read all URLs from the cache (resources cached in the previous step)
  4. disk-cache enabled; load all URLs from the server and write them to the cache (cache cleared before this step)
  5. disk-cache enabled; read all URLs from the cache (resources cached in the previous step)

This whole procedure is repeated over a given list of datasizes, clearing the cache between each repetition.
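
A rough sketch of how the five setups can be switched between in the test (the prefs are the standard cache prefs, assuming Services is available as in the sketch further above; clearCache() is a hypothetical stand-in for the helper that evicts all entries):

 // Sketch of the five configurations run by the loop.
 function setCacheConfig(memEnabled, diskEnabled) {
   Services.prefs.setBoolPref("browser.cache.memory.enable", memEnabled);
   Services.prefs.setBoolPref("browser.cache.disk.enable", diskEnabled);
 }

 const configs = [
   { desc: "no cache, loaded from server",   mem: false, disk: false, clear: true  },
   { desc: "mem-cache, loaded from server",  mem: true,  disk: false, clear: true  },
   { desc: "mem-cache, read from cache",     mem: true,  disk: false, clear: false },
   { desc: "disk-cache, loaded from server", mem: false, disk: true,  clear: true  },
   { desc: "disk-cache, read from cache",    mem: false, disk: true,  clear: false },
 ];

 for (const cfg of configs) {
   if (cfg.clear) {
     clearCache(); // hypothetical helper: evict everything before this setup
   }
   setCacheConfig(cfg.mem, cfg.disk);
   // ... load all URLs in a tight loop, then sync with the cache-io thread ...
 }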

Time is measured in two ways: 1) for each individual load, using the nsITimingChannel interface of the channel, and 2) by measuring the run-time of the complete loop from the test itself.
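
A sketch of the first kind of measurement, assuming the channel implements the timed-channel interface (nsITimedChannel in the tree; the attribute names should be double-checked against the IDL):

 // Sketch: pull per-load timestamps off a finished channel. The timestamps
 // are PRTime values (microseconds); an attribute is 0 if the corresponding
 // phase never happened (e.g. no cache read for a pure network load).
 function getLoadTimings(channel) {
   const tc = channel.QueryInterface(Components.interfaces.nsITimedChannel);
   return {
     created:        tc.channelCreationTime,
     asyncOpen:      tc.asyncOpenTime,
     cacheReadStart: tc.cacheReadStartTime,
     cacheReadEnd:   tc.cacheReadEndTime,
     responseStart:  tc.responseStartTime,
     responseEnd:    tc.responseEndTime,
   };
 }

 // The second kind of measurement is simply Date.now() around the whole loop:
 //   const loopStart = Date.now();
 //   ... issue all loads, wait for completion, sync with the cache-io thread ...
 //   const loopTime = Date.now() - loopStart;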

Informally, one can say that this test loads a large number of uniformly sized resources at the same time, somewhat similar to what happens when starting Firefox with lots of open tabs.

Generated datafiles

This test writes data into a number of files. Note that all files are appended to, meaning that if they exist prior to the run you will end up with strange data...

summary.dat
Format is multi-data with one datablock for each datasize run by the test. Each datablock has a header describing the datasize, then one row for each of the five configurations summing up the results from the loop. Datapoints are
  1. description of the configuration
  2. minimum value from the loop
  3. maximum value from the loop
  4. calculated mean over the results from the loop
  5. standard deviation over all data from the loop
  6. number of iterations in the loop
  7. time measured by the test using Date.now() for the loop
  8. time measured by the test using Date.now() for the loop, including syncing with the cache-io thread at the end
telemetry-<bytes>.dat
One file for each datasize run by the test. The format is multi-data with one datablock for each configuration. Each row represents one resource-load in actual order. The datapoints in each row are
  1. time from create to asyncOpen
  2. time from asyncOpen to cacheEntryAvailable
  3. time from cacheEntryAvailable to cacheReadStart
  4. time from cacheReadStart to cacheReadEnd

alternatively (if loaded over the network, not from the cache)

  1. time from create to asyncOpen
  2. time from asyncOpen to cacheEntryAvailable
  3. time from cacheEntryAvailable to responseStart
  4. time from responseStart to responseEnd

I.e. referring to the Telemetry-data accumulated by nsLoadGroup, (2+3) corresponds to Telemetry::HTTP_PAGE_OPEN_TO_FIRST_FROM_CACHE alternatively Telemetry::HTTP_PAGE_OPEN_TO_FIRST_RECEIVED, and (2+3+4) corresponds to Telemetry::HTTP_PAGE_COMPLETE_LOAD.
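
As a small illustration of that correspondence (a hypothetical helper operating on one row of the four datapoints above, all in the same time unit):

 // row = [createToAsyncOpen, asyncOpenToEntry, entryToStart, startToEnd],
 // i.e. datapoints 1-4 from telemetry-<bytes>.dat as described above.
 function deriveLoadGroupNumbers(row) {
   const [, openToEntry, entryToStart, startToEnd] = row;
   return {
     // HTTP_PAGE_OPEN_TO_FIRST_FROM_CACHE (cache load) or
     // HTTP_PAGE_OPEN_TO_FIRST_RECEIVED (network load)
     openToFirst: openToEntry + entryToStart,
     // HTTP_PAGE_COMPLETE_LOAD
     completeLoad: openToEntry + entryToStart + startToEnd,
   };
 }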


Files to work with GnuPlot (requires Gnuplot >= 4.4)

settings.gnu
setup for the histogram, title of plot etc. Edit this to describe your platform.
plot-summary.gnu
Plots mean-values from summary.dat in one histogram, grouped by first column. Standard deviation in the dataset is indicated as an errorline. This is the recommended starting-point for studying results from a test-run.
plot-minmax-summary.gnu
Like above but indicates min and max values from the dataset instead of stddev.
plot-summary-per-size.gnu and plot-minmax-summary-per-size.gnu
Plots summaries like the two plots above, but makes one plot for each datasize to avoid scaling-issues in the plot.
plot-summary-from-javascript.gnu
This is WIP. Plots summaries of average loading-times seen from JS.
plot-telemetry.gnu
You need to copy the file "telemetry-<size>.dat" to "telemetry.dat" for the size you want to study. Plots detailed information for each resource loaded, one plot for each configuration. The format used is a rowstacked histogram where each column shows total time from creating the channel to loading has finished. Note that columns show time for each load separately (all times starting at 0) as opposed to a real time-line for the dataset.

test_timing_cache.js (Obsolete)

This benchmark measures time to load a resource from a server and from the cache, as well as time for call-overhead and clearing the cache.

The approach is to repeat a single load in a loop for some time and report the average time for one iteration. This eliminates timer-granularity issues and should smooth out random jitter in the system, but the weakness is that IO-caching at any level will influence disk-cache results. Moreover, memory-cache results are likely to be influenced by HW caches and memory-paging in the OS. In short, this benchmark does not implement a very realistic access-pattern for cache-entries, hence its practical value may be limited. It can be useful, though, to study changes unrelated to the speed of the cache media, e.g. code to optimize datastructures or algorithms.
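
The repeat-and-average structure can be sketched roughly as follows (the 10-second budget matches the totals in the output below; runOnce() stands in for whichever operation is measured and calls its argument when the operation has finished):

 // Sketch: repeat an operation until the time budget is used up, then report
 // total time, number of runs and average time per run (the three columns in
 // the output below).
 function runForBudget(msBudget, runOnce, report) {
   const start = Date.now();
   let runs = 0;
   function next() {
     const elapsed = Date.now() - start;
     if (elapsed >= msBudget) {
       report(elapsed, runs, elapsed / runs);
       return;
     }
     runs++;
     runOnce(next); // runOnce() invokes next() once the operation completes
   }
   next();
 }

 // e.g. runForBudget(10000, loadResourceFromCache, (total, runs, avg) => { ... });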

Note that there are two different tests for cache-misses: one which clears the cache in each iteration, and one where the response has a header which prevents it from being cached. (The first is mainly present for completeness - it makes a clearCache() call in each iteration, which is not very likely to happen in real life.) One could also imagine a third approach to emulating cache-misses: suffix the URL with a different query-string in order to load a new resource in each iteration. However, this doesn't work here because it hits the maximum number of open connections and times out. (See the test below, though, where we can control this limit in a different manner.)
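
A sketch of the header-based miss variant, assuming the standard xpcshell HTTP server (httpd.js); the path, port and body are placeholders:

 // Sketch: serve a response that must not be cached, so that every load of
 // the URL is a genuine cache miss without clearing the cache between loads.
 Components.utils.import("resource://testing-common/httpd.js");

 const server = new HttpServer();
 server.registerPathHandler("/uncacheable", function (request, response) {
   response.setStatusLine(request.httpVersion, 200, "OK");
   response.setHeader("Cache-Control", "no-store", false);
   const body = "x".repeat(50); // 50-byte payload, matching the smallest datasize
   response.bodyOutputStream.write(body, body.length);
 });
 server.start(-1); // pick any free port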

All operations are reported for different data-sizes, as well as with no cache enabled, memory-cache enabled only and disk-cache enabled only. (Note that, theoretically, searching for an entry in a cache should take some time, hence we measure also with no cache enabled.)

Some results from a Linux server running 64-bit Ubuntu 10.4

Memory-cache capacity: 51200 Kb / Disk-cache capacity: 1048576 Kb
Max #connections=256 / max per server=256
                                       Total time      #runs   avg time/run
=== Set datasize 50 bytes
Calloverhead            (no-cache)      10016           14594   0.69
Calloverhead+ClearCache (no-cache)      10015           11647   0.86
CacheMiss (nocache)     (no cache)      10047           368     27.30
Calloverhead            (mem-cache)     10015           14532   0.69
Calloverhead+ClearCache (mem-cache)     10016           11616   0.86
CacheMiss (clrcache)    (mem-cache)     10018           360     27.83
CacheMiss (nocache)     (mem-cache)     10032           367     27.34
CacheHit                (mem-cache)     10017           4674    2.14
Calloverhead            (disk-cache)    10015           13954   0.72
Calloverhead+ClearCache (disk-cache)    10017           6723    1.49
CacheMiss (clrcache)    (disk-cache)    10033           346     29.00
CacheMiss (nocache)     (disk-cache)    10032           357     28.10
CacheHit                (disk-cache)    10017           4709    2.13
=== Set datasize 1024 bytes
CacheMiss (nocache)     (no cache)      10020           339     29.56
CacheMiss (nocache)     (mem-cache)     10037           351     28.60
CacheHit                (mem-cache)     10019           3634    2.76
CacheMiss (nocache)     (disk-cache)    10048           345     29.12
CacheHit                (disk-cache)    10018           3535    2.83
=== Set datasize 51200 bytes
CacheMiss (nocache)     (no cache)      10095           103     98.01
CacheMiss (nocache)     (mem-cache)     10023           104     96.38
CacheHit                (mem-cache)     10070           306     32.91
CacheMiss (nocache)     (disk-cache)    10110           104     97.21
CacheHit                (disk-cache)    10054           301     33.40
=== Set datasize 524288 bytes
CacheMiss (nocache)     (no cache)      10277           14      734.07
CacheMiss (nocache)     (mem-cache)     10312           14      736.57
CacheHit                (mem-cache)     10443           31      336.87
CacheMiss (nocache)     (disk-cache)    10283           14      734.50
CacheHit                (disk-cache)    10451           31      337.13

The important numbers are found in the rightmost column. Note that the first block (datasize 50 bytes) includes more results than the other blocks: "Calloverhead" measures the call-overhead from JavaScript, and "Calloverhead+ClearCache" measures the time to clear the cache from JavaScript. These numbers are unrelated to the size of the data loaded and are hence reported only once.

Results above indicate that reading from the disk-cache is about as fast as reading from the memory-cache, which is probably due to efficient IO-caching on this platform. Note also that cache-hits for small entries are relatively much faster than cache-hits for larger entries: for small entries a hit is roughly a factor 10 faster than a miss, whereas for large entries the factor is only 2-3.

There is no conclusive evidence that searching a disk or memory cache takes any significant amount of time. However, the caches in this test are pretty much empty so we should create a separate test for this, making sure the cache contains some number of entries before we measure. (See #Wanted_tests below)

Some raw results from my Nexus S. I find it weird that this device outperforms the Linux server when the data-size becomes larger... any insight is appreciated

Memory-cache capacity: 51200 Kb / Disk-cache capacity: 1048576 Kb
Max #connections=256 / max per server=256
                                       Total time      #runs   avg time/run
=== Set datasize 50 bytes
Calloverhead            (no-cache)      10016           10878   0.92
Calloverhead+ClearCache (no-cache)      10016           9482    1.06
CacheMiss (nocache)     (no cache)      10020           321     31.21
Calloverhead            (mem-cache)     10015           10975   0.91
Calloverhead+ClearCache (mem-cache)     10016           8937    1.12
CacheMiss (clrcache)    (mem-cache)     10036           371     27.05
CacheMiss (nocache)     (mem-cache)     10021           348     28.80
CacheHit                (mem-cache)     10016           4639    2.16
Calloverhead            (disk-cache)    10016           11426   0.88
Calloverhead+ClearCache (disk-cache)    10043           213     47.15
CacheMiss (clrcache)    (disk-cache)    10034           117     85.76
CacheMiss (nocache)     (disk-cache)    10019           347     28.87
CacheHit                (disk-cache)    10017           4206    2.38
=== Set datasize 1024 bytes
CacheMiss (nocache)     (no cache)      10049           332     30.27
CacheMiss (nocache)     (mem-cache)     10024           339     29.57
CacheHit                (mem-cache)     10017           3501    2.86
CacheMiss (nocache)     (disk-cache)    10031           327     30.68
CacheHit                (disk-cache)    10017           3308    3.03
=== Set datasize 51200 bytes
CacheMiss (nocache)     (no cache)      10064           119     84.57
CacheMiss (nocache)     (mem-cache)     10074           119     84.66
CacheHit                (mem-cache)     10049           396     25.38
CacheMiss (nocache)     (disk-cache)    10078           116     86.88
CacheHit                (disk-cache)    10031           426     23.55
=== Set datasize 524288 bytes
CacheMiss (nocache)     (no cache)      10572           19      556.42
CacheMiss (nocache)     (mem-cache)     10052           18      558.44
CacheHit                (mem-cache)     10274           45      228.31
CacheMiss (nocache)     (disk-cache)    10167           18      564.83
CacheHit                (disk-cache)    10177           43      236.67

This is one way to represent the results

Summary of results over different platforms (avg time/run; datasizes in bytes)

                No cache                       Memory-Cache                   Disk-Cache
                50     1024   51200  524288    50     1024   51200  524288    50     1024   51200  524288
Ubuntu 64bit server
  Clear cache   0.86                           0.86                           1.49
  Miss          27.30  29.56  98.01  734.07    27.34  28.60  96.38  736.57    28.10  29.12  97.21  734.50
  Hit           n/a                            2.14   2.76   32.91  336.87    2.13   2.83   33.40  337.13
Nexus S
  Clear cache   1.06                           1.12                           47.15
  Miss          31.21  30.27  84.57  556.42    28.80  29.57  84.66  558.44    28.87  30.68  86.88  564.83
  Hit           n/a                            2.16   2.86   25.38  228.31    2.38   3.03   23.55  236.67
Samsung Galaxy S (from Geoff)
  Clear cache   1.89                           1.87                           1339.75
  Miss          31.33  27.48  74.04  524.20    27.27  28.32  75.05  529.32    28.04  28.43  93.92  612.59
  Hit           n/a                            2.67   3.04   21.67  209.69    3.03   3.37   24.37  218.61


and this is another representation. I prefer the latter since it allows me to more easily study the effect of different cache-configurations on a particular testcase

Summary 2 of results over different platforms (avg time/run; datasizes in bytes)

                                                  50      1024    51200   524288
Ubuntu 64bit server            Miss (no-cache)    27.30   29.56   98.01   734.07
                                    (mem-cache)   27.34   28.60   96.38   736.57
                                    (disk-cache)  28.10   29.12   97.21   734.50
                               Hit  (mem-cache)   2.14    2.76    32.91   336.87
                                    (disk-cache)  2.13    2.83    33.40   337.13
Nexus S                        Miss (no-cache)    31.21   30.27   84.57   556.42
                                    (mem-cache)   28.80   29.57   84.66   558.44
                                    (disk-cache)  28.87   30.68   86.88   564.83
                               Hit  (mem-cache)   2.16    2.86    25.38   228.31
                                    (disk-cache)  2.38    3.03    23.55   236.67
Samsung Galaxy S (from Geoff)  Miss (no-cache)    31.33   27.48   74.04   524.20
                                    (mem-cache)   27.27   28.32   75.05   529.32
                                    (disk-cache)  28.04   28.43   93.92   612.59
                               Hit  (mem-cache)   2.67    3.04    21.67   209.69
                                    (disk-cache)  3.03    3.37    24.37   218.61

test_timing_cache_2.js (Obsolete)

This benchmark loads a number of resources with different cache-keys and then waits for all cache-io to finish. Then it loads all resources again (retrieving them from cache this time). Both sequences are measured and reported for various entry-sizes and cache-configurations.

This access-pattern is slightly more realistic than the one in the benchmark described above: a page is likely to load a number of resources from the (same) server, and if the user visits the page a second time, those resources are loaded again (from the cache this time). The benchmark aims to emulate this pattern in a simple way.
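
As a sketch of the distinct-cache-key idea (the base URL and count are placeholders): each query-string variant gets its own cache entry, so the first pass misses and the second pass hits.

 // Sketch: build a list of URLs sharing one server resource but with
 // distinct cache keys via the query string.
 function makeUrlList(baseUrl, count) {
   const urls = [];
   for (let i = 0; i < count; i++) {
     urls.push(baseUrl + "?entry=" + i);
   }
   return urls;
 }

 // First pass: load every URL (all misses), then wait for cache-io to finish.
 // Second pass: load the same list again (all hits) and compare the timings.
 const urls = makeUrlList("http://localhost:4444/data", 101);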

Some results from a Linux server running 64-bit Ubuntu 10.4

Memory-cache capacity: 51200 Kb / Disk-cache capacity: 1048576 Kb
Max #connections=256 / max per server=256
                                       Total time      #runs   avg time/run
Setting datasize 50 bytes
CacheMiss (nocache) (no cache)          2803            101     27.75
CacheMiss (q-url)   (mem-cache)         2767            101     27.40
CacheHit            (mem-cache)         205             101     2.03
CacheMiss (q-url)   (disk-cache)        2783            101     27.55
CacheHit            (disk-cache)        186             101     1.84
Setting datasize 1024 bytes
CacheMiss (nocache) (no cache)          2884            101     28.55
CacheMiss (q-url)   (mem-cache)         2570            101     25.45
CacheHit            (mem-cache)         269             101     2.66
CacheMiss (q-url)   (disk-cache)        2910            101     28.81
CacheHit            (disk-cache)        271             101     2.68
Setting datasize 51200 bytes
CacheMiss (nocache) (no cache)          9546            101     94.51
CacheMiss (q-url)   (mem-cache)         9536            101     94.42
CacheHit            (mem-cache)         3274            101     32.42
CacheMiss (q-url)   (disk-cache)        9543            101     94.49
CacheHit            (disk-cache)        3274            101     32.42
Setting datasize 524288 bytes
CacheMiss (nocache) (no cache)          75942           101     751.90
CacheMiss (q-url)   (mem-cache)         75840           101     750.89
CacheHit            (mem-cache)         27799           101     275.24
CacheMiss (q-url)   (disk-cache)        79318           101     785.33
CacheHit            (disk-cache)        31333           101     310.23

The interesting numbers are found in the rightmost column of the output.

Results above suggest that loading entries up to a certain size from the server (the xpcshell-handler in this case) takes roughly the same time regardless of whether we use no cache, the mem-cache or the disk-cache. Also, when loading entries up to a certain size from the cache, the disk-cache seems to be as fast as the memory-cache. This IMO indicates that IO on this platform is pretty efficient (up to a certain limit), since we would otherwise expect to spend more time when using the disk-cache than when using the mem-cache. Observe that we need 512K-sized resources to see a clear difference between using disk and memory. (The exact limit is not identified, but it may not be very interesting either since it is probably platform-specific.)

Also observe that when reading from the cache, small entries load relatively much faster than larger entries. (Memory-page size?)

Results from my Nexus S. Like above, the performance on large data is surprisingly good compared to the Linux server... any insight?

Memory-cache capacity: 51200 Kb / Disk-cache capacity: 1048576 Kb
Max #connections=256 / max per server=256
                                       Total time      #runs   avg time/run
Setting datasize 50 bytes
CacheMiss (nocache) (no cache)          2724            101     26.97
CacheMiss (q-url)   (mem-cache)         2249            101     22.27
CacheHit            (mem-cache)         126             101     1.25
CacheMiss (q-url)   (disk-cache)        2314            101     22.91
CacheHit            (disk-cache)        146             101     1.45
Setting datasize 1024 bytes
CacheMiss (nocache) (no cache)          2190            101     21.68
CacheMiss (q-url)   (mem-cache)         2206            101     21.84
CacheHit            (mem-cache)         177             101     1.75
CacheMiss (q-url)   (disk-cache)        2382            101     23.58
CacheHit            (disk-cache)        189             101     1.87
Setting datasize 51200 bytes
CacheMiss (nocache) (no cache)          7388            101     73.15
CacheMiss (q-url)   (mem-cache)         7582            101     75.07
CacheHit            (mem-cache)         2205            101     21.83
CacheMiss (q-url)   (disk-cache)        7749            101     76.72
CacheHit            (disk-cache)        2226            101     22.04
Setting datasize 524288 bytes
CacheMiss (nocache) (no cache)          57251           101     566.84
CacheMiss (q-url)   (mem-cache)         57939           101     573.65
CacheHit            (mem-cache)         22888           101     226.61
CacheMiss (q-url)   (disk-cache)        62850           101     622.28
CacheHit            (disk-cache)        22626           101     224.02

Wanted tests

Unordered and informal list of tests we may also want to create

  • variation of the other tests where we start out with a full cache
    • i.e. investigate whether there is a significant impact from evicting entries while we use the cache
  • measure the time to search a cache
    • mem- and disk-only, but also the combination of them, I believe. Make sure we search the cache without finding the entry.
  • identify the entry-size where the disk-cache is clearly slower than the mem-cache
    • probably OS- or device-specific, but it may be worth finding in case we can learn something general from it
  • driver which can take a list of resources and measure the time to load them all [Bjarne works on this]
    • we can extract this info from e.g. Talos or by using telemetry, and it should probably be hierarchical (some resources cause others to load)