Loop/Load Handling

To avoid overwhelming the various servers involved in the Loop service infrastructure, Firefox 34 will include a set of three load management mechanisms. This page describes these mechanisms.

Service Soft Start

See bug 1055319.

This mechanism allows Mozilla to gradually ramp up system load after the feature makes it into release. In a nutshell:

By default, Firefox 34 release and beta clients will have two prefs set to "true": loop.enabled and loop.throttled. Upon first startup, each client will select a random number in the range of 1 to 2^24-2 and write it into the "loop.soft_start_ticket_number" pref. Then, upon this and every subsequent startup, each client will check the value of the "loop.throttled" pref. If it is set to true, the client checks the value of a DNS A record (tentatively "soft-start.loop.services.mozilla.com" -- see bug 1060809), which is required to be in the range - If the record is outside this range, or if there is an error retrieving the A record, then the client does not activate the Loop feature.

If the record is successfully retrieved, then the low 24 bits of the address are treated as a "now serving" number, and compared to the value stored in "loop.soft_start_ticket_number". If the "now serving" number is strictly greater than the client's ticket number, then the feature is activated, and the "loop.throttled" pref is set to false (which will bypass this procedure for all subsequent startups).
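The startup check described above can be sketched in Python as follows. This is only an illustration, not the Firefox implementation: the function names and the dict standing in for the pref store are hypothetical, the DNS lookup is left to the caller (a failed lookup is passed in as None), and the requirement that the first octet be 127 is an assumption taken from the Perl one-liner on this page.

```python
import ipaddress
import random

TICKET_MAX = (1 << 24) - 2  # tickets are drawn from 1 .. 2^24-2


def pick_ticket(rng=random):
    """Choose the one-time value for loop.soft_start_ticket_number."""
    return rng.randint(1, TICKET_MAX)


def now_serving(a_record):
    """Return the low 24 bits of the A record, or None if it is unusable.

    Assumption: a usable record has 127 as its first octet, matching the
    addresses that the Perl one-liner on this page produces.
    """
    if a_record is None:  # lookup failed
        return None
    try:
        addr = int(ipaddress.IPv4Address(a_record))
    except ValueError:  # not a valid IPv4 address
        return None
    if addr >> 24 != 127:
        return None
    return addr & 0xFFFFFF


def check_soft_start(prefs, a_record):
    """Run the startup check; return True if Loop should be active."""
    if not prefs.get("loop.throttled", True):
        return True  # already unthrottled on an earlier startup
    serving = now_serving(a_record)
    if serving is None:
        return False
    if serving > prefs["loop.soft_start_ticket_number"]:
        prefs["loop.throttled"] = False  # skip the check from now on
        return True
    return False
```

Because tickets start at 1 and stop at 2^24-2, a "now serving" value of 0 admits nobody and a value of 2^24-1 admits everybody.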

This allows us to increase load on the system very gradually after launch. The recommended handling of this number is as follows:

  1. Ensure that the TTL for the DNS record is set to a relatively short value, so that changes propagate through the system rapidly. The recommended value is in the range of 600 to 3600 seconds (10 minutes to an hour).
  2. When initially launched, set the load level to 10%. Leave it at that level for at least 24 hours and observe server load.
  3. If server utilization is sufficiently low, increase the load level incrementally, waiting at least 24 hours between each change to ensure that server load can settle.
  4. Once server load is ramped all the way to 100%, file a bug to remove the throttling logic from the Loop feature.

For easy reference, the following table calculates the IP address values for loads from 0% to 100%, in 5% increments:

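Load (%)   Load (24-bit integer)   IP Address
0%         0                       127.0.0.0
5%         838860                  127.12.204.204
10%        1677721                 127.25.153.153
15%        2516582                 127.38.102.102
20%        3355443                 127.51.51.51
25%        4194303                 127.63.255.255
30%        5033164                 127.76.204.204
35%        5872025                 127.89.153.153
40%        6710886                 127.102.102.102
45%        7549746                 127.115.51.50
50%        8388607                 127.127.255.255
55%        9227468                 127.140.204.204
60%        10066329                127.153.153.153
65%        10905189                127.166.102.101
70%        11744050                127.179.51.50
75%        12582911                127.191.255.255
80%        13421772                127.204.204.204
85%        14260632                127.217.153.152
90%        15099493                127.230.102.101
95%        15938354                127.243.51.50
100%       16777215                127.255.255.255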

I'm a crusty old perl programmer, so the suggestion I have for generating the IP address for an arbitrary load value looks like this; I'm sure there are more elegant solutions in other languages, perl being what it is. Simply replace the "50" at the end with the load level you'd like to get an IP address for:

perl -e 'print join(".",unpack("C*",pack("N",127<<24|int(((1<<24)-1)*(shift)/100))))."\n"' 50
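For anyone who would rather not read Perl, here is a rough Python equivalent of the one-liner above. The function name load_to_ip is made up for this example; the arithmetic (scale the percentage to a 24-bit integer, truncate, and put 127 in the top octet) mirrors what the Perl does.

```python
import ipaddress


def load_to_ip(percent):
    """Convert a load percentage (0 to 100) to the throttle IP address."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Scale to the 24-bit "now serving" range and truncate, as int() does in Perl.
    serving = int(((1 << 24) - 1) * percent / 100)
    # Place 127 in the top octet and the 24-bit value in the low three octets.
    return str(ipaddress.IPv4Address((127 << 24) | serving))


if __name__ == "__main__":
    print(load_to_ip(50))  # prints 127.127.255.255
```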

This process repeats for Firefox 35, although instead of activating the feature, the throttle will be used to move the button from being in the customization palette by default to being in the toolbar by default.

Recommended Handling

Week of October 13 - While Firefox 34 is in Beta, we want to test the throttle's proper functioning, while not leaving the feature disabled for many users for a long time. Early beta of 33 had the feature active and in the default toolbar set, so we already have a good idea that we can handle the load of setting the throttle to 100%. To that end, we should set the throttle to 10% on initial release, and then increase it every second business day. My proposal is to increase by doubling every second day until we reach 100%; that is: 10% on day 1, 20% on day 3 (bug 1087397), 40% on day 5 (bug 1087610), and 100% on day 7 (bug 1087616).

Week of November 19 - As we approach the uplift of 34 to release, we want to reduce the throttle to a lower value, around 10% (bug 1103028). I propose that we do this several days out, to make sure we don't run into DNS caching issues.

Approximately November 19 - December 30 - Gradually increase the throttle (20% on Dec 8: bug 1108745; 40% on Dec 10: bug 1109777; 80% on Dec 17: bug 1112820; 100% on Dec 19: bug 1113838). Observe server load and behavior (Loop server, Simple Push server, and TokBox infrastructure). Once we are convinced the servers are equipped to handle the load, we increase the throttle according to how much headroom we believe the servers have. On any given increase, I would not recommend increasing the load by more than a factor of 2. I would recommend giving at least 48 hours for the load to settle before any subsequent adjustments to the throttle value. During this time period, release users will gain access to the feature, and beta users will have the button appear in their default toolbar.

Week of January 5 - Leading up to the uplift of 35 to release, we again reduce the throttle to 10% (bug 1118312). Note that the throttle acts as a ratchet: once active, the service remains active. Reducing the throttle will not cause the feature to disappear from 34 browsers.

Approximately January 15 - February 15 - Repeat the throttle increase process described above (20%: January 22, bug 1124908; 40%: February 17, bug 1133106; 100%: February 19, bug 1134230).

Approximately February 15 - bug 1073218 should land in Firefox 36, which will be in release around this date. We still want to give people time to upgrade before leaving users on 35 without a button in the default toolbar.

Date TBD - Remove the throttle DNS record. There is very little cost to keeping the DNS record around, but we don't want it to linger indefinitely. Once the percentage of users on Firefox 35 drops to an inconsequential level, we can pull this record out.
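The two rules of thumb given for the release ramp-up (no more than doubling the load in one step, and at least 48 hours between adjustments) can be written down as a small pre-flight check. This is a hypothetical helper for whoever files the DNS change, not part of any existing tooling; the names are invented for the example, and the limits are recommendations rather than hard requirements.

```python
from datetime import datetime, timedelta

MAX_FACTOR = 2                    # recommended cap on a single increase
MIN_SETTLE = timedelta(hours=48)  # recommended wait between changes


def check_increase(current_pct, proposed_pct, last_change, now):
    """Return a list of reasons the proposed step departs from the guidance."""
    problems = []
    if proposed_pct > current_pct * MAX_FACTOR:
        problems.append("increase is more than a factor of 2")
    if now - last_change < MIN_SETTLE:
        problems.append("less than 48 hours since the last change")
    return problems


if __name__ == "__main__":
    last = datetime(2014, 12, 8)
    # Going from 20% to 40% two days later raises no objections.
    print(check_increase(20, 40, last, last + timedelta(hours=48)))  # prints []
```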

If you'd like to keep an eye on the throttle without having to jump through several hoops, you can install the simple Firefox and Thunderbird Add-On that I created to keep tabs on the value. It's a bit of a hack, in that it requires you to drag the display out of the customize palette to wherever you want it to be, and then to restart the browser before it starts working.

Servers to Watch

When in production, we should check on the health of the following servers before increasing the throttle, and use their state to determine how much we're comfortable increasing the value:

  • Loop Server (load, database size)
  • FxA server (load)
  • Simple Push server (load, connection count)
  • TokBox infrastructure (actual servers and metrics to be determined by TokBox)

Simple Push Server Load Distribution

See bug 1055139 and bug 1055143.

Currently, the push servers have been measured to handle up to 5 million users per cluster. With potentially several hundred million users ultimately using the Loop service, we need the ability to scale to use multiple simple push clusters.

To support this behavior, the Loop client will query the Loop server for the address of the Simple Push server to use, rather than using the Simple Push server configured in the client prefs. The Loop server will maintain a list of Simple Push clusters, and hand out a random entry from this list each time a client requests the address of a Simple Push server. In this way, we can bring additional Simple Push clusters online to accommodate the load required to support the Loop service.
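The server-side selection described above amounts to a uniform random pick from a configured list. The sketch below only illustrates that idea; the cluster URLs and the function name are placeholders, and the real Loop server may store and expose the list differently.

```python
import random

# Hypothetical cluster addresses; a real deployment would load these from config.
PUSH_CLUSTERS = [
    "wss://push-1.example.invalid/",
    "wss://push-2.example.invalid/",
    "wss://push-3.example.invalid/",
]


def pick_push_server(clusters, rng=random):
    """Return one Simple Push cluster address, chosen uniformly at random."""
    if not clusters:
        raise LookupError("no Simple Push clusters configured")
    return rng.choice(clusters)
```

With a uniform pick, adding a fourth entry to the list sends roughly a quarter of new requests to the new cluster without any client-side change.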

Simple Push Server Load Reduction

See bug 1060610.

The final approach to reduce the load on the Simple Push infrastructure, especially as curious users investigate the meaning of a new button that has just appeared in their browser, is to defer setting up a simple push connection until the user performs an action that might result in receiving a call. This can happen if the user copies a link from the Loop panel (either with the "copy" button or by highlighting the link and using OS-specific means of copying to the clipboard), or if the user selects the "email link" button.

Users can also receive calls once they are logged in to an FxA account for Loop, so any logged-in user will maintain a simple push subscription as well.

This prevents unnecessary simple push connections from being established by clients that aren't actually capable of receiving a call at any given moment.
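The deferral can be pictured as a connection that is opened lazily, the first time one of the triggering actions happens. The class and method names below are invented for illustration and do not correspond to the actual Loop client code; the connect callback stands in for whatever really opens the push connection.

```python
class PushRegistration:
    """Open the simple push connection only when a call becomes possible."""

    def __init__(self, connect):
        self._connect = connect  # callable that opens the push connection
        self.connected = False

    def _ensure_connected(self):
        # Connect at most once, no matter how many triggers fire.
        if not self.connected:
            self._connect()
            self.connected = True

    # Each of these user actions might lead to an incoming call.
    def on_link_copied(self):
        self._ensure_connected()

    def on_email_link(self):
        self._ensure_connected()

    def on_fxa_login(self):
        self._ensure_connected()
```

A user who only opens the panel out of curiosity never triggers any of these handlers, so no connection is made on their behalf.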