Labs/Weave/Service/Scaling/Projects/Aragog

From MozillaWiki

This project is the first in a series aimed at scaling services.

The scope of this project is to get a production-quality Sync 1.0 service capable of handling at least 1 million users.

At a high level, this means we need to do several things in the following categories (lists not prioritized):

Server scalability model

  1. Update load profile based on data from server meltdown.

From a quick back-of-the-envelope calculation, our loads currently look like:

  • 150 inserts/sec
  • 170 selects/sec
  • 38 deletes/sec
  • 19 syncs/sec
  • 5400 active users at a given time

We also know that 30K users on a cluster of 1 master and 1 slave is too much.

  2. Document optimal cluster configuration.
  3. Document inflection points as best as possible based on above numbers.
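
As a starting point for the updated load profile, the observed numbers above can be turned into per-active-user rates and projected to other user counts. This is only a rough sketch: it assumes load grows linearly with the number of active users, which the meltdown suggests may not hold near the tipping point, and the function names are made up for illustration.

```python
# Observed loads from the back-of-the-envelope numbers above (operations/sec).
OBSERVED_OPS_PER_SEC = {
    "inserts": 150,
    "selects": 170,
    "deletes": 38,
    "syncs": 19,
}
OBSERVED_ACTIVE_USERS = 5400


def per_active_user_rates(ops=OBSERVED_OPS_PER_SEC, active_users=OBSERVED_ACTIVE_USERS):
    """Operations per second generated by one active user, per operation type."""
    return {name: rate / active_users for name, rate in ops.items()}


def project_load(target_active_users, ops=OBSERVED_OPS_PER_SEC,
                 active_users=OBSERVED_ACTIVE_USERS):
    """Project ops/sec at a different active-user count.

    ASSUMPTION: load scales linearly with active users.
    """
    scale = target_active_users / active_users
    return {name: rate * scale for name, rate in ops.items()}


def write_fraction(ops=OBSERVED_OPS_PER_SEC):
    """Share of DB statements (inserts, selects, deletes) that are writes."""
    writes = ops["inserts"] + ops["deletes"]
    total = writes + ops["selects"]
    return writes / total
```

With the numbers above, `write_fraction()` comes out slightly over one half, which is consistent with treating this as a write-heavy system.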



Engineering changes

Client

  1. Better backoff (mconnor)
  2. Eliminate known problems with race conditions (dan)
  3. Brainstorm other ideas for graceful degradation of service (all)
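
One common way to get better client backoff is exponential backoff with random jitter, so that clients that failed together do not all retry at the same moment. The sketch below is not the planned client change, just an illustration of the idea; the function name, the `base` and `cap` values, and the `server_hint` parameter are all assumptions.

```python
import random


def backoff_delay(attempt, base=1.0, cap=300.0, server_hint=0.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with each failed attempt up to `cap`, and the actual
    delay is picked uniformly below that ceiling ("full jitter").
    `server_hint` is a hypothetical minimum wait requested by the server;
    the client never retries sooner than that.
    """
    ceiling = min(cap, base * (2 ** attempt))
    jittered = rng.uniform(0, ceiling)
    return max(server_hint, jittered)
```

A client would call `backoff_delay(n)` after its n-th consecutive failure and reset the counter after a successful sync.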

Server

  1. The biggest challenge is that the process appears to be binary right now: the system is either fine or melting down, with no gradual increase in load that would indicate a problem needing attention soon. We can try to figure out where that tipping point is, but we also need to explore why the problem is so binary.
  2. Greater use of memcache on frequent small calls (notably /info/collections). We might also build it into /node/weave on the user side.
  3. Flatter DB structure, recognizing that this is currently a write-heavy system. Once the load looks more like a read-heavy system, we can extend the slave configuration to add read capacity.
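
The memcache idea for frequent small calls such as /info/collections can be sketched as a cache-aside lookup: check the cache first, fall back to the DB on a miss, and drop the cached entry whenever the user writes. This is a minimal illustration, not the server code. `TTLCache` is an in-process stand-in for a real memcache client, and `load_from_db`, the key format, and the 60-second TTL are assumptions.

```python
import time


class TTLCache:
    """In-process stand-in for memcache (assumption: production would use a real client)."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry is stale; drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, self._clock() + ttl)

    def delete(self, key):
        self._store.pop(key, None)


def _cache_key(user):
    # Hypothetical key format.
    return "info/collections:" + user


def get_info_collections(user, cache, load_from_db, ttl=60):
    """Return the user's collection info, hitting the DB only on a cache miss."""
    cached = cache.get(_cache_key(user))
    if cached is not None:
        return cached
    fresh = load_from_db(user)  # hypothetical DB lookup supplied by the caller
    cache.set(_cache_key(user), fresh, ttl)
    return fresh


def record_write(user, cache):
    """Invalidate the cached entry after any write to the user's collections."""
    cache.delete(_cache_key(user))
```

Because the system is write-heavy, the invalidation on write matters as much as the read path: a cached entry is only useful between two writes by the same user.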

Infrastructure changes

  1. Estimate hardware needed to support 1 million users at the new estimated runlevel, plus 25% extra capacity.
  2. Order any extra hardware needed asap.
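
The hardware estimate reduces to simple arithmetic once a safe users-per-cluster figure is known. The sketch below shows the calculation with the 25% extra capacity applied; the per-cluster figure in the usage note is a placeholder, since all we know so far is that 30K users on 1 master and 1 slave is too much.

```python
import math


def clusters_needed(total_users, users_per_cluster, headroom=0.25):
    """Number of master/slave clusters for `total_users`, including headroom.

    `users_per_cluster` is the load one cluster can carry safely; it has to
    come from the documented inflection points and is not known yet.
    """
    if users_per_cluster <= 0:
        raise ValueError("users_per_cluster must be positive")
    base = total_users / users_per_cluster
    # Add the extra capacity, then round up to whole clusters.
    return math.ceil(base * (1 + headroom))
```

For example, with a hypothetical safe load of 20,000 users per cluster, `clusters_needed(1_000_000, 20_000)` gives 63 clusters (50 for the base load, rounded up after adding 25%).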


Operational readiness

  1. Document/update the current Weave cluster configuration. Be sure this includes all pertinent configuration information as well (for example, /etc/hosts file hacks that are downright scary).
  2. Create operations runbook.
  3. Committed owners/backups for key pieces of infrastructure.


Other

  1. Get staging environment set up and ready to go.
  2. Get extra capacity for 1.0 Beta and beyond based on #1 above.
  3. Train rest of IT during IT onsite (Oct?)
  4. Plan out h/w requirements through 1H10.
  5. Set up Cassandra to get next gen DB investigations going.