Labs/Weave/Service/Scaling/Projects/Aragog
This project is the first in a series of projects aimed at scaling services.
Its scope is to deliver a production-quality Sync 1.0 service capable of handling at least 1 million users.
At a high level, this means we need to do several things in the following categories (lists are not prioritized):
Server scalability model
- Update the load profile based on data from the server meltdown.
From a quick back-of-the-envelope calculation, our loads currently look like:
- 150 inserts/sec
- 170 selects/sec
- 38 deletes/sec
- 19 syncs/sec
- 5400 active users at a given time
We also know that 30K users on a cluster of 1 master and 1 slave is too many.
- Document optimal cluster configuration.
- Document inflection points as best as possible based on above numbers.
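As a starting point for the load profile, the observed figures above can be turned into per-active-user rates and projected to other concurrency levels. This is only a minimal sketch: the function names are hypothetical, and the linear-scaling assumption is exactly what the inflection-point work is meant to test, so treat the projections as a rough baseline rather than a prediction.

```python
# Observed figures copied from the list above (operations per second).
OBSERVED_RATES = {"inserts": 150, "selects": 170, "deletes": 38, "syncs": 19}
ACTIVE_USERS = 5400  # users active at a given time when the rates were observed


def per_active_user(rates=OBSERVED_RATES, active=ACTIVE_USERS):
    """Return each rate divided by the number of active users."""
    return {op: rate / active for op, rate in rates.items()}


def project(active_users, rates=OBSERVED_RATES, active=ACTIVE_USERS):
    """Project rates to another concurrency level.

    ASSUMPTION: load grows linearly with the number of active users.
    """
    unit = per_active_user(rates, active)
    return {op: r * active_users for op, r in unit.items()}


def write_fraction(rates=OBSERVED_RATES):
    """Share of DB operations (inserts, deletes, selects) that are writes."""
    writes = rates["inserts"] + rates["deletes"]
    return writes / (writes + rates["selects"])


if __name__ == "__main__":
    print("write fraction:", round(write_fraction(), 3))
    print("projected at 2x concurrency:", project(2 * ACTIVE_USERS))
```

With the numbers above, inserts and deletes make up a little over half of the database operations (188 of 358), which is consistent with treating the system as write-heavy.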
Engineering changes
Client
- Better backoff (mconnor)
- Eliminate known problems with race conditions (dan)
- Brainstorm other ideas for graceful degradation of service (all)
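The page does not say what "better backoff" will look like. One common option is capped exponential backoff with random jitter, sketched below. The function name, the 60-second base, and the one-hour cap are all assumptions for illustration, not values taken from the client.

```python
import random


def backoff_delay(attempt, base=60.0, cap=3600.0, rng=random.random):
    """Return the number of seconds to wait before retry number `attempt`.

    attempt: 0 for the first retry, 1 for the second, and so on.
    base, cap: hypothetical tuning values (seconds), not from the client.
    rng: callable returning a float in [0, 1); injectable for testing.
    """
    # The ceiling doubles on every failed attempt until it reaches the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # "Full jitter": pick a random point below the ceiling so that clients
    # that failed at the same moment do not all retry at the same moment.
    return rng() * ceiling


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, round(backoff_delay(attempt), 1))
```

The jitter matters as much as the doubling: without it, clients that were rejected together tend to come back together, which could contribute to the all-or-nothing behavior described for the server.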
Server
- The biggest challenge is that the failure mode appears to be binary right now: the system is either fine or melting down, with no gradual degradation to indicate a problem that needs attention soon. We can try to figure out where that tipping point is, but we also need to explore why the problem is so binary.
- Greater use of memcache on frequent small calls (notably /info/collections). We might also build it into /node/weave on the user side.
- Flatter DB structure, recognizing that this is currently a write-heavy system. Once the load shifts toward a more read-heavy profile, we can extend the slave configuration to serve more reads.
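The memcache idea for /info/collections can be sketched as a read-through cache that is invalidated on write. Everything named here is hypothetical: `InMemoryCache` is a stand-in so the example runs without a memcache server, `load_from_db` represents whatever query the server runs today, and the key format and 300-second TTL are illustrative guesses.

```python
import time


class InMemoryCache:
    """Stand-in for a memcache client (get/set/delete with a TTL)."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.time() >= expires_at:
            del self._items[key]  # expired entries behave like misses
            return None
        return value

    def set(self, key, value, ttl):
        self._items[key] = (value, time.time() + ttl)

    def delete(self, key):
        self._items.pop(key, None)


def cache_key(user):
    return "info_collections:" + user  # hypothetical key format


def get_info_collections(user, cache, load_from_db, ttl=300):
    """Serve from the cache when possible; fall back to the DB on a miss."""
    cached = cache.get(cache_key(user))
    if cached is not None:
        return cached
    fresh = load_from_db(user)
    cache.set(cache_key(user), fresh, ttl)
    return fresh


def on_collection_write(user, cache):
    """Drop the cached entry so the next read sees the new timestamps."""
    cache.delete(cache_key(user))


if __name__ == "__main__":
    cache = InMemoryCache()
    fake_db = lambda user: {"bookmarks": 1254000000.0}  # placeholder data
    print(get_info_collections("alice", cache, fake_db))
```

Invalidating on write keeps the cached answer from going stale in a write-heavy system, at the cost of a lower hit rate for users who write often. Whether that trade-off pays off is something the load data would have to show.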
Infrastructure changes
- Estimate the hardware needed to support 1 million users at the new estimated run level, plus 25% extra capacity.
- Order any extra hardware needed ASAP.
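The hardware estimate comes down to simple arithmetic once a safe users-per-cluster figure is known. The sketch below shows the calculation. The function name is hypothetical, and `users_per_cluster` is an input that still has to be determined; the only thing this page establishes is that 30K users on one master/slave pair is too many.

```python
import math


def clusters_needed(total_users, users_per_cluster, headroom=0.25):
    """Number of master/slave pairs needed, including extra capacity.

    users_per_cluster: ASSUMPTION supplied by the caller; the safe value
    is not yet known, only that 30,000 is too high.
    headroom: fraction of extra capacity to add (0.25 means 25%).
    """
    if users_per_cluster <= 0:
        raise ValueError("users_per_cluster must be positive")
    return math.ceil(total_users * (1 + headroom) / users_per_cluster)


if __name__ == "__main__":
    # 30,000 per pair is known to be too many, so this is a lower bound.
    print("lower bound:", clusters_needed(1_000_000, 30_000))
    # 20,000 is a purely illustrative guess at a safer figure.
    print("at 20K per pair:", clusters_needed(1_000_000, 20_000))
```

Because 30K users per pair is already too many, the first figure is a floor on the number of pairs rather than a target; the real number depends on where the inflection point turns out to be.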
Operational readiness
- Document/update the current Weave cluster configuration. Be sure this includes all pertinent configuration information as well (for example, the /etc/hosts file hacks that are downright scary).
- Create an operations runbook.
- Assign committed owners and backups for key pieces of infrastructure.
Other
- Get the staging environment set up and ready to go.
- Get extra capacity for 1.0 Beta and beyond based on #1 above.
- Train the rest of IT during the IT onsite (Oct?)
- Plan out hardware requirements through 1H10.
- Set up Cassandra to get next-gen DB investigations going.