Labs/Weave/Service/Scaling/Projects/Aragog

This project is the first in a series aimed at scaling services.

The scope of this project is to deliver a production-quality Sync 1.0 service capable of handling at least 1 million users.

At a high level, this means we need to do several things in the following categories (lists not prioritized):

Server scalability model

From a quick back of the envelope calculation, our loads currently look like:

We also know that 30K users on 1 master and 1 slave is too much.

The biggest challenge is that the system's behavior appears to be binary right now: it is either fine or melting down, with no gradual increase in load that would indicate a problem needing attention soon. We can try to figure out where that tipping point is, but we also need to explore why the problem is so binary.
Greater use of memcache on frequent small calls (notably /info/collections). We might also build it into /node/weave on the user side.
Flatter DB structure, recognizing that this is currently a write-heavy system. Once the workload looks more like a read-heavy system, we can extend the slave configuration to serve reads.
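As a rough illustration of the memcache idea above, here is a minimal cache-aside sketch for /info/collections. It is not the Weave server's code: TTLCache is an in-process stand-in for memcache, and CollectionInfo, load_from_db, the key format, and the 300-second TTL are all hypothetical names and values chosen for the example.

```python
import time


class TTLCache:
    """In-process stand-in for memcache (assumption: get/set/delete with expiry)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry has expired: drop it and report a miss.
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def delete(self, key):
        self._data.pop(key, None)


class CollectionInfo:
    """Cache-aside wrapper around a hypothetical per-user DB lookup."""

    def __init__(self, cache, load_from_db, ttl=300):
        self.cache = cache
        self.load_from_db = load_from_db  # hypothetical callable: user -> dict
        self.ttl = ttl
        self.db_reads = 0  # counts how often we fall through to the DB

    def _key(self, user):
        return "info/collections:" + user

    def get(self, user):
        cached = self.cache.get(self._key(user))
        if cached is not None:
            return cached
        # Cache miss: read from the DB once, then store the result.
        self.db_reads += 1
        value = self.load_from_db(user)
        self.cache.set(self._key(user), value, self.ttl)
        return value

    def on_write(self, user):
        # Invalidate after any write so the next read sees fresh data.
        self.cache.delete(self._key(user))
```

In this sketch, repeated reads for the same user hit the DB only once until a write invalidates the entry. Because the system is write-heavy, the invalidation in on_write matters as much as the read path.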

Estimate the hardware needed to support 1 million users at the new estimated runlevel, then add 25% extra capacity.
Order any extra hardware needed ASAP.
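The hardware estimate above can be sketched as simple arithmetic. The per-pair capacity below (20,000 users per master/slave pair) is purely an assumption for illustration; the only figure from these notes is that 30K users per pair is too much. The function name and parameters are hypothetical.

```python
def pairs_needed(total_users, users_per_pair, headroom_percent=25):
    """Return master/slave pairs needed, including extra capacity.

    users_per_pair is an assumed sustainable load per pair, to be
    replaced with the measured runlevel once it is known.
    """
    if total_users < 0 or users_per_pair <= 0 or headroom_percent < 0:
        raise ValueError("inputs must be non-negative, users_per_pair positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    base = (total_users + users_per_pair - 1) // users_per_pair
    return (base * (100 + headroom_percent) + 99) // 100


# Assumed example: 1 million users at 20,000 users per pair.
pairs = pairs_needed(1_000_000, 20_000)
machines = pairs * 2  # one master plus one slave per pair
print(pairs, machines)  # prints "63 126"
```

With these assumed inputs, the base requirement is 50 pairs, and the 25% extra capacity rounds up to 63 pairs. The real number depends entirely on the runlevel measured after the scalability work.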

Document/update the current Weave cluster configuration. Be sure this includes all pertinent configuration information as well (for example, /etc/hosts file hacks that are downright scary).
Create an operations runbook.
Assign committed owners/backups for key pieces of infrastructure.