Labs/Weave/Service/Scaling


As the number of our services increases and user adoption grows, we need to pay close attention to scaling. Our architecture, operational procedures, infrastructure, communications, etc. all have an impact on scaling, and vice versa.

Note that most of the points below are written with the Sync service as the example, because that is our current focus as we work towards 1.0. In the near future, we will make them more generic so that they apply to all Mozilla services.


High level service principles/requirements

The following is a list of high level service principles/requirements that we should use to guide our decisions:

  • The Sync service should be available pretty much all the time to existing users, barring extreme/catastrophic failures (colo power outage, a severed undersea cable link, etc.). Note: this does not mean the service must be up 100% of the time; we will always have downtime for maintenance and similar work.
  • Under extreme duress, service should degrade gracefully. This means, for example:
    • It's perfectly acceptable to slow (or perhaps even shut) down new user registrations before existing users start noticing performance issues.
    • It's acceptable for performance to slow down a bit as long as we don't lose user data.
    • Fast is better than slow. Slow is better than closed. (via http://alex.dojotoolkit.org/2009/08/some-orthodox-heresies/)
  • Pushing out updates for security fixes >> Pushing out updates for new features
  • Be prepared to deal with X% increase in load at short notice (a few hours to less than a day)

It is important to note that these principles are deliberately flat and have no strict hierarchy/prioritization, because our priorities may change with the scenario. For example, in the case of a traffic spike, we might first try slowing or shutting down new user registrations. But in the case of a catastrophic failure, we don't necessarily have that option.
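The "shed new registrations before existing users suffer" idea can be sketched as a small admission policy. This is a minimal illustration only: the mode names, the `load` measure (fraction of capacity in use), and both thresholds are assumptions made up for the example, not values from the Weave cluster.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    THROTTLE_SIGNUPS = "throttle_signups"
    CLOSE_SIGNUPS = "close_signups"


# Hypothetical thresholds, expressed as a fraction of capacity in use.
THROTTLE_AT = 0.75
CLOSE_AT = 0.90


def choose_mode(load: float) -> Mode:
    """Pick a degradation mode from the current load."""
    if load >= CLOSE_AT:
        return Mode.CLOSE_SIGNUPS
    if load >= THROTTLE_AT:
        return Mode.THROTTLE_SIGNUPS
    return Mode.NORMAL


def admit(request_kind: str, mode: Mode, signup_slot_free: bool = True) -> bool:
    """Decide whether to serve a request.

    Requests from existing users (anything other than "register") are
    always admitted; only new registrations are slowed or refused.
    """
    if request_kind != "register":
        return True
    if mode is Mode.NORMAL:
        return True
    if mode is Mode.THROTTLE_SIGNUPS:
        # Caller supplies whether a rate-limited signup slot is available.
        return signup_slot_free
    return False  # CLOSE_SIGNUPS: registrations are turned away
```

In this sketch a sync request is never refused by the policy itself, which matches the principle that existing users come first. It does not cover the catastrophic-failure case, where shedding registrations is not an available lever.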

Engineering considerations

In the past, we have started documenting the architecture of the Weave cluster as well as putting together some rough numbers to help with load profiling and planning.

That is available at https://intranet.mozilla.org/Labs/Weave/Cluster_Scalability

Beyond that, the following items need to be taken into account.

These should be considered for the system as a whole and not just as independent client/server issues.

  • How do we handle bursts in traffic that may be due to:
    • A sharp increase in new users resulting in a ton of data.
    • A sharp increase in incoming data due to other reasons (we release a new feature, we migrate user data causing full uploads, etc.)
  • How do we deal with replication delays that degrade performance and trigger other cascading system failures?
  • How do we deal with race conditions that may result in a flood of data into the system, sync failures, etc.?
  • What are the indicators that we need to key off of to realize we are getting into the danger zone? What are our current limits?
    • For example: when replication delays start getting to X seconds, we need to do Y.
    • Another example: when # of writes/sec on master starts getting to X/sec, we need to do Y.
  • What extra capacity should we have ready to go at short notice?
  • What indicators do we have to let users know when the system may be under stress, so we can set expectations that they may see slight performance degradation?
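The "when indicator reaches X, do Y" examples above can be written down as a table of rules that is checked against current metrics. The sketch below is only an illustration: the metric names, the limits, and the actions are hypothetical placeholders for the X and Y values that still need to be determined.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    metric: str   # name of the indicator being watched
    limit: float  # the "X" at which we react
    action: str   # the "Y" we do when the limit is reached


# Hypothetical limits and actions; real values must come from load profiling.
RULES = [
    Rule("replication_delay_s", 30.0, "pause new registrations"),
    Rule("master_writes_per_s", 2000.0, "bring up spare capacity"),
]


def actions_due(metrics: Dict[str, float], rules: List[Rule] = RULES) -> List[str]:
    """Return the action for every rule whose metric has reached its limit.

    Metrics missing from the input are treated as 0, i.e. not in the danger zone.
    """
    return [r.action for r in rules if metrics.get(r.metric, 0.0) >= r.limit]
```

For instance, `actions_due({"replication_delay_s": 45.0, "master_writes_per_s": 800.0})` would flag only the replication rule. Keeping the rules as data makes the current limits easy to review and adjust as we learn more.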


Operational considerations

In addition to the above engineering factors, we also need better operational readiness processes and procedures.

  • Development/release
    • Soft code freeze (at least 1 week before target release date)
    • More testing, especially before releases that make major changes (like 0.6)
    • Release on a Tues/Wed/Thurs (we already do this).
  • Release planning
    • For each release, clearly communicate and plan for the impact on both the client and the server side:
      • Impact on server (for example, will increase queries/sec by a factor of X)
      • Impact on client (for example, response time for complex query Y will increase by a factor of Z)
    • Additionally, we should also figure out if there is any major impact on user data and whether a given release will require some downtime
    • Use WEPs to document any operational issues that might arise (for example, "implementing this WEP is likely to increase the number of queries from X/user to Y/user").
  • Operations
    • Weave Operations is handled by Mozilla IT.
    • Service impacting issues should be filed as blocker bugs
    • oncall@mozilla.org should be notified of any other issues
  • Improve back-up knowledge/access to key systems
    • Our bus number on infrastructure used to be 1; Mozilla IT now provides 24x7 coverage
    • Currently our bus number on server code/fixes is 1 (toby)
      • Need a plan to migrate to webdev
  • Handling downtime
    • Scheduled downtimes >> unscheduled downtimes
      • Published downtime windows: Tuesdays & Thursdays, 7pm - 11pm Pacific
    • Communicate widely before and after with appropriate details
      • Mozilla IT blogs and emails several newsgroups
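As a rough illustration of the release-planning items above (stating server impact as "queries/sec increases by a factor of X" and WEP impact as "queries go from X/user to Y/user"), the arithmetic might look like the sketch below. The function names are hypothetical, and it assumes queries are spread evenly over the day, which real traffic will not be; treat the result as a starting estimate, not a peak-load figure.

```python
SECONDS_PER_DAY = 86_400


def queries_per_second(active_users: int, queries_per_user_per_day: float) -> float:
    """Average server query rate, assuming load is spread evenly over a day."""
    return active_users * queries_per_user_per_day / SECONDS_PER_DAY


def release_impact(active_users: int, before: float, after: float) -> dict:
    """Compare the average query rate before and after a release.

    `before` and `after` are queries per user per day (the X/user and
    Y/user figures a WEP would be expected to document).
    """
    old_qps = queries_per_second(active_users, before)
    new_qps = queries_per_second(active_users, after)
    return {"old_qps": old_qps, "new_qps": new_qps, "factor": new_qps / old_qps}
```

With made-up inputs of 86,400 active users going from 10 to 25 queries per user per day, this gives an average of 10 queries/sec before, 25 after, and a factor of 2.5, which is the kind of number a release announcement could state up front.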