Services/Server/Deployments/Sync/1.5

From MozillaWiki

MAIN FEATURES

Constants served out of memcache - lets us change constants quickly and indirectly, including taking nodes down and backing users off from them.

Time-to-live - records will (optionally) have a ttl value, after which they won't be returned and we can safely clean them out of the system.

Quotas - we'll start actively checking and enforcing quotas, and add an API call that returns usage data for each of your collections.
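As a rough illustration of the TTL and quota behaviour described above, here is a minimal Python sketch. It is not the server code: the record layout (`collection`, `payload`, `ttl`), the function names, and the choice to store ttl as an absolute Unix expiry time are all assumptions made for the example.

```python
import time

# Hypothetical record layout: {"collection": str, "payload": str, "ttl": int or None}.
# Assumption: ttl holds an absolute Unix expiry time; None means "never expires".

def visible_records(records, now=None):
    """Return only records that have no ttl or whose ttl is still in the future."""
    now = time.time() if now is None else now
    return [r for r in records if r.get("ttl") is None or r["ttl"] > now]

def usage_by_collection(records):
    """Sum payload sizes in bytes for each collection."""
    usage = {}
    for r in records:
        size = len(r["payload"].encode("utf-8"))
        usage[r["collection"]] = usage.get(r["collection"], 0) + size
    return usage

def over_quota(records, quota_bytes):
    """True when the total stored payload size exceeds the quota."""
    return sum(usage_by_collection(records).values()) > quota_bytes
```

Whether expired records should still count against a quota until they are cleaned out is a design decision this sketch leaves open.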


TRANSITION STEPS

alter available_nodes table on sreg

The available_nodes table takes on a larger role in the new system, as described by the attached diagram. The changes are documented in Bug 599133, and can be done at any time.

Rollback plan: In the unlikely event that this causes problems, we should really fix the code that needed specific column order (and take me out back and shoot me). Failing that, simply drop the new columns. Note that an error here will only prevent new users from getting a node.

bug 605713

alter all user data tables

We're adding a ttl field to the user data. This means that all the user tables will need to be altered to add ttl int (and I believe that jv wants to remove an index, which can be done at the same time).

The tables can be updated serially at any time, and we should plan to kick this off soon, since it'll be a lengthy process to do all of them. There should be almost no downtime while doing this, but we should test on a similar table first.
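A serial pass over the user tables might look roughly like the sketch below. It uses Python's sqlite3 module purely as a stand-in for the production database, and the table names are hypothetical. The real change and any index removal are tracked in the bug, not here.

```python
import sqlite3

def add_ttl_column(conn, tables):
    """Add a nullable integer ttl column to each table, one table at a time.

    Table names are assumed to come from a trusted list, since they are
    interpolated directly into the SQL.
    """
    for table in tables:
        # Column name is the second field of each table_info row.
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        if "ttl" not in cols:  # skip tables already altered, so reruns are safe
            conn.execute(f"ALTER TABLE {table} ADD COLUMN ttl INTEGER")
    conn.commit()
```

Skipping tables that already have the column means an interrupted run can simply be restarted.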

Rollback plan: Same as altering available nodes.

bug 604132

tag release branch

This'll be the first 1.5.date tag.

Rollback plan: none. If there's a problem here, we'll just tag a new branch.

bug 605703

hg move

We'll be moving the 1.0 directory lock, stock and barrel to 1.1 in hg. This will briefly break any users who are pulling from tip rather than stable branch, but that shouldn't be a big issue.

Rollback plan: hg rollback if there's a problem. Should not be user-facing unless someone happens to pull the code while we're fixing.

bug 605702

add apache path

We'll need to change the paths for apache. There are a couple of options here. We could do it in Zeus, but the path of least resistance would seem to be to fix the apache alias for 1.0 to point at 1.1 and add a second alias for the 1.1 path, pointing to the same script. We need to consult with ops on the plan.

Rollback plan: Restore the old apache config.

bug 605714

cleanup script

Because quotas will now be enabled, we need to run the current cleanup script before launch, and on a regular basis until we're satisfied that enough people have moved off of 1.0 and are using ttl fields correctly. This script should obey ttl logic, too.

Rollback plan: N/A. If this doesn't work, then the script needs fixing.

bug 604134 for TTL-based cleanup; bug 593191 for old-style cleanup

ttl script

We also need to create and test a script to purge records with expired ttl values from the db nightly. This is a simple adaptation of the cleanup script, and both will need to be run for a little while.
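The core of such a nightly purge could be as small as the sketch below. It again uses sqlite3 as a stand-in database, with a hypothetical table name passed in by the caller and the assumption that ttl is stored as an absolute Unix expiry time. The actual script is the subject of the bug.

```python
import sqlite3
import time

def purge_expired(conn, table, now=None):
    """Delete rows whose ttl has passed and return how many were removed.

    Rows with a NULL ttl never expire and are left alone. The table name is
    assumed to be trusted, since it is interpolated into the SQL.
    """
    now = int(time.time()) if now is None else now
    cur = conn.execute(
        f"DELETE FROM {table} WHERE ttl IS NOT NULL AND ttl < ?", (now,)
    )
    conn.commit()
    return cur.rowcount
```

Returning the row count gives the cron job something to log, which helps when judging how long both scripts need to keep running.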

Rollback plan: N/A. This is here to note that we need to build it.

bug 605717

security and ops review of memcache constants ecosystem

The new memcache constants require an exposed API to the admin table, as well as a webpage to talk to that API and crons to push the data to the sync servers. This is all new infrastructure, and since it involves some sensitive data, it should be reviewed by security. Operations will need to decide where and how they want to deploy it all.
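To make the data flow concrete for reviewers, here is a toy sketch of the push-and-read pattern. A plain dict stands in for the memcache client, and the key layout (`const:`, `down:`, `backoff:`) and function names are invented for illustration. The real API, webpage, and crons are what this review step covers.

```python
def push_constants(cache, admin_rows, prefix="const:"):
    """Copy (name, value) rows from the admin table into the cache.

    `cache` is any dict-like object; in production this would be a memcache
    client, which is an assumption of this sketch.
    """
    for name, value in admin_rows:
        cache[prefix + name] = value

def node_state(cache, node, prefix="const:"):
    """Return 'down', 'backoff', or 'ok' for a node based on cached constants."""
    if cache.get(prefix + "down:" + node):
        return "down"
    if cache.get(prefix + "backoff:" + node):
        return "backoff"
    return "ok"
```

Because the sync servers only read from the cache, a bad push can be corrected by pushing again, without touching the servers themselves.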

Rollback plan: N/A

bug 605719

load tests

The new server needs to be pushed to stage for load testing. While we believe that the changes should have only a small impact on the system, load tests will show us if there's anything degenerate we need to worry about.

Rollback plan: N/A

bug 605721

functional test plan and qa signoff

We'll need to define ways for QA to test these new components of the system. Memcache constants (downing and backing off a node) can be tested in coordination with ops. TTL can be tested by setting a very small window (which may require a special client change), and quotas can be tested by setting the limit very low on the stage servers.

Rollback plan: N/A

bug 605723

TIMELINE (draft)

  • Engineering
    • Already code complete on sync/admin DB side, will complete branching and tagging on 10/20 (Toby)
    • Remaining bit is cleanup script for TTL, should be ready by end of week (10/22) (Toby)
  • Operations
    • Changes staged and initial load tests running by the end of this week (10/22)
    • Security review complete by the end of next week (10/29)
    • Deployment plan complete by the end of next week (10/29)
  • Quality
    • Test plan complete by the end of this week (10/22) (Toby)
    • QA signoff by the end of next week (10/29) (mconnor to arrange)