CloudServices/Loop/Deploy

The Loop service is the server-side component of the Loop project.

It's composed of two parts:

  • loop-server, the Node.js service at https://github.com/mozilla/loop-server
  • loop-client, the "client" part, composed of static files served by our server

Both parts are currently deployed to and served by the same server instances.


See: https://wiki.mozilla.org/Loop

Contacts

  • Dev Team
    • loop-server
      • Tarek Ziadé <tarek@mozilla.com>
      • Alexis Metaireau <alexis@mozilla.com>
      • Rémy Hubscher <natim@mozilla.com>
    • loop-client (main devs)
      • Mark Banner <mbanner@mozilla.com>
      • Dan Mosedale <dmose@mozilla.com>
  • OPS
    • Benson Wong <mostlygeek@mozilla.com>
    • Bob Michelleto <bobm@mozilla.com>
  • QA
    • Richard Pappalardo <rpappalardo@mozilla.com> (Primary)
    • Karl Thiessen <kthiessen@mozilla.com> (Backup)

Deployment

There are three deployed environments. A fourth will be deployed later.


Dev

  • Host: https://loop-dev.stage.mozaws.net
  • Maintainer: DEVs
  • Tokbox mocked? NO.
  • Usage: Development and integration
  • Updates:
    • loop-client is updated automatically every hour and should match the latest master of the repository
    • loop-server is updated with the master branch by devs on a regular basis, or upon request. You can get the deployed version by displaying the root URL of the server (see the example below).
  • Access: ec2-user@loop.dev.mozaws.net
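
A minimal way to check that from the command line (sketch only; the root URL returns a small JSON document, but its exact fields, such as "version", are an assumption here, so rely on the actual response):

  # Ask the Dev server which loop-server version is currently deployed
  curl -s https://loop-dev.stage.mozaws.net/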

To use this environment in your browser, go to about:config, change the value of the loop.server preference to https://loop-dev.stage.mozaws.net, and then restart your browser.
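
If you prefer to script that, the same preference can be set from a shell by appending a user_pref line to your Firefox profile's user.js; it is picked up on the next restart (the profile path below is a placeholder, adjust it to your own profile):

  # Point the browser at the Dev environment (back up your profile first)
  echo 'user_pref("loop.server", "https://loop-dev.stage.mozaws.net");' >> ~/.mozilla/firefox/<profile>/user.js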

This environment can be used to test end-to-end scenarios until the service hits the Stable channel.

Load-Stage

  • Host: https://loop.stage.mozaws.net/
  • Maintainer: OPS
  • Tokbox mocked? Yes (but this can change; check the / endpoint for more info)
  • Usage: Server-side QA and Loadtesting with the mock server


This environment is used by QA and dev for load tests. The goal is to measure how many connections the server can handle and to anticipate errors that might happen under high load.

The Tokbox-created keys are not real ones; they are collected by a mock server, which we deployed at http://loop-delayed.dev.mozaws.net/

Real-Stage

  • Host: https://loop.stage.mozaws.net/
  • Maintainer: OPS
  • Tokbox mocked? No
  • Usage: Client-side QA, Server-side QA and Loadtesting with a live third-party/partner server

Not yet commissioned. This environment will be used for end-to-end testing of the service once it hits the stable channel.

This server will be a perfect mirror of the production environment, updated with the tag of the upcoming release.

  • (jabonacci) Correct me if I am wrong, but we already have this. We are using a configurable Stage environment that can point either to a mock server (previous section) or to a live server, so we are able to do end-to-end testing. The host name is the same; the configuration is defined in a file on the server (see the snippet below):
    • /data/loop-server/config/settings.json
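
A quick way to check how a given Stage box is currently wired (sketch only; the exact contents of the / response and of settings.json are not documented here, so treat what you actually see as authoritative):

  # The / endpoint reports whether Tokbox is mocked (see the Load-Stage notes above)
  curl -s https://loop.stage.mozaws.net/

  # Once logged in on the server, inspect the configuration file itself
  cat /data/loop-server/config/settings.json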

Production

This environment is used for production and is the default server for Nightly.

The prod environment provides mobile number validation for the following countries:

Releasing loop-server

Ops procedure

Releasing loop-client

Please see Loop/Loop-client_Release_Process for the deployment and release process details.

Release Cycle

The service is continuously pushed to the dev server, where client developers can test it.

The service is released to load-stage and then to production every other week (or as soon as possible if we discover a security issue).

  • Tuesday: end of the previous cycle; tagging; push to load-stage
  • Tuesday through Friday: load testing by James on load-stage
  • Monday: push to production if there are no regressions; if there is any regression, the release is backed out


The Tuesday release will be announced to the loop mailing list when the tagging happens, so everyone has a chance to try it out.

Theoretical dates for July/August:

  • July 8th - tagging
  • July 14th - prod push
  • July 22nd - tagging
  • July 28th - prod push
  • August 5th - tagging
  • August 11th - prod push
  • August 19th - tagging
  • August 25th - prod push
  • and so forth...

Once the service hits the Stable channel, we will introduce the new real-stage environment.


Branches and bugfix deployments

In case of a bugfix:

  • A commit with the fix will be pushed to master.
  • A new branch will be created in the GitHub repository for each version that needs the patch, and the fixes will be applied (backported) there.
  • A new tag will be created with the new version (the patch version will be updated) and a deployment request will be filed.

For instance, if the 0.9.0 release contains a bug that needs to be fixed (see the sketch after this list):

  1. Fix the code in master;
  2. Backport (cherry-pick) the commit to the 0.9.x branch (create it if needed);
  3. Tag a new patch release, 0.9.1, and file a new deployment request.
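
A minimal sketch of that flow with git, assuming the fix has already landed on master as commit <sha> and that the remote is named origin:

  # Make sure master (which contains the fix) is up to date locally
  git checkout master
  git pull origin master

  # Create the 0.9.x maintenance branch from the 0.9.0 tag if it does not exist yet
  git checkout -b 0.9.x 0.9.0

  # Backport (cherry-pick) the fix and tag the new patch release
  git cherry-pick <sha>
  git tag 0.9.1

  # Push the branch and the tag, then file the deployment request
  git push origin 0.9.x 0.9.1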

Deployment Versioning

Loop-server is backward compatible and uses versioned routes (e.g. /v1, /v2) to segment API changes. Currently this is controlled by loop-server itself and does not offer the ability to update the /v2 route code in isolation from /v1. In other words, Fx34 browsers using the /v1 route will see a production code change because of a bug fix in the Fx35 /v2 API. Here's a breakdown of the different approaches:

Option A: /v1 and /v2 routes point to different server clusters:

  • Pros:
    • Fx34 and /v1 users are not affected by any code change made for /v2 API users.
    • Low risk of injecting bugs
  • Cons:
    • Ops has to maintain two server clusters (dev/stage/prod)
    • Both servers need to use the same database... this gets tricky.
    • Mo' computers mo' problems (complexity).

Option B: /v1 and /v2 routes are hosted on a single server:

  • Pros:
    • Simpler; fewer resources used and needed
    • Easier to use a single database
  • Cons:
    • More risk when pushing code for /v2, since /v1 code will be affected.

For Fx 34/35, we are choosing Option B, until we reach a point where we have a history of injecting bugs due to this architecture.

Deploying flow

See full version at: https://old.etherpad-mozilla.org/deploy-release-process

How does a release get to production?

  • QA/DEV creates a stage deployment ticket and adds dependencies and blockers
    • (e.g. "Loop — Please deploy loop-server 0.13.0 to Stage")
  • DEV makes a tag
    • Here we should try to make sure that the changelog has all Resolved/Fixed bugs going into this release.
  • OPS deploys the build to stage
  • QA validates that the release was deployed to stage by OPS
  • OPS sets the stage bug to Fixed as soon as it is deployed (yes, this is fine)
  • QA runs verification steps, quick tests, and loadtests
    • (after having set a window with partners)
  • QA sets the bug to Verified as soon as it is OK to deploy to Production
  • QA creates a deployment bug for production and adds dependencies and blockers
  • OPS deploys the release to production and sets the bug to Resolved/Fixed
  • QA sets the bug to Verified as soon as the deployment has been verified.
    • (This may include verification by the Loop client and QA teams)
  • OPS should be monitoring the release for a specific period of time
    • (to watch out for unforeseen issues and side-effects)

What do we do in the following cases?

  • A bug is found during the stage validation
    • DEV fixes the issue, makes a new patch release (0.13.1), and creates a new deployment request bug (e.g. "Loop — Please deploy loop-server 0.13.1 to Stage")
      • (Do we morph the existing one? jbonacci said no the last time I did.) So we close the current deployment bug and create a new one. OK.
    • OPS fixes the issue and makes a new patch release or a re-release of the same build. We have had circumstances where the change is OPS-specific, not DEV-specific.
    • QA closes the previous stage ticket as Invalid and the story restarts with the new bug
      • I am pondering this idea for minor vs. major releases. On the one hand, having a history in the ticket (12.0, 12.1, 12.2) is good. On the other hand, the ticket can get too large (see Loop-Server 12.2)...
  • A bug is found in production
    • DEV fixes the issue and makes a new patch release from the production release (e.g. 0.12.3)
    • DEV creates a stage bug (e.g. "Loop — Please deploy loop-server 0.12.3 to Stage"). Well, QA should create the Stage ticket with information gathered from Dev. But either way works for me...
      • Then it's the same story as for a usual release

Who gives the green light when prod is ready to be updated?

For instance, lately we had a bug in production that appeared even though stage validation had been passed by QA. In this case, it's a bit tricky to know whether we should deploy to production or not. In order to avoid things going wrong, should we wait for QA to give the green light again before pushing something new to production? Consider that this can block the resolution of a problem.

  • As soon as the stage ticket has been verified and the production bug has been created
    • Then OPS have a QA green light and can start the deployment.
    • Right. And issues specific to Production are a special case anyway. If tests pass in Stage but something goes wrong in Production, then we need to add the fix to both. If there is a Production-specific issue (that we would never see in Stage), then we should approach it on a case-by-case basis. There are cases where we have had to push something special/specific/urgent/break-fix for other Production environments. It's not something we should consider "normal procedure" though, because it requires Stage and Prod to be out of sync.
  • There is the idea of a code change that always needs to go through this process. This is DEV driven.
  • Then, there is the idea of a service-level change that always needs to go through this process. This should be OPS driven.
  • Then, sometimes we have a real emergency in Production that requires a change (DEV or OPS). We have not always been good about the process for this case.

Examples:

  1. The service is broken and needs a code change
  2. Server issues like stack size, cpu/memory/disk issues, config issues, DB issues