Balrog


If you are looking for the general documentation that used to live here, it has been moved into the Balrog repository, and a built version of it is available on Read The Docs.

This page will continue to host information about Balrog that doesn't make sense to put into the repository, such as meeting notes and things related to our hosted versions of Balrog.

Infrastructure

Domains

The Balrog admin interface is accessible at aus4-admin.mozilla.org (VPN required).

The public interface that Firefox and other applications talk to is at aus5.mozilla.org. Many older update domains are also served by Balrog, including aus4.mozilla.org, aus3.mozilla.org, and aus2.mozilla.org. More details on these can be found on the Client Domains page.
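
For reference, the public interface answers versioned update URLs. A quick way to see what clients receive is to request the update XML directly; this is a sketch, and the version, build ID, and build target values below are illustrative placeholders, not a real query:

# Request an update the way a client would; all path values here are placeholders.
curl 'https://aus5.mozilla.org/update/3/Firefox/60.0/20180516032328/WINNT_x86_64-msvc/en-US/release/default/default/default/update.xml'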

Support & Escalation

If the issue may be visible to users, please make sure #moc is also notified. They can also assist with the notifications below.

RelEng is the first point of contact for issues. To contact them, follow the standard RelEng escalation path.

If RelEng is unable to correct the issue, they may escalate to CloudOps.

Monitoring & Metrics

Metrics from RDS, EC2, and Nginx are available in the Datadog Dashboard.

We aggregate exceptions from both the public apps and admin app to CloudOps' Sentry instance.

Change Notification

Changes made to Rules, Scheduled Rule Changes, Permissions, or the read-only flag of a Release send an e-mail notification to the balrog-db-changes mailing list. This serves as an alert system: if we see changes we weren't expecting, we can investigate them.

ELB Logs

NOTE: These instructions were written before Amazon Athena existed. The next time we need to do such analysis, it's probably worth giving it a try.

The ELB logs for the public-facing application are replicated to the balrog-us-west-2-elb-logs S3 bucket, located in us-west-2. Logs are rotated very quickly, and we end up with tens of thousands of separate files each day. Because of this, and because S3 has a lot of per-file overhead, it can be tricky to do analysis on them. You're unlikely to be able to download the logs locally in any reasonable amount of time (i.e., less than a day), but mounting them on an EC2 instance in us-west-2 should provide reasonably quick access. Here's an example:

  • Launch an EC2 instance (you probably want a compute-optimized one with at least 100GB of storage).
  • Generate an access token for your CloudOps AWS account. If you don't have a CloudOps AWS account, talk to Ben Hearsum or Bensong Wong. Put the token in a plaintext file somewhere on the instance.
    • If you've chosen local storage, you'll probably need to format and mount the volume.
  • Install s3fs by following the instructions on https://github.com/s3fs-fuse/s3fs-fuse.
  • Mount the bucket on your instance, eg:
s3fs balrog-us-west-2-elb-logs /media/bucket -o passwd_file=pw.txt
  • Do some broad grepping directly on the S3 logs, and store the results in a local file. This should speed up subsequent queries. Eg:
grep '/Firefox/.*WINNT.*/release/' /media/bucket/AWSLogs/361527076523/elasticloadbalancing/us-west-2/2016/09/17/* | gzip > /media/ephemeral0/sept-17-winnt-release.txt.gz
  • Do additional queries on the new logfile, as in the sketch below.
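
Once the broad grep results are stored locally, follow-up queries can run against the compressed file without touching S3 again. A sketch, where the patterns and output filename are illustrative:

# Count en-US requests in the pre-filtered logs.
zgrep 'en-US' /media/ephemeral0/sept-17-winnt-release.txt.gz | wc -l
# Narrow down further, keeping the result compressed.
zgrep 'release-cdntest' /media/ephemeral0/sept-17-winnt-release.txt.gz | gzip > /media/ephemeral0/sept-17-cdntest.txt.gz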

Backups

Balrog uses the built-in RDS backups. The database is snapshotted nightly, and incremental backups are done throughout the day. If necessary, we have the ability to recover to within a 5 minute window. Database restoration is done by CloudOps, and they should be contacted immediately if needed.
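
For illustration only (actual restores are performed by CloudOps), an RDS point-in-time recovery via the AWS CLI looks roughly like the following; the instance identifiers and timestamp are placeholders:

# Restore a new instance from the existing one to a specific point in time.
aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier balrog-prod \
    --target-db-instance-identifier balrog-prod-restored \
    --restore-time 2016-09-17T08:45:00Z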

Deploying Changes

Balrog's stage and production infrastructure is managed by the Cloud Operations team. This section describes how to go from a reviewed patch to deploying it in production.

To ensure there's adequate time for stage deployment and testing, you should generally begin this process at least 24 hours before you want the new code live in production.

Is now a good time?

Before you deploy, consider whether it's an appropriate time. Some factors to consider:

  • Are we in the middle of an important release such as a chemspill? If so, it's probably not a good time to deploy.
  • How risky are your changes? If they're high risk, deploying on a Friday is probably a bad idea.
  • Do you need to migrate any data? If you do, make sure you have time to do so right after deploying.
  • Do you have enough time to safely do a push? Most pushes take at most 60 minutes to complete after the stage push has been done. This time is mostly affected by how long it takes you to verify your changes in stage and production.

Landing

Schema Upgrades

If you need to do a schema change you must ensure that either the current production code can run with your schema change applied, or that your new code can run with the old schema. Code and schema changes cannot be done at the same instant, so you must be able to support one of these scenarios. Generally, additive changes (column or table additions) should do the schema change first, while destructive changes (column or table deletions) should do the schema change second. You can simulate the upgrade with your local Docker containers to verify which is right for you.

When you file the deployment bug (see below), include a note about the schema change in it. Something like:

This push requires a schema change that needs to be done _prior_ to the new code going out. That can be performed by running the Docker image with the "upgrade-db" command, with DBURI set.

Bug 1295678 is an example of a push with a schema change.
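
As a sketch of the invocation described in the note above, assuming the webapp image is published as mozilla/balrog (the image name, tag, and DBURI value here are placeholders for the real deployment values):

# Run the schema upgrade against the target database before (or after) the push.
docker run --rm -e DBURI='mysql://user:pass@dbhost/balrog' mozilla/balrog:vX.Y upgrade-db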

Testing

Before asking for a push, you should do some functional testing on your local machine with the Docker images. You should do this against the master branch of the upstream repository to ensure you're testing the exact code that is to be deployed. At the very least, you should do explicit testing of all the new code that would be included in the push. Eg: if you're changing the format of a blob, make sure that you can add a new blob of that type, and that the XML response looks correct.
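
For example, a blob change can be smoke-tested locally by bringing up the containers and requesting an update directly; this is a sketch, and the port and URL path values are illustrative of a local setup, not exact:

# Start the local environment from the repository checkout.
docker-compose up -d
# Request an update and eyeball the XML response.
curl -s 'http://localhost:8000/update/3/Firefox/60.0/20180516032328/WINNT_x86_64-msvc/en-US/release/default/default/default/update.xml'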

Pushing to stage and production

Pushing live is a two-step process. First, you must push to the stage environment and ensure things are working there. Then, you can push live.

  1. Bump the in-repo version.
  2. Tag the repository with a "vX.Y" tag. Eg: "git tag -s vX.Y" (see the sketch after this list).
  3. Wait for CI jobs to complete. Unit tests must pass and a new Docker image for the webapps and the Agent must be pushed to Dockerhub before you proceed.
  4. File a bug to have the new version pushed to stage. Be sure to include the new version number, and Docker image tags for the webapps and the Agent you want deployed.
    • This bug should generally be filed about 24 hours in advance of the desired production push time to give adequate time for the stage deployment and testing.
    • Wednesdays are usually the best day to push to production, because they are generally free of release events. Unless you have a specific need to deploy on a different day, you should request the prod push for a Wednesday.
    • You should link any bugs being deployed in the "Blocks" field.
  5. Once stage has been updated, verify your changes again. Even though you've tested locally, it's important to retest in stage to make sure there are no deployment-specific issues.
  6. When stage looks good, you're ready to comment in the bug to ask for production to be updated.
  7. Reverify in production. When production has been updated, verify your changes again there. If you need to tweak rules or releases to do so, be careful not to touch any live channels (create new rules or releases if necessary). This final verification is more about making sure the right thing got deployed than about whether your code is correct.
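
As a sketch of steps 1 and 2, assuming the in-repo version lives in a version.txt file at the repository root (the file name and remote are assumptions, not confirmed details of the repository layout):

# Bump the version, commit, tag, and push so CI builds and publishes the images.
echo "X.Y" > version.txt
git commit -am "Bump version to X.Y"
git tag -s vX.Y
git push origin master vX.Y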

Meeting Notes