Balrog

If you are looking for the general documentation that used to live here, it has been moved into [https://github.com/mozilla/balrog the Balrog repository], and a built version of it [http://mozilla-balrog.readthedocs.io/en/latest/index.html is available on Read The Docs]. This page will continue to host information about Balrog that doesn't make sense to put into the repository, such as meeting notes and things related to our hosted versions of Balrog.

Balrog is the software that runs the server-side component of the update system used by Firefox and other Mozilla products. It is the successor to AUS (Application Update Service), which did not scale to our current needs nor allow us to adapt to more recent business requirements. Balrog helps us ship updates faster and with much more flexibility than we've had in the past.

= Infrastructure =

== Environments ==

We have a number of different Balrog environments with different purposes:

{| class="wikitable"
|-
! Environment
! App
! URL
! Deploys
! Purpose
|-
| rowspan="2" | Production
| Admin
| https://aus4-admin.mozilla.org (VPN Required)
| rowspan="2" | Manually by CloudOps
| rowspan="2" | Manage and serve production updates
|-
| Public
| https://aus5.mozilla.org and others (see the [[Balrog/Client_Domains | Client Domains page]] for details)
|-
| rowspan="2" | Stage
| Admin
| https://admin-stage.balrog.nonprod.cloudops.mozgcp.net/ (VPN Required)<br>https://balrog-admin-static-stage.stage.mozaws.net/ (VPN Required)
| rowspan="2" | When version tags (eg: v2.40) are created
| rowspan="2" | A place to submit staging Releases and verify new Balrog code with automation
|-
| Public
| https://stage.balrog.nonprod.cloudops.mozgcp.net/
|-
| rowspan="2" | Dev
| Admin
| https://admin-dev.balrog.nonprod.cloudops.mozgcp.net/ (VPN Required)
| rowspan="2" | Whenever new code is pushed to Balrog's master branch
| rowspan="2" | Manual verification of Balrog code changes in a deployed environment
|-
| Public
| https://dev.balrog.nonprod.cloudops.mozgcp.net/
|}

The database for the stage or dev environments can be rebuilt (using the latest production data dump) via the <tt>/__rebuilddb__</tt> endpoint on the respective admin host.
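For illustration, a minimal sketch of triggering a rebuild against the stage admin host, assuming the endpoint accepts a plain HTTP request and that you are connected to the VPN (the exact method and any required credentials may differ in practice):

<pre>
# Hypothetical sketch: ask the stage admin host to rebuild its database from the
# latest production data dump. VPN access is required; the HTTP method and any
# authentication details are assumptions, not confirmed by this page.
curl -i "https://admin-stage.balrog.nonprod.cloudops.mozgcp.net/__rebuilddb__"
</pre>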
== Support & Escalation ==

''If the issue may be visible to users, please make sure [irc://irc.mozilla.org/#moc #moc] is also notified. They can also assist with the notifications below.''

RelEng is the first point of contact for issues. To contact them, follow [[ReleaseEngineering#Contacting_Release_Engineering|the standard RelEng escalation path]]. If RelEng is unable to correct the issue, they may [https://mana.mozilla.org/wiki/display/SVCOPS/Contacting+Cloud+Operations escalate to CloudOps].

== Monitoring & Metrics ==

Metrics from RDS, EC2, and Nginx are available in the [https://app.datadoghq.com/dash/156924/balrog-web-aus5mozillaorg?live=true&page=0&is_auto=false&tile_size=m&fullscreen=false Datadog Dashboard]. We aggregate exceptions from both the [https://sentry.prod.mozaws.net/operations/prod-public/ public apps] and the [https://sentry.prod.mozaws.net/operations/prod-admin/ admin app] to [https://sentry.prod.mozaws.net/operations/ CloudOps' Sentry instance].

== ELB Logs ==

Balrog publishes logs to S3 buckets which are [https://sql.telemetry.mozilla.org available for querying in Redash]. The relevant tables are:
* <tt>balrog_elb_logs_aus{3,4,5}</tt> - These tables contain update request records sourced from the ELB logs of the named domain (eg: aus5). If you want to do ad-hoc queries of update requests (eg: to estimate how many users are on a particular version or channel), <tt>balrog_elb_logs_aus5</tt> is probably the table you want to query.
* <tt>balrog_elb_logs_aus_api</tt> - This table contains request logs for the aus-api.mozilla.org domain.
* <tt>log_balrog_admin_nginx_access</tt> - This table contains access logs for the admin app, sourced from nginx access logs.
* <tt>log_balrog_admin_nginx_error</tt> - This table contains error logs for the admin app, sourced from nginx error logs.
* <tt>log_balrog_admin_syslog_admin_fixed</tt> - This table contains syslog output from the admin app's Docker container.
* <tt>log_balrog_admin_syslog_agent</tt> - This table contains syslog output from the agent's Docker container.
* <tt>log_balrog_web_syslog_web_fixed</tt> - This table contains syslog output from the public app's Docker containers.

Redash should show you the table schemas in the pane on the left. If it doesn't, you can inspect them with <tt>describe $table</tt>.

== Backups ==

Balrog uses the built-in RDS backups. The database is snapshotted nightly, and incremental backups are done throughout the day. If necessary, we have the ability to recover to within a 5-minute window. Database restoration is done by CloudOps, and they should be contacted immediately if needed.

== Deploying Changes ==

Balrog's [https://mana.mozilla.org/wiki/display/SVCOPS/Firefox+Update+Service stage and production infrastructure] is managed by [https://mana.mozilla.org/wiki/display/SVCOPS/Contacting+Cloud+Operations the Cloud Operations team]. This section describes how to go from a reviewed patch to new code deployed in production. You should generally begin this process '''at least 24 hours''' before you want the new code live in production. This gives the new code a chance to bake in stage.

At a high level, the deployment process looks like this:
* Verify the new code in dev
* Bake the new code in stage
* Deploy to prod

Each part of this process is described in more detail below.

=== Is now a good time? ===

Before you deploy, consider whether it's an appropriate time to do so. Some factors to consider:
* Are we in the middle of an important release such as a chemspill? If so, it's probably not a good time to deploy.
* Is it Friday? You probably don't want to deploy on a Friday except in extreme circumstances.
* Do you have enough time to safely do a push? Most pushes take at most 60 minutes to complete once the production push has begun.

=== Schema Upgrades ===

If you need to make a schema change, you '''must''' ensure that either the current production code can run with your schema change applied, or that your new code can run with the old schema. Code and schema changes '''cannot''' be done at the same instant, so you must be able to support one of these scenarios. Generally, additive changes (column or table additions) should have the schema change applied before the new code goes out, while destructive changes (column or table deletions) should have it applied after the new code is deployed. You can simulate the upgrade with your local Docker containers to verify which is right for you.

A quick way to find out whether you have a schema change is to diff the current tip of the master branch against the currently deployed tag, eg:
<pre>
tag=REPLACEME
git diff $tag
</pre>

When you file the deployment bug (see below), include a note about the schema change in it. Something like:
<pre>
This push requires a schema change that needs to be done _prior_ to the new code
going out. That can be performed by running the Docker image with the "upgrade-db"
command, with DBURI set.
</pre>
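For reference, a hypothetical invocation of that upgrade step might look like the sketch below; the image name, tag, and <tt>DBURI</tt> value are placeholders, and in practice CloudOps runs the real command against the production database:

<pre>
# Sketch only: apply pending schema migrations by running the Docker image with
# the "upgrade-db" command. The image name/tag and DBURI are illustrative.
docker run --rm \
  -e DBURI="mysql://balrogadmin:balrogadmin@db-host/balrog" \
  mozilla/balrog:vX.Y \
  upgrade-db
</pre>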
{{bug|1295678}} is an example of a push with a schema change.

=== Verification in dev ===

The dev environment automatically deploys new code from the master branch of [https://github.com/mozilla/balrog the Balrog repository] (including any necessary schema changes). Before beginning the deployment procedure, you should do some functional testing there. At the very least, you should explicitly test all of the new code that would be included in the push. Eg: if you're changing the format of a blob, make sure that you can add a new blob of that type, and that the XML response looks correct.

'''If you have schema changes''', you must also ensure that the existing deployed code will work with the new schema. To do this, CloudOps will downgrade the dev apps. You should then do some routine testing (make some changes to some objects, try some update requests) to ensure that everything works. If you have any issues you CANNOT proceed to production.
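As a concrete example of the kind of spot check described above, you can request an update from the dev public app and eyeball the XML. This is only a sketch: the product, version, build ID, and other path fields below are made-up values, and the exact path schema depends on which update URL version your test targets:

<pre>
# Sketch: fetch an update response from the dev public app and inspect the XML.
# All path components here are illustrative placeholders, not real builds.
curl "https://dev.balrog.nonprod.cloudops.mozgcp.net/update/3/Firefox/70.0/20191001000000/WINNT_x86_64-msvc/en-US/release/default/default/default/update.xml"
</pre>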
=== Baking in stage ===

To get the new code into stage you must create a new Release on Github as follows:
# Tag the repository with a "vX.Y" tag. Eg: "git tag -s vX.Y && git push --tags"
# Diff against the previous release tag. Eg: "git diff v2.24 v2.25", to double-check whether or not there are schema changes.
#* Look for anything unexpected, or any '''schema changes'''. If schema changes are present, see the above section for instructions on handling them.
# [https://github.com/mozilla/balrog/releases Create a new Release on Github]. This creates new Docker images tagged with your version, and deploys them to stage. It may take upwards of 30 minutes for the deployment to happen.

Once the changes are deployed to stage, let them bake for at least 24 hours. You can do additional targeted testing here if you wish, or simply wait for nightlies/releases to prod things along. It's a good idea to watch Sentry for new exceptions that may show up, and Datadog for any notable changes in the shape of the traffic.

'''Important Note!''' Only two-part version numbers (as shown above) are supported by our deployment pipeline.

=== Pushing to production ===

Pushing live requires CloudOps. For non-urgent pushes, you should begin this procedure a few hours in advance to give CloudOps time to notice and respond. For urgent pushes, file the bug immediately and [https://mana.mozilla.org/wiki/display/SVCOPS/Contacting+Cloud+Operations escalate if no action is taken quickly].

Either way, you must follow this procedure to push:
# [https://bugzilla.mozilla.org/enter_bug.cgi?assigned_to=oremj%40mozilla.com&bug_file_loc=http%3A%2F%2F&bug_ignored=0&bug_severity=normal&bug_status=NEW&cc=oremj%40mozilla.com&cc=jbuckley%40mozilla.com&cc=bhearsum%40mozilla.com&cf_blocking_fennec=---&cf_fx_iteration=---&cf_fx_points=---&cf_status_firefox56=---&cf_status_firefox57=---&cf_status_firefox58=---&cf_status_firefox_esr52=---&cf_tracking_firefox56=---&cf_tracking_firefox57=---&cf_tracking_firefox58=---&cf_tracking_firefox_esr52=---&cf_tracking_firefox_relnote=---&comment=Balrog%20version%20X.Y%20is%20ready%20to%20be%20pushed%20to%20prod.%20Please%20deploy%20the%20new%20Docker%20images%20%28vX.Y%29%20for%20admin%2C%20public%2C%20and%20the%20agent.%0D%0A%0D%0AWe%27d%20like%20the%20production%20push%20for%20this%20to%20happen%20around%2011am%20pacific%20on%20%28DATE%20GOES%20HERE%29.%0D%0A%0D%0AONE%20OF%3A%0D%0A%28NO%20SCHEMA%20CHANGE%29%20This%20release%20does%20not%20contain%20a%20schema%20change.%0D%0A%28ADDITIVE%20SCHEMA%20CHANGE%29%20This%20release%20requires%20a%20schema%20change%20that%20needs%20to%20be%20done%20_prior_%20to%20the%20new%20code%20going%20out.%20It%20can%20be%20performed%20by%20running%20the%20Docker%20image%20with%20the%20%22upgrade-db%22%20command%2C%20with%20DBURI%20set.%0D%0A%28DESTRUCTIVE%20SCHEMA%20CHANGE%29%20This%20release%20requires%20a%20schema%20change%20that%20needs%20to%20be%20done%20_after_%20to%20the%20new%20code%20is%20fully%20deployed.%20It%20can%20be%20performed%20by%20running%20the%20Docker%20image%20with%20the%20%22upgrade-db%22%20command%2C%20with%20DBURI%20set.%0D%0A%0D%0AONE%20OF%3A%0D%0A%28NO%20SCHEMA%20CHANGE%20OR%20ADDITIVE%20SCHEMA%20CHANGE%29%20If%20anything%20goes%20wrong%20with%20the%20new%20version%20in%20production%2C%20we%20can%20safely%20rollback.%0D%0A%28DESTRUCTIVE%20SCHEMA%20CHANGE%29%20If%20anything%20goes%20wrong%20with%20the%20new%20version%20in%20production%2C%20we%20must%20revert%20the%20schema%20change%20prior%20to%20rolling%20any%20code%20back%20by%20running%20the%20following%20command%20in%20the%20admin%20container%3A%0D%0Apython%20%2Fapp%2Fscripts%2Fmanage-db.py%20-d%20%24DBURI%20--version%20X%0D%0A%28UNABLE%20TO%20COME%20UP%20WITH%20SAFE%20ROLLBACK%20PROCEDURE%20-%20THIS%20SHOULD%20NEVER%20HAPPEN%29%20This%20push%20contains%20changes%20that%20cannot%20be%20safely%20rolled%20back.%20If%20anything%20goes%20wrong%20with%20the%20new%20version%20in%20production%2C%20please%20escalate%20to%20me.%0D%0A%0D%0AONE%20OF%3A%0D%0A%28NO%20NEW%20INTERDEPENDENT%20CODE%20BETWEEN%20admin%2Fpublic%2Fagent%29%20If%20anything%20goes%20wrong%20with%20one%20of%20the%20apps%2C%20it%20may%20be%20rolled%20back%20independently%20of%20the%20others.%0D%0A%28NEW%20CODE%20REQUIRES%20admin%2Fpublic%2Fagent%20ON%20THE%20MOST%20RECENT%20VERSION%29%20If%20anything%20goes%20wrong%20with%20just%20one%20of%20the%20apps%2C%20all%20of%20them%20must%20be%20rolled%20back.&component=Operations%3A%20Deployment%20Requests&contenttypemethod=autodetect&contenttypeselection=text%2Fplain&defined_groups=1&flag_type-37=X&flag_type-4=X&flag_type-5=X&flag_type-607=X&flag_type-708=X&flag_type-721=X&flag_type-737=X&flag_type-787=X&flag_type-800=X&flag_type-803=X&flag_type-846=X&flag_type-864=X&flag_type-914=X&flag_type-916=X&form_name=enter_bug&maketemplate=Remember%20values%20as%20bookmarkable%20template&op_sys=Unspecified&priority=--&product=Cloud%20Services&rep_platform=Unspecified&short_desc=please%20deploy%20balrog%20X.Y%20to%20prod&target_milestone=---&version=unspecified File a bug] to have the new version pushed to production.
#* Wednesdays around 11am Pacific are usually the best time to push to production, because they are generally free of release events, nightlies, and cronjobs. Unless you have a specific need to deploy on a different day, you should request the prod push for a Wednesday around that time.
#* You should link any bugs being deployed in the "Blocks" field.
#* Make sure you substitute the version number and choose the correct options from the bug template.
# Once the push has happened, verify that the code was pushed to production by checking the <tt>__version__</tt> endpoints on [https://aus4-admin.mozilla.org/__version__ the Admin] and [https://aus5.mozilla.org/__version__ Public] apps (see the sketch after this list).
# Bump the [https://github.com/mozilla/balrog/blob/master/version.txt in-repo version] to the next available one to ensure the next push gets a new version.
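The verification step above can be done from the command line; this is just an illustrative check, and both endpoints should report the vX.Y tag that was just released:

<pre>
# Confirm that both apps report the version that was just deployed.
curl https://aus5.mozilla.org/__version__
curl https://aus4-admin.mozilla.org/__version__   # VPN required
</pre>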
= Meeting Notes =

* [[Balrog/Meetings/CloudOps - June 22, 2016 | CloudOps Migration Meeting - June 22, 2016]]
* [[Balrog/Meetings/Balrog+absearch Information Exchange | Balrog/absearch Information Exchange - June 28, 2016]]
* [[Balrog/Meetings/CloudOps - June 29, 2016 | CloudOps Migration Meeting - June 29, 2016]]
* [[Balrog/Meetings/CloudOps - July 6, 2016 | CloudOps Migration Meeting - July 6, 2016]]
* [[Balrog/Meetings/CloudOps - July 12, 2016 | CloudOps Final Cut Over Planning - July 12, 2016]]
* [[Balrog/Meetings/Balrog Worker Brainingstorming - July 19, 2016 | Balrog Worker Brainstorming - July 19, 2016]]
* [[Balrog/Meetings/CloudOps - July 27, 2016 | CloudOps Meeting - July 27, 2016]]
* [[Balrog/Meetings/CloudOps - August 10, 2016 | CloudOps Meeting - August 10, 2016]]
* [[Balrog/Meetings/CloudOps - September 7, 2016 | CloudOps Meeting - September 7, 2016]]
* [[Balrog/Meetings/CloudOps - September 14, 2016 | CloudOps Meeting - September 14, 2016]]
* [[Balrog/Meetings/CloudOps - September 21, 2016 | CloudOps Meeting - September 21, 2016]]
* [[Balrog/Meetings/CloudOps - September 28, 2016 | CloudOps Meeting - September 28, 2016]]
* [[Balrog/Meetings/CloudOps - October 19, 2016 | CloudOps Meeting - October 19, 2016]]
* [[Balrog/Meetings/CloudOps - November 2, 2016 | CloudOps Meeting - November 2, 2016]]
* [[Balrog/Meetings/CloudOps - November 9, 2016 | CloudOps Meeting - November 9, 2016]]
* [[Balrog/Meetings/CloudOps - January 18, 2017 | CloudOps Meeting - January 17, 2017]]
* [[Balrog/Meetings/CloudOps - February 22, 2017 | CloudOps Meeting - February 22, 2017]]