IT/Maintenance/Quarterly

Introduction

This document describes the quarterly process for upgrading Red Hat servers managed by Mozilla IT. It also applies to firmware upgrades that require a system reboot (BIOS and RAID controller firmware upgrades).

Mozilla Operations Center (MOC) will be responsible for coordinating and enforcing the quarterly upgrades.

Unless otherwise mentioned, "upgrade" in this document refers to a full system upgrade (that is, yum update).

Note: Some groups of servers require special attention, and a full system upgrade is not recommended for them.

Communications & Notifications

Mozilla MOC is responsible for communications and notifications and will use the existing process.

Procedure

Preparations, as defined in this document, should start within the first two weeks of a new quarter. The goal is to have all systems upgraded within the first 8 weeks of the quarter.
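The two deadlines above can be worked out from the quarter start date. The following is a minimal sketch, assuming calendar quarters that begin on the first of January, April, July, and October (the source does not say which quarter boundaries Mozilla IT uses). The function names and dictionary keys are hypothetical.

```python
from datetime import date, timedelta


def quarter_start(day: date) -> date:
    """Return the first day of the calendar quarter containing `day`.

    Assumption: quarters start in January, April, July, and October.
    """
    first_month = 3 * ((day.month - 1) // 3) + 1
    return date(day.year, first_month, 1)


def quarterly_deadlines(day: date) -> dict:
    """Return the preparation and upgrade deadlines for the quarter of `day`."""
    start = quarter_start(day)
    return {
        "quarter_start": start,
        # Preparations should have started by the end of week 2.
        "preparations_started_by": start + timedelta(weeks=2),
        # All systems should be upgraded by the end of week 8.
        "upgrades_done_by": start + timedelta(weeks=8),
    }


if __name__ == "__main__":
    for name, value in quarterly_deadlines(date.today()).items():
        print(f"{name}: {value.isoformat()}")
```

If the quarters in use are fiscal rather than calendar quarters, only quarter_start would need to change.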

Candidate List

Mozilla's inventory ( https://inventory.mozilla.org ) will be used to generate a list of candidate servers. The list will be generated within the first two weeks of each quarter.

Systems that are marked in the inventory with one of the following statuses will be candidates for upgrades:

  • production
  • building
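The status filter can be sketched as follows. The record layout (a list of dictionaries with "hostname" and "status" keys) is an assumption made for illustration; the source does not describe how the inventory exports its data, and the hostnames below are made up.

```python
# Statuses that make a system a candidate for the quarterly upgrade.
CANDIDATE_STATUSES = {"production", "building"}


def candidate_hosts(records):
    """Return the sorted hostnames whose status is a candidate status.

    `records` is a hypothetical export: an iterable of dicts with
    "hostname" and "status" keys.
    """
    return sorted(
        record["hostname"]
        for record in records
        if record.get("status", "").strip().lower() in CANDIDATE_STATUSES
    )


if __name__ == "__main__":
    sample = [
        {"hostname": "web1.example.com", "status": "production"},
        {"hostname": "db1.example.com", "status": "decommissioned"},
        {"hostname": "build1.example.com", "status": "building"},
    ]
    for host in candidate_hosts(sample):
        print(host)
```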

Upgrade Process

Generally, upgrades will take place on dev/staging systems first, with sufficient post-upgrade review time to verify that the upgrades have not affected stability or performance.

Clusters

Cluster upgrades will be rolled out in chunks (~20% of the cluster at a time) unless that is not possible. This allows for careful review of stability and performance before additional hosts are upgraded.
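One way to plan the chunks is to split the cluster's host list into batches of roughly 20%. This is a sketch only; the function name, the rounding choice (round the batch size up, with a minimum of one host), and the hostnames are assumptions, not part of the documented process.

```python
def upgrade_batches(hosts, percent=20):
    """Split `hosts` into consecutive batches of about `percent` of the cluster.

    The batch size is rounded up so that a small cluster still makes
    progress; the last batch may be smaller than the others.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    hosts = list(hosts)
    if not hosts:
        return []
    # Integer ceiling of len(hosts) * percent / 100, at least 1.
    size = max(1, (len(hosts) * percent + 99) // 100)
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


if __name__ == "__main__":
    cluster = [f"web{n}.example.com" for n in range(1, 11)]
    for number, batch in enumerate(upgrade_batches(cluster), start=1):
        print(f"batch {number}: {', '.join(batch)}")
```

Stability and performance would be reviewed after each batch before the next one is started.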

  • Verify package upgrades
    • Run yum update and work with system owners to verify the package list, to prevent accidental upgrades.
  • If there are conflicting upgrades, use Puppet to add the affected packages to the exclude list in /etc/yum.conf
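The exclusion step relies on the exclude option of /etc/yum.conf, which takes a space-separated list of package names and allows shell globs. The helper below is a hypothetical sketch that builds such a line from a list of conflicting packages; the package names are examples only, and how Puppet delivers the line to each host is not covered by this document.

```python
def yum_exclude_line(packages):
    """Build an `exclude=` line for the [main] section of /etc/yum.conf.

    Duplicates and blank entries are dropped; an empty string is
    returned when there is nothing to exclude.
    """
    names = sorted({name.strip() for name in packages if name.strip()})
    if not names:
        return ""
    return "exclude=" + " ".join(names)


if __name__ == "__main__":
    # Example package names only, not a recommendation.
    print(yum_exclude_line(["kernel*", "mysql-server", "kernel*"]))
```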

Kernel Upgrades

  1. Consult with Security Assurance on dependent packages to be upgraded.
  2. Alert Security Assurance if there are systems where a kernel upgrade is not possible, and work out a contingency plan as necessary.