ReleaseEngineering/Mozpool

= Overview =


Mozilla needs to run its applications on various mobile devices, such as Tegras, Pandas, and even full smartphones.  These devices do not act much like the servers that fill the rest of Mozilla's datacenters: they have limited resources, no redundancy, and are comparatively unreliable.  With the advent of Firefox OS, Mozilla also needs the ability to automatically reinstall the entire OS on devices.
 
Mozpool is a system for managing these devices.  Users (automated or human) who need a device matching certain specifications can request one from Mozpool, and Mozpool will find such a device, installing a new operating system if necessary.  The middle layer of the system (Lifeguard) handles such reinstalls reliably, and also detects and investigates device failure, removing problematic devices from the pool.  System administrators can examine these failed devices and repair them, returning them to the pool.  The lowest level, Black Mobile Magic (BMM), handles low-level hardware details: automatic power control via IP-addressable power switches; a network-hosted Linux environment for performing software installations; and pinging, logging, and so forth.
 
Because continued operation of this system is business-critical, it is designed to be resilient not only to the failure of individual devices, but also to the failure of the servers running Mozpool itself.
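
For automated users, this request flow is driven entirely over the HTTP API (see [[ReleaseEngineering/Mozpool/How To Access the Mozpool API]] and the API.txt link below).  The following is a minimal sketch of such a request with curl; the endpoint paths and JSON fields shown are illustrative assumptions, so treat API.txt as the authoritative reference.
 # Ask the local imaging server for any free device running a given image.
 # (Endpoint paths and field names are assumptions; consult API.txt before use.)
 curl -X POST "http://<imaging-server>/api/device/any/request/" \
      -d '{"assignee": "you@mozilla.com", "duration": 3600, "image": "panda-android-4.0.4_v3.3"}'
 # When finished, return the device so it goes back into the pool.
 curl -X POST "http://<imaging-server>/api/request/<request-id>/return/"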
 
= Policies and Procedures =
 
* [[ReleaseEngineering/Mozpool/Adding New Android Images to Mozpool]]
* [[ReleaseEngineering/Mozpool/Allocating Pandas Between Teams]]
* [[ReleaseEngineering/Mozpool/Handling Panda Failures]]
 
= Available Device Images =


; panda-android-4.0.4_v3.2
:  Added SUTAgent 1.20 to base image
; panda-android-4.0.4_v3.3
: Added [http://download.macromedia.com/pub/flashplayer/installers/archive/android/11.1.115.81/install_flash_player_ics.apk Adobe flash 11.1.115.81] to base image
; panda-android-4.0.4_v3.1
:  ''todo''
; android
:  ''todo''
; repair-boot
:  ''todo''
; b2g
:  ''obsolete''


= How-To's =


* [[ReleaseEngineering/Mozpool/How To Create a Panda Android Image Suitable For Mozpool]]
* [[ReleaseEngineering/Mozpool/How To Interpret Device State in Mozpool]]
* [[ReleaseEngineering/Mozpool/How To Use the Mozpool Web UI]] including such classic hits as
** How to request a device for a loan
** How to manually re-image a device
** How to control the power on a device
* [[ReleaseEngineering/Mozpool/How To Access the Mozpool API]]


= Links =


* repositories - https://github.com/mozilla/mozpool and http://hg.mozilla.org/build/mozpool (synchronized by hand by developers)
* http://hg.mozilla.org/build/mozpool/file/default/README.md (version-controlled documentation)
* http://hg.mozilla.org/build/mozpool/file/default/API.txt (API documentation)
* http://hg.mozilla.org/build/mozpool/file/default/sql/schema.sql (DB Schema)
* Auto-Tools project pages
** [[Auto-tools/Projects/MozPool]]
** [[Auto-tools/Projects/Lifeguard]]
* PuppetAgain modules (installation details)
** [[ReleaseEngineering/PuppetAgain/Modules/bmm]]
** [[ReleaseEngineering/PuppetAgain/Modules/mozpool]]
* https://mana.mozilla.org/wiki/display/IT/Mozpool (employees only; IT-oriented details of the system implementation)


= Architectural Description =


See http://hg.mozilla.org/build/mozpool/file/default/README.md for the most up-to-date architectural description of the system.


= Source =


The source is at http://hg.mozilla.org/build/mozpool


= User Interface =


The Mozpool user interface is available through a web browser.  The home page shows the three layers of the system (Mozpool, Lifeguard, and BMM).  Clicking on any of those shows a UI specific to the layer.  The BMM UI allows direct control of device power, as well as manual PXE booting; this layer is of most interest to datacenter operations staff.  The Lifeguard layer allows managed PXE boots and power cycles, as well as forced state transitions.
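
The power and PXE operations exposed in these UIs are also reachable over the HTTP API for scripting.  A rough sketch with curl; the endpoint path is an assumption based on API.txt rather than something verified here:
 # Power-cycle a device through its imaging server (path is an assumption; see API.txt).
 curl -X POST "http://<imaging-server>/api/device/<device-name>/power-cycle/"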


= Deployment =


Mozpool is a Python daemon that runs on multiple imaging servers.  It uses a database backend and HTTP API for communication between servers.  Its frontend is a dynamic web application.  The BMM equipment - TFTP servers, syslog daemons, and so on - runs on the same systems.


Mozpool is designed to be deployed in multiple "pools" within Mozilla.  The first and likely largest is release engineering.


== Release Engineering ==


In the scl3 datacenter, we have an initial deployment of 10 racks of Pandaboards.  Each rack holds about 80 Pandas, grouped in custom-built chassis, for a total of about 800 Pandas.  Each rack also contains seven "foopies" (which proxy between the Pandas and Buildbot) and one imaging server.  Each rack has a dedicated VLAN, keeping most network traffic local to the rack.  The database backend is MySQL.  See the puppet modules, linked above, for more details of the deployment.


At the BMM and Lifeguard levels, each imaging server is responsible for the Pandas in its rack, as assigned in inventory.  At the Mozpool level, each imaging server is responsible for all requests that were initiated locally.  Mozpool uses HTTP to communicate with Lifeguard on other imaging servers when it needs to reserve a non-local device.


= Mozpool Client =
In Release Engineering we use the mozpool client to talk to the Mozpool servers and request Panda boards.
To do this we install the Python package inside a virtual environment, as sketched below.
The package is published to the internal PyPI servers:
* http://pypi.pvt.build.mozilla.org/pub/
* http://pypi.pub.build.mozilla.org/pub/
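
A minimal sketch of that setup, assuming the mirrors above are flat directories of sdist tarballs that pip can consume via --find-links (adjust to however your environment reaches the internal PyPI):
 # Create an isolated environment and install the client from the internal mirror.
 virtualenv mozpool-venv
 . mozpool-venv/bin/activate
 pip install --find-links http://pypi.pub.build.mozilla.org/pub/ mozpoolclient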


To create a new packaged version, check out the mozpool repo and do the following (summarized in the sketch after these steps):
# Make your code changes
# Update the version in [http://hg.mozilla.org/build/mozpool/file/default/mozpoolclient/setup.py#l5 setup.py]
# Add a new line to [http://hg.mozilla.org/build/mozpool/file/default/mozpoolclient/CHANGES.txt CHANGES.txt] with the new version, the date and what is changing
# cd mozpoolclient && python setup.py sdist
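
Roughly, those steps boil down to the following (paths assume you are at the root of a mozpool checkout; the dist/ output location is standard setuptools behaviour):
 cd mozpoolclient
 # Steps 1-3: bump the version in setup.py and describe the change in CHANGES.txt
 vi setup.py CHANGES.txt
 # Step 4: build the source distribution; the tarball is written to dist/
 python setup.py sdist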


To deploy to our PyPI setup, [[ReleaseEngineering:Buildduty:Other_Duties#Python_packages|follow these instructions]].


There is also a "fork" of the client code that lives in the tools repo: http://hg.mozilla.org/build/tools/lib/python/vendor/mozpoolclient-0.1.6


To update this version run the following commands:
 OLD=0.1.5
 NEW=0.1.6
 cd tools/lib/python/vendor
 hg move mozpoolclient-${OLD} mozpoolclient-${NEW}
 # Assuming mozpool is checked out at the same level as your tools repo.
 rsync --recursive --delete ../../../../mozpool/mozpoolclient/* mozpoolclient-${NEW}
 # Bump the version in http://mxr.mozilla.org/build/source/tools/lib/python/vendorlibs.pth
 vi ../vendorlibs.pth
 hg commit -m "Bumping mozpool client vendor version from ${OLD} to ${NEW}"
 hg push


'''NOTE''': if you're making API changes to the mozpool client, you'll need to update the consumers in the tools repo as well before committing.


If you're the pypi package maintainer (armenzg or dustin), you can follow these [??? instructions].
