ReleaseEngineering/Mozpool

Mobile development boards are imaged and re-imaged by booting a live Linux distribution over PXE, which then downloads the required images and writes them to the board's storage. BMM (BlackMobileMagic) automates that process.

Architectural Description

Since the board firmware generally does not support PXE, each board's storage is initially set up with a boot loader (generally U-Boot) that does. This boot loader is the same for all installs, so simple duplication tools (e.g., an sdcard duplicator) can produce the cards, and new cards can be swapped into devices quickly.

Overall Organization

There is one imaging server per VLAN, and per current plans one VLAN per rack of mobile devices. Each imaging server runs an instance of the API, as well as tftpd and Apache to serve files, and rsyslog to handle logging. Aside from its database connection, each imaging server is independent of the others. Most imaging traffic is local to the VLAN.

Booting

On every boot, the boot loader gets PXE-related options from DHCP instructing it to load PXELINUX from its local imaging server. In normal operation, once PXELINUX is loaded and running, it downloads its default configuration, which simply chains to the second boot loader on local storage. That boot loader then starts whatever is installed, presumably Android or B2G.

Imaging

If the board has been scheduled for re-imaging, PXELINUX instead finds a board-specific configuration file that directs it to load and boot a kernel, initrd, and squashfs from the imaging server. Once booted, this live system runs a shell script, mobile-init.sh. The script examines the kernel command line given in the PXELINUX configuration file and finds a URL for a second-stage script. It downloads this script (generally from the imaging server) and executes it.
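The command-line lookup described above can be sketched as follows. The real mobile-init.sh is a shell script; this Python version only illustrates the idea. The parameter name `second_stage_url` and the example URL are assumptions, not taken from the BMM source.

```python
import shlex


def find_cmdline_param(cmdline, key):
    """Return the value of a key=value token on a kernel command line, or None."""
    for token in shlex.split(cmdline):
        name, sep, value = token.partition("=")
        if sep and name == key:
            return value
    return None


# Hypothetical command line; on a running system it would be read from /proc/cmdline.
example_cmdline = (
    "console=ttyS0 ro "
    "second_stage_url=http://imaging-server.example/scripts/second-stage.sh"
)
second_stage_url = find_cmdline_param(example_cmdline, "second_stage_url")
```

A script following this pattern would then fetch `second_stage_url` and execute the result.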

This second-stage script embodies the logic to do the reinstall. For Android, it downloads the same image every time. For B2G, it consults the imaging server via the HTTP API (see below) to determine which binaries to download. There are also a few second-stage maintenance scripts that download nothing and just run various diagnostics.

From the imaging server's perspective, the key steps in the reimaging process are to set up the board-specific PXELINUX configuration (just a symlink) and reboot the board. In practice, the imaging servers continue to monitor the board's progress, with lots of timeouts and retries.
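As a rough illustration of the symlink step, the sketch below points a per-board PXELINUX configuration at an image configuration. PXELINUX itself looks for a per-client file named after the ARP hardware type and MAC address (for Ethernet, `01-` followed by the address in lowercase hex with dashes). Whether BMM names its files this way is an assumption, as are the function names and paths used here.

```python
import os
import tempfile


def pxelinux_config_name(mac):
    """Per-client file name PXELINUX searches for an Ethernet MAC address."""
    return "01-" + mac.lower().replace(":", "-")


def schedule_reimage(cfg_dir, mac, image_config):
    """Symlink the board-specific config to the chosen image config."""
    link = os.path.join(cfg_dir, pxelinux_config_name(mac))
    if os.path.lexists(link):
        os.remove(link)  # replace any previously scheduled image
    os.symlink(image_config, link)
    return link


def cancel_reimage(cfg_dir, mac):
    """Remove the symlink so the board falls back to the default config."""
    link = os.path.join(cfg_dir, pxelinux_config_name(mac))
    if os.path.lexists(link):
        os.remove(link)


# Demonstration in a throwaway directory standing in for pxelinux.cfg/.
with tempfile.TemporaryDirectory() as cfg_dir:
    link = schedule_reimage(cfg_dir, "88:99:AA:BB:CC:DD", "../images/android.cfg")
    scheduled_target = os.readlink(link)
    cancel_reimage(cfg_dir, "88:99:AA:BB:CC:DD")
```

After the symlink is in place, rebooting the board is enough to start the imaging path; removing it returns the board to the default chain-boot behavior.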

Power Control

Power to each board can be controlled remotely via a simple, byte-oriented, TCP-based protocol. The imaging servers use this capability to reboot boards automatically.

API

Each imaging server presents an HTTP API that allows the following:

  • query the status of boards and images (the latter being PXE configs)
  • request a board be rebooted
  • request a board be reimaged
  • report a state change of a board; various scripts call this during the reimaging process so that the imaging server can track the board's state

See http://hg.mozilla.org/build/bmm/file/tip/API.txt

Logging

Boards send log output to the imaging server during the imaging process using syslog. The imaging server relays that information into the database for debugging purposes.

Inventory Sync

The imaging servers determine which boards they are responsible for by searching the Mozilla system inventory. Every 30 minutes, the "admin" node runs a cron task to synchronize the imaging service database with the contents of inventory. The following keys are required for every managed board:

  • nic.0.mac_address.0 – the board's MAC address
  • system.imaging_server.0 – the FQDN of the imaging server in the same rack as this board
  • system.relay.0 – the board's relay information, in the form fqdn:bank:relay
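A sync job might validate these keys along the following lines. This is a minimal sketch, not the actual sync code: the function names, the returned dictionary layout, and the example values are hypothetical. Only the three key names and the fqdn:bank:relay form come from the list above.

```python
from typing import NamedTuple

REQUIRED_KEYS = (
    "nic.0.mac_address.0",
    "system.imaging_server.0",
    "system.relay.0",
)


class Relay(NamedTuple):
    fqdn: str
    bank: str
    relay: str


def parse_relay(value):
    """Split relay information of the form fqdn:bank:relay."""
    parts = value.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected fqdn:bank:relay, got %r" % value)
    return Relay(*parts)


def board_from_inventory(kv):
    """Build a board record from inventory key/value pairs, or raise ValueError."""
    missing = [key for key in REQUIRED_KEYS if key not in kv]
    if missing:
        raise ValueError("missing inventory keys: " + ", ".join(missing))
    return {
        "mac_address": kv["nic.0.mac_address.0"],
        "imaging_server": kv["system.imaging_server.0"],
        "relay": parse_relay(kv["system.relay.0"]),
    }


# Hypothetical inventory entry for one board.
board = board_from_inventory({
    "nic.0.mac_address.0": "88:99:aa:bb:cc:dd",
    "system.imaging_server.0": "imaging1.example.com",
    "system.relay.0": "relay1.example.com:bank1:relay2",
})
```

The bank and relay fields are kept as strings because their exact format is not specified here.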

State Machine

Because development boards are not even vaguely reliable, the reimaging process has lots of timeouts and performs lots of retries. This is implemented as a state machine, with the state stored in the database. The state machine implements all of the timeouts and retries, and does so in such a way that it picks up where it left off after a failure or restart of the imaging server itself.
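The general technique can be sketched as below: each board's state, attempt count, and deadline live in a database row, so a restarted process can resume by reading them back. The state names, timeouts, retry limits, and table layout are all hypothetical, and SQLite stands in for whatever database the imaging servers actually use.

```python
import sqlite3
import time

# Hypothetical states: name -> (timeout in seconds, max attempts, next state on success)
STATES = {
    "rebooting": (60, 3, "imaging"),
    "imaging": (600, 2, "ready"),
}


def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS boards ("
        "name TEXT PRIMARY KEY, state TEXT, attempts INTEGER, deadline REAL)"
    )
    return db


def enter(db, name, state, attempts=1, now=None):
    """Persist a state entry; states not in STATES (ready, failed) have no deadline."""
    now = time.time() if now is None else now
    deadline = now + STATES[state][0] if state in STATES else None
    db.execute(
        "INSERT OR REPLACE INTO boards VALUES (?, ?, ?, ?)",
        (name, state, attempts, deadline),
    )
    db.commit()


def report_success(db, name, now=None):
    """Advance a board to the next state when its current step completes."""
    row = db.execute("SELECT state FROM boards WHERE name = ?", (name,)).fetchone()
    if row is None:
        raise KeyError(name)
    if row[0] in STATES:
        enter(db, name, STATES[row[0]][2], now=now)


def check_timeouts(db, now=None):
    """Retry boards whose deadline has passed, or fail them when retries run out."""
    now = time.time() if now is None else now
    rows = db.execute(
        "SELECT name, state, attempts FROM boards "
        "WHERE deadline IS NOT NULL AND deadline <= ?",
        (now,),
    ).fetchall()
    for name, state, attempts in rows:
        if attempts < STATES[state][1]:
            enter(db, name, state, attempts + 1, now)  # same state, fresh deadline
        else:
            enter(db, name, "failed", now=now)
```

Because deadlines are stored as absolute times, calling `check_timeouts` periodically after a restart handles any timeouts that expired while the server was down.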

See ReleaseEngineering/BlackMobileMagic/State Machine

Source

The source is at http://hg.mozilla.org/build/bmm