This page collects information about running *tests*: how to understand where job time goes, which metrics we need, and how to determine the efficiency of the system.

Information about jobs

Non-running-tests wall time:

machine reboot time (if applicable)
runner (if applicable)
buildslave connecting to the master, and the master assigning a job
buildbot steps besides mozharness call
buildbot steps lag (due to master lagginess)
mozharness non-running-tests actions
- clobber
- download-and-extract
- checkout
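
One way to reason about the list above is to treat non-test wall time as whatever remains of a job once the time spent actually running tests is subtracted. The sketch below is illustrative only: `non_test_overhead` is a hypothetical helper, the phase names are assumed labels, and the durations are made-up placeholders rather than measured values.

```python
def non_test_overhead(phases, test_phase="run-tests"):
    """Return (overhead_seconds, overhead_fraction) for one job.

    `phases` maps a phase name to its wall time in seconds.
    `test_phase` is the (assumed) name of the phase that runs the tests.
    """
    total = sum(phases.values())
    overhead = total - phases.get(test_phase, 0)
    fraction = overhead / total if total else 0.0
    return overhead, fraction


# Placeholder numbers for illustration only, not real measurements.
example_job = {
    "reboot": 120,
    "runner": 30,
    "clobber": 15,
    "download-and-extract": 90,
    "checkout": 45,
    "run-tests": 900,
}

overhead, fraction = non_test_overhead(example_job)
print(f"non-test wall time: {overhead}s ({fraction:.0%} of the job)")
```

Tracking the fraction rather than the absolute number makes short and long jobs comparable.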

About reboots

We always reboot Windows testers, since runner isn't managing all the processes there. We also reboot after any Android, emulator, mochitest or reftest job, since those change the system state in ways we haven't been able to identify; the only way to get back to a known good state is to reboot.

Reboots can happen as part of runner (post mozharness run) or as part of mozharness.

From job to job

Finish job --> reboot (mh or runner) --> bootup --> runner does pre-flight checks, including running clobber --> starts buildbot

When runner is in place, we know when a machine starts buildbot again, since runner reports to InfluxDB.

NOTE: bootup time is not the same as AWS slave spin-up time.
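
If a timestamp were available for each transition in the job-to-job sequence, the time spent in every stage could be derived as below. This is a minimal sketch under assumptions: the stage names and the example timestamps are hypothetical, and it does not reflect how runner actually reports to InfluxDB.

```python
from datetime import datetime

# Hypothetical names for the transitions in the job-to-job sequence.
STAGES = [
    "job_finished",
    "reboot_started",
    "booted",
    "runner_done",
    "buildbot_started",
]


def stage_durations(timestamps):
    """Return the seconds elapsed between each pair of consecutive stages."""
    result = {}
    for before, after in zip(STAGES, STAGES[1:]):
        delta = timestamps[after] - timestamps[before]
        result[f"{before} -> {after}"] = delta.total_seconds()
    return result


# Made-up timestamps for illustration only.
example = {
    "job_finished": datetime(2015, 10, 1, 12, 0, 0),
    "reboot_started": datetime(2015, 10, 1, 12, 0, 5),
    "booted": datetime(2015, 10, 1, 12, 2, 5),
    "runner_done": datetime(2015, 10, 1, 12, 3, 5),
    "buildbot_started": datetime(2015, 10, 1, 12, 3, 15),
}

durations = stage_durations(example)
turnaround = sum(durations.values())
for stage, seconds in durations.items():
    print(f"{stage}: {seconds:.0f}s")
print(f"total turnaround: {turnaround:.0f}s")
```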

Known bugs

We are currently experiencing lags introduced by the masters. Ideas to reduce them:
- reduce # of active jobs running on a master
- reduce # of buildbot steps
- reduce output
  - the reason this impacts step lag is that the log processing is happening over the same channel as the start/stop commands
  - can we make mozharness not output to stdio and make the log_uploader.py upload the Mozharness log and set log_url to it?
- send logs back to the master in bigger chunks (fewer interruptions of the masters)
- http://hg.mozilla.org/build/buildbotcustom/file/03644c855bb4/bin/log_uploader.py#l111
  - the data is somewhat structured already - that function serializes it out to the current format
bug 1209112 - Virtualenv cache always gets clobbered
bug 1208223 - We lack Mozharness metrics for test jobs (per-action)
We lack per-Buildbot-step metrics
- We have some data on pulse but we don't know the real elapsedTime
We don't have runner for Windows test jobs
- This would move cleanup steps to before Buildbot start-up
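
The idea listed above of sending logs to the master in bigger chunks can be illustrated with a generic batching routine. This is a sketch of the general technique only, not of how Buildbot or `log_uploader.py` transmit logs; `chunk_lines` and the byte limits are hypothetical.

```python
def chunk_lines(lines, max_bytes):
    """Group log lines into chunks of at most max_bytes where possible.

    A single line longer than max_bytes is still emitted as its own chunk.
    """
    chunk, size = [], 0
    for line in lines:
        line_size = len(line.encode("utf-8"))
        # Flush the current chunk before it would exceed the limit.
        if chunk and size + line_size > max_bytes:
            yield "".join(chunk)
            chunk, size = [], 0
        chunk.append(line)
        size += line_size
    if chunk:
        yield "".join(chunk)


# Synthetic log output for illustration only.
log_lines = [f"line {i}\n" for i in range(1000)]

small = list(chunk_lines(log_lines, 64))
large = list(chunk_lines(log_lines, 4096))
print(f"{len(small)} messages at 64 bytes, {len(large)} at 4096 bytes")
```

Fewer, larger messages mean the receiver is interrupted less often for the same amount of log data, at the cost of the log appearing with more delay.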

Optimizations

Auditing

Evaluate which jobs can be combined or re-shuffled

Sources

Structured logs

Jobs by status (ActiveData)

buildbot_status    duration
exception              3473
failure             1353995
retry                107128
success           174430338
warnings            8688192
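
To put the figures above in proportion, each status can be expressed as a share of the total. The numbers are copied from the table; the unit of `duration` is not stated on this page, so the sketch only computes relative shares.

```python
# Values copied from the "Jobs by status" table; unit not stated in the source.
durations = {
    "exception": 3473,
    "failure": 1353995,
    "retry": 107128,
    "success": 174430338,
    "warnings": 8688192,
}

total = sum(durations.values())
shares = {status: value / total for status, value in durations.items()}

# Print the statuses from largest to smallest share.
for status, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{status:<10} {share:8.3%}")
```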

InfluxDB

InfluxDB admin console (various DBs)
Buildbot master lags: hosted graphite dashboard
- The master lag is calculated by measuring the reported time of one of the initial steps that should be nearly instantaneous
- What is the impact on jobs?
Tree uptimes, end to end, branch load, time per push: hosted graphite dashboard
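
The master lag calculation described above (timing a step that should be nearly instantaneous) can be sketched as follows. The function name, the `expected` baseline and the sample values are assumptions for illustration; the real dashboard may compute this differently.

```python
from statistics import median


def estimate_master_lag(step_durations, expected=0.0):
    """Estimate master lag from the reported durations of a sentinel step.

    `step_durations` are reported times, in seconds, of a step that should
    take roughly `expected` seconds. Anything above that is counted as lag.
    Returns None when there are no samples.
    """
    if not step_durations:
        return None
    lags = [max(0.0, d - expected) for d in step_durations]
    return {"median": median(lags), "max": max(lags)}


# Made-up samples for illustration only.
samples = [0.2, 0.5, 12.0, 0.3, 4.0]
lag = estimate_master_lag(samples)
print(lag)
```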

Per buildbot step metrics - pulse stream

grafana runner dashboard
- We only have support for Linux and Mac

User:Armenzg/Test pool efficiency
