User:Armenzg/Test pool efficiency

This page collects information about running *tests*: how to understand where the time goes, which metrics to gather, and how to determine the efficiency of the system.

Information about jobs

Non-running-tests wall time:

  • machine reboot time (if applicable)
  • runner (if applicable)
  • buildslave connecting to the master and the master assigning a job
  • buildbot steps besides mozharness call
  • buildbot steps lag (due to master lagginess)
  • mozharness non-running-tests actions
    • clobber
    • download-and-extract
    • checkout
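
A minimal sketch of how the overhead listed above could be summed for a single job, assuming per-phase durations in seconds are already available. The phase names and all numbers below are hypothetical, not measured values.

```python
# Hypothetical per-phase wall times for one test job, in seconds.
# The phase names mirror the list above; the numbers are made up.
job_phases = {
    "reboot": 120.0,
    "runner": 30.0,
    "buildslave_connect": 15.0,
    "buildbot_other_steps": 20.0,
    "buildbot_step_lag": 25.0,
    "mozharness_clobber": 10.0,
    "mozharness_download_and_extract": 90.0,
    "mozharness_checkout": 40.0,
    "run_tests": 1050.0,
}


def overhead_summary(phases, test_phase="run_tests"):
    """Return (overhead_seconds, total_seconds, overhead_fraction)."""
    total = sum(phases.values())
    # Everything that is not the test phase counts as overhead.
    overhead = total - phases.get(test_phase, 0.0)
    fraction = overhead / total if total else 0.0
    return overhead, total, fraction


overhead, total, fraction = overhead_summary(job_phases)
print(f"non-test overhead: {overhead:.0f}s of {total:.0f}s ({fraction:.1%})")
```

Tracking the fraction rather than only the absolute overhead makes short and long jobs comparable.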

About reboots

We always reboot Windows testers, since runner isn't managing all the processes there. We also reboot after any Android, emulator, mochitest or reftest job, since those change the system state in ways we haven't been able to identify; the only way to get back to a known good state is to reboot.

Reboots can happen as part of runner (post mozharness run) or as part of mozharness.

From job to job

Finish job --> reboot (mh or runner) --> bootup --> runner does pre-flight checks, including running clobber --> starts buildbot

When runner is in place, we know when a machine starts buildbot again, since runner reports to InfluxDB.

NOTE: bootup time is not the same as AWS slave spin-up time
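
The job-to-job sequence above can be measured stage by stage if each transition has a timestamp. A minimal sketch, assuming such timestamps can be collected; the event names and times below are hypothetical.

```python
from datetime import datetime

# Hypothetical timestamps for one machine between two jobs.
events = [
    ("job_finished", datetime(2015, 10, 1, 12, 0, 0)),
    ("reboot_started", datetime(2015, 10, 1, 12, 0, 20)),
    ("bootup_done", datetime(2015, 10, 1, 12, 2, 30)),
    ("runner_preflight_done", datetime(2015, 10, 1, 12, 3, 15)),
    ("buildbot_started", datetime(2015, 10, 1, 12, 3, 25)),
]


def stage_gaps(events):
    """Return (from_event, to_event, seconds) for each pair of consecutive events."""
    gaps = []
    for (prev_name, prev_ts), (name, ts) in zip(events, events[1:]):
        gaps.append((prev_name, name, (ts - prev_ts).total_seconds()))
    return gaps


gaps = stage_gaps(events)
# Time from the end of one job until buildbot is ready to take the next one.
turnaround = (events[-1][1] - events[0][1]).total_seconds()

for start, end, seconds in gaps:
    print(f"{start} -> {end}: {seconds:.0f}s")
print(f"total turnaround: {turnaround:.0f}s")
```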

Known bugs

  • We are currently experiencing lags introduced by the masters. Ideas to reduce them:
    • reduce the number of active jobs running on a master
    • reduce the number of buildbot steps
    • reduce output
      • this impacts step lag because log processing happens over the same channel as the start/stop commands
      • can we make mozharness not output to stdio and have log_uploader.py upload the Mozharness log and set log_url to it?
    • send logs back to the master in bigger chunks (fewer interruptions of the masters)
    • http://hg.mozilla.org/build/buildbotcustom/file/03644c855bb4/bin/log_uploader.py#l111
      • the data is somewhat structured already; that function serializes it out to the current format
  • bug 1209112 - Virtualenv cache always gets clobbered
  • bug 1208223 - We lack Mozharness metrics for test jobs (per-action)
  • We lack per-step Buildbot metrics
    • We have some data on pulse but we don't know the real elapsedTime
  • We don't have runner for Windows test jobs
    • This would move cleanup steps to before Buildbot starts up

Optimizations

Auditing

  • Evaluate which jobs can be combined or re-shuffled

Sources

Structured logs

buildbot_status    duration
exception              3473
failure             1353995
retry                107128
success           174430338
warnings            8688192
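
To put these totals in proportion, the share of each status can be computed directly from the table. The figures below are copied from the table above; the unit of `duration` is not stated there, so the sketch only reports relative shares.

```python
# Totals copied from the table above; the unit of "duration" is not stated.
duration_by_status = {
    "exception": 3473,
    "failure": 1353995,
    "retry": 107128,
    "success": 174430338,
    "warnings": 8688192,
}

total_duration = sum(duration_by_status.values())
share_by_status = {
    status: duration / total_duration
    for status, duration in duration_by_status.items()
}

# Print statuses from largest to smallest share.
for status, share in sorted(share_by_status.items(), key=lambda kv: -kv[1]):
    print(f"{status:<10} {share:8.3%}")
```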

InfluxDB

  • Admin console for the various InfluxDB databases
  • Buildbot master lags: hosted graphite dashboard
    • The master lag is calculated by measuring the reported time of one of the initial steps that should be nearly instantaneous
    • What is the impact on jobs?
  • Tree uptimes, end-to-end times, branch load, time per push: hosted graphite dashboard
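
The master lag calculation described above (timing an initial step that should be nearly instantaneous) could look roughly like the following. This is a sketch under stated assumptions, not the dashboard's actual query: the master names, the sample durations and `EXPECTED_STEP_SECONDS` are all hypothetical.

```python
from statistics import median

# Hypothetical reported durations (seconds) of an initial step that should be
# nearly instantaneous, grouped by master. Names and numbers are made up.
initial_step_durations = {
    "master-a": [0.4, 0.6, 12.0, 0.5, 9.5],
    "master-b": [0.3, 0.5, 0.4, 0.6, 0.5],
}

EXPECTED_STEP_SECONDS = 0.5  # assumed cost of the step itself


def estimate_lag(durations, expected=EXPECTED_STEP_SECONDS):
    """Per-sample lag: reported duration minus the expected cost, floored at 0."""
    return [max(0.0, d - expected) for d in durations]


lag_summary = {
    master: {
        "median": median(estimate_lag(durations)),
        "max": max(estimate_lag(durations)),
    }
    for master, durations in initial_step_durations.items()
}

for master, stats in lag_summary.items():
    print(f"{master}: median lag {stats['median']:.1f}s, max lag {stats['max']:.1f}s")
```

Reporting both median and maximum helps separate a master that is consistently slow from one that only stalls occasionally.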