Sheriffing/How To/Mobile Trees/Firebase Test Lab


Sheriffing Firebase Test Lab

A playbook for monitoring Firebase Test Lab outages


Firebase Test Lab is a live, cloud-based app testing service that allows developers to test their mobile applications on a variety of devices and configurations. As a live service, it is subject to outages and device failures, which can occur due to various factors such as traffic, device issues, or infrastructure problems.

Device outages can happen at any time, impacting the availability of certain devices or configurations for testing. These outages are a common challenge for online services like Firebase Test Lab, as they rely heavily on a number of devices and infrastructure components.

It is essential for users like us to be aware of the potential for outages, to recognize them when they happen, and to have a plan for handling them, such as this playbook. By understanding the nature of live services and planning for disruptions, developers and sheriffs can manage their testing processes and minimize the impact of outages on the overall development cycle.

Our Flank configurations for the Android projects are set up to re-run a test if it fails. In addition, our UI test tasks re-run when Test Lab returns an inconclusive result.

Firebase Test Lab Outage - Investigation (Test Ops)

When investigating sudden test failures to determine whether they are the result of an outage, Mobile Test Ops will take the following steps:

  1. Verify the outage

First, verify that there is a Firebase Test Lab outage by checking the Firebase status dashboard.

  2. Examine test results

Look for patterns in the test failures, such as multiple tests failing simultaneously or tests failing with similar error messages. These patterns may indicate that the failures stem from an infrastructure issue rather than from individual test cases or a recent code regression. We also check whether devices are stuck in a pending state, which can point to traffic problems.
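
The pattern check described above can be sketched in a few lines of Python. This is only an illustration, not part of our tooling: the failure records, test names, device names, error messages, and the 50% threshold are all hypothetical.

```python
from collections import Counter


def dominant_error(failures, threshold=0.5):
    """Return the most common error message if it accounts for at least
    `threshold` of all failures; otherwise return None."""
    if not failures:
        return None
    counts = Counter(error for _test, _device, error in failures)
    message, count = counts.most_common(1)[0]
    if count / len(failures) >= threshold:
        return message
    return None


# Hypothetical failure records: (test name, device, error message).
failures = [
    ("SmokeTest.openTab", "device-a", "Device unavailable"),
    ("HistoryTest.clearAll", "device-a", "Device unavailable"),
    ("SettingsTest.toggleTheme", "device-b", "Device unavailable"),
    ("SearchTest.suggestions", "device-a", "AssertionError: expected 3 items"),
]

print(dominant_error(failures))  # Device unavailable
```

A single message shared by most failures across unrelated tests suggests an infrastructure problem; a spread of unrelated messages points back toward individual tests or a code regression.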

  3. Retry the tests

The test task configuration retries automatically when Test Lab returns an inconclusive result, which indicates an infrastructure problem. Run the tests again to see whether the failures are consistent and reproducible. If they persist, this may indicate an outage or infrastructure issue; check the Firebase status dashboard again.
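
As a rough illustration of the retry-on-inconclusive behaviour, the following Python sketch re-runs a task until it returns a conclusive outcome or the attempts run out. The `run_once` callable, the outcome strings, and the limit of three attempts are assumptions made for this example; they do not describe the actual task configuration.

```python
def run_with_retries(run_once, max_attempts=3):
    """Call run_once() until it returns something other than "inconclusive"
    or the attempts are exhausted. Returns (outcome, attempts_used)."""
    outcome = "inconclusive"
    for attempt in range(1, max_attempts + 1):
        outcome = run_once()
        if outcome != "inconclusive":
            return outcome, attempt
    return outcome, max_attempts


# Simulated run: two inconclusive results followed by a pass.
outcomes = iter(["inconclusive", "inconclusive", "passed"])
result = run_with_retries(lambda: next(outcomes))
print(result)  # ('passed', 3)
```

If every attempt comes back inconclusive, the failure is unlikely to be caused by the test itself, and the investigation should continue with the steps below.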

  4. Monitor communications

Mobile Test Ops will start a thread on relevant communication channels for updates and discussions related to potential outages or issues affecting Firebase Test Lab.

  5. Contact Firebase Community Slack

If the issue persists and we suspect an outage or infrastructure problem, Mobile Test Ops will start a thread on Firebase Community Slack and ask the Firebase Test Lab team for assistance.

  6. Disable test(s)

When a single test or a small number of tests are failing, isolate the problematic test(s) and decide whether to disable them temporarily, for example when the failures appear to be related to elevated failure rates on devices. If the failure is likely due to external factors (e.g., a service disruption on Test Lab), disabling the test is an appropriate course of action. Base the decision on the severity of the issue, the impact on the overall test suite, and the estimated time required to resolve the problem.

Mobile Test Ops and sheriffs should communicate the decision. We will inform team members about the decision to disable the test, the reason behind it, and any necessary follow-up actions (e.g., fixing the issue or waiting for an outage to be resolved). This ensures that everyone is aware of the situation and can plan their work accordingly.

Mobile Test Ops will keep track of the disabled test(s) and re-enable them once the issue has been resolved. This ensures that the test coverage remains comprehensive and that the disabled test does not get forgotten.

Either Mobile Test Ops or the sheriffs on duty can disable a test temporarily using the @Ignore annotation in Android tests. Once a test is disabled, a bug should be filed in Bugzilla.
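
To help keep track of disabled tests so they are not forgotten, a small script can list the test functions currently marked with @Ignore. The Python sketch below is a simplified illustration, not an existing tool: it assumes Kotlin test sources, handles only function-level annotations, and does not cope with reasons that contain a closing parenthesis. The sample class and test names are hypothetical.

```python
import re

# Matches @Ignore (with an optional reason), then any further annotations,
# then a Kotlin function declaration. Class-level @Ignore is not handled.
IGNORED_TEST = re.compile(
    r'@Ignore(?:\((?P<reason>[^)]*)\))?\s*'
    r'(?:@\w+(?:\([^)]*\))?\s*)*'
    r'fun\s+(?P<name>\w+)'
)


def find_ignored_tests(source):
    """Return (test name, reason) pairs for every ignored test in `source`."""
    return [
        (m.group("name"), (m.group("reason") or "").strip('"'))
        for m in IGNORED_TEST.finditer(source)
    ]


# Hypothetical Kotlin test source used only for demonstration.
sample = '''
class HistoryTest {
    @Ignore("Test Lab outage, see Bugzilla")
    @Test
    fun clearHistoryTest() {}

    @Test
    fun openHistoryItemTest() {}
}
'''

print(find_ignored_tests(sample))
# [('clearHistoryTest', 'Test Lab outage, see Bugzilla')]
```

Putting the reason and the bug reference in the annotation itself makes it easier to decide later which tests can be re-enabled.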

Firebase Test Lab Outage - Infrastructure

When a Firebase Test Lab outage occurs, follow the steps outlined below to handle the situation.

  1. Verify the outage

First, verify that there is a Firebase Test Lab outage by checking the Firebase status dashboard. If the dashboard confirms an outage, proceed to the next step.

  2. Notify the team

If the outage has not already been communicated, inform Mobile Test Engineering and the Mobile Android Team by sending a message in Slack. Include details about the outage if available.

  3. Monitor the situation

Keep an eye on the Firebase status dashboard for updates on the outage. Firebase will typically provide information about the cause of the outage and an estimated time for resolution. Additionally, Mobile Test Engineering will monitor the #test-lab channel on Firebase Slack for updates and discussions related to the outage.
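
When following an outage over time, it can help to reduce the status information to the incidents that are still open and relevant to Test Lab. The Python sketch below illustrates the idea on hand-written records. The `title` and `end` fields are hypothetical and do not reflect the real format of the Firebase status dashboard.

```python
def open_test_lab_incidents(incidents):
    """Return incidents that mention Test Lab and have no end time yet."""
    return [
        incident
        for incident in incidents
        if "test lab" in incident.get("title", "").lower()
        and incident.get("end") is None
    ]


# Hypothetical incident records, written by hand for this example.
incidents = [
    {"title": "Test Lab: elevated device errors", "end": None},
    {"title": "Test Lab: delayed results", "end": "2023-01-01T12:00:00Z"},
    {"title": "Hosting: deploy latency", "end": None},
]

for incident in open_test_lab_incidents(incidents):
    print(incident["title"])  # Test Lab: elevated device errors
```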

  4. Update the team on the outage status

As the situation evolves, keep the teams updated on the status of the outage, including any progress updates from Firebase and confirmation once the outage has been resolved.