TestEngineering/Performance/Triage Process

From MozillaWiki

Triage Workflow

Triage Duty

Your main goal during triage duty is to make sure bugs are labelled appropriately and quickly based on recent bug activity. This might mean checking bug activity once a day, perhaps doing some minimal investigation, and then updating the bug's priority, severity, product, status, need-info, etc.

See Useful Queries.

  • Triage incoming bugs as early as possible or at least once a day.
  • Intermittent failures:
    • Only investigate an intermittent failure if it has happened more than once.
    • Glance over the failure details, and if the first comment contains incomplete information, add the relevant part of the log as a new comment.
    • If it’s a duplicate bug, mark it as such; if it is not related to the component, move it immediately to the correct one.
    • Intermittent failures should have a priority of P5 by default. If a failure needs immediate investigation and a fix, set a priority of P2 and find an owner.
    • On Monday, the triage owner or person on triage duty goes through all the bugs that were updated by the intermittent failures bot. If there is a top-occurring failure, make sure to assign the bug to someone familiar with the affected code. You can simply ignore failures that happened less often (for example, fewer than 10 times in the last week).
  • Untriaged bugs:
    • Bugs without a priority set should move to P3 by default, which means they will be fixed at some point. Only set P2 if the bug blocks current OKRs.
  • Mentored bugs:
    • It's generally up to the bug mentor to keep these bugs in good shape. Feel free to need-info the mentor if you have any doubts.
    • Set needinfo on the most recent contributor if they haven't replied for more than a week.
    • Never set a contributor as assignee. This will be done automatically by Phabricator when the initial patch gets submitted. Reset the assignee and set the bug to new if no further response comes in within a week.
    • Leave the priority as is and don't change it to P1 if such a bug gets assigned.
  • If it is not clear how to proceed on the bug, or if further input is necessary from stakeholders, add the whiteboard entry [perftest:triage]. Those bugs will be discussed in the next triage meeting.
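
The default priorities in the list above can be sketched as a small decision function. This is only an illustration of the rules as written, not a tool the team uses: the `Bug` class and its fields are hypothetical, and the threshold of 10 weekly failures is taken from the rough figure given for the Monday pass.

```python
from dataclasses import dataclass

# Rough cut-off from the list above: intermittents seen fewer than
# 10 times in the last week can usually be ignored.
WEEKLY_IGNORE_THRESHOLD = 10


@dataclass
class Bug:
    # Hypothetical, simplified view of a bug; not a Bugzilla data structure.
    is_intermittent: bool = False
    needs_immediate_fix: bool = False
    blocks_current_okr: bool = False
    failures_last_week: int = 0


def default_priority(bug: Bug) -> str:
    """Return the default priority suggested by the triage rules above."""
    if bug.is_intermittent:
        # Intermittents default to P5 unless they need an immediate fix.
        return "P2" if bug.needs_immediate_fix else "P5"
    # Untriaged bugs default to P3 unless they block current OKRs.
    return "P2" if bug.blocks_current_okr else "P3"


def worth_reviewing_on_monday(bug: Bug) -> bool:
    """True if an intermittent failed often enough to need an owner."""
    return bug.is_intermittent and bug.failures_last_week >= WEEKLY_IGNORE_THRESHOLD
```

Anything the function cannot decide (unclear next steps, missing stakeholder input) still belongs in the triage meeting via the [perftest:triage] whiteboard entry.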

Review Queue

To ensure that we're responding to review requests in a timely manner, the #perftest-reviewers group is triaged once a day. This involves tracking the number of open review requests and assigning a team member to be responsible for each review.

  • Open the Reviews sheet in this spreadsheet.
  • Click the green + button to create a new row with today's date.
  • Open the FxPerfTest dashboard in Phabricator.
  • Count the total number of reviews in the perftest tab and enter this into the sheet under the Total column.
    • You may need to switch away from the perftest tab and back to work around an issue where all tab contents are initially displayed.
  • Count the number of reviews that were last updated more than 24 hours ago and enter this into the sheet under the >24hrs column.
  • Count the number of reviews that were last updated more than 48 hours ago and enter this into the sheet under the >48hrs column.
  • Check that the <24hrs column is automatically populated based on the values entered.
  • Check that the 24-48hrs column is automatically populated based on the values entered.
  • Check that the Review Queue chart reflects the new values.
  • For any review that only has #perftest-reviewers as the reviewer, assign a team member as a blocking reviewer.
    • The reviewer should be the next team member in rotation to balance the load across the team. However, this may not be desirable if many large reviews are building up on an individual. Use the team member tabs to understand the review queue for individuals, and use your best judgement.
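
As a sanity check on the automatically populated columns, the arithmetic can be sketched as below. The spreadsheet's actual formulas are not shown here, so this assumes that the >24hrs count includes the reviews that are also older than 48 hours, as the wording of the steps suggests. The function name `review_buckets` is hypothetical.

```python
def review_buckets(total: int, over_24h: int, over_48h: int) -> dict:
    """Derive the <24hrs and 24-48hrs columns from the three entered counts.

    Assumes over_24h already includes the reviews counted in over_48h.
    """
    if not (0 <= over_48h <= over_24h <= total):
        raise ValueError("expected 0 <= over_48h <= over_24h <= total")
    return {
        "<24hrs": total - over_24h,
        "24-48hrs": over_24h - over_48h,
        ">48hrs": over_48h,
    }
```

If the values in the sheet differ from this calculation, the entered counts are worth double-checking before trusting the chart.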

Queries
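
A minimal sketch of how a query for untriaged bugs could be expressed against the Bugzilla REST API. It only builds the URL and does not send a request. The product and component in the usage note are assumptions for illustration, and the function name `untriaged_bugs_url` is hypothetical; on bugzilla.mozilla.org an unset priority is stored as "--" and an open bug has the resolution "---".

```python
from urllib.parse import urlencode

BUGZILLA_REST_BUG = "https://bugzilla.mozilla.org/rest/bug"


def untriaged_bugs_url(product: str, component: str) -> str:
    """Build a REST search URL for open bugs that have no priority set."""
    params = {
        "product": product,
        "component": component,
        "priority": "--",      # no priority set yet
        "resolution": "---",   # still open
        "include_fields": "id,summary,priority,severity",
    }
    return f"{BUGZILLA_REST_BUG}?{urlencode(params)}"
```

For example, `untriaged_bugs_url("Testing", "Raptor")` would list open, unprioritised bugs in that component, assuming that is the component you triage.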

Triage Duty versus Triage Owner

  • Every bug component has a Triage Owner. This is an ongoing, long-term role.
  • Anyone on the team may be assigned to Triage Duty. This is a short-term role that involves monitoring incoming bugs on a daily basis.
  • The triage team decides who is on triage duty until the next triage meeting, which means triage duty usually rotates on a weekly basis.

Bugs being worked on

We want to make it clear when a bug is actively being worked on, and make it easy for people to pick up any available work.

  • When you start working on a bug (you start implementing a fix), set its priority to P1 and assign yourself.
  • If you stop working on a bug or when it is blocked by another one, reset the priority to its original value, and unassign yourself.
  • If you want to indicate that you plan to work on a bug soon, set a need-info to yourself on the bug with a short comment about your plans (examples: "I will work on this after Bug xyz is done" or "I will start the implementation next week").
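
The start and stop rules above amount to a small state change that remembers the original priority. The sketch below is illustrative only: `BugState`, `start_work`, and `stop_work` are hypothetical names, and in practice these fields are edited by hand in Bugzilla.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BugState:
    # Hypothetical in-memory stand-in for the relevant Bugzilla fields.
    priority: str
    assignee: Optional[str] = None
    original_priority: Optional[str] = None


def start_work(bug: BugState, who: str) -> None:
    """Mark a bug as actively worked on: P1 plus an assignee."""
    if bug.original_priority is None:
        # Remember the priority so it can be restored later.
        bug.original_priority = bug.priority
    bug.priority = "P1"
    bug.assignee = who


def stop_work(bug: BugState) -> None:
    """Release a bug: restore the original priority and unassign."""
    if bug.original_priority is not None:
        bug.priority = bug.original_priority
        bug.original_priority = None
    bug.assignee = None
```

Bugzilla does not store the original priority in a dedicated field, so when doing this by hand the bug history is the place to look it up.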

Priorities

  • P1 - This bug represents an OKR or an important intermittent failure, and has an assignee working on an implementation
  • P2 - This bug represents an OKR or an important intermittent failure, but no-one is working on it at the moment
  • P3 - This bug will be fixed eventually (non-OKR, mentoring)
  • P4 - Not used (reserved for bots)
  • P5 - Used for intermittent failures, or no intention to fix but will accept patches

Strategies for triaging intermittents

  • Look for patterns in Treeherder's intermittent failures view (platforms, build types, tree, etc.). This is also linked to in the Orange Factor field on each bug.
    • Treat the associations between failure logs and bugs with a grain of salt, as Code sheriffs may occasionally misattribute some failure logs.
    • E.g. if 90% of failures happen on Android, and the rest on some desktop platforms, there’s a chance that the desktop failures were incorrectly assigned.
  • Recognise and mark duplicates as early as possible (example)
  • Use generic intermittent bugs when available (example)
    • Simply ask a Code sheriff to group them (a needinfo? + some guidelines should suffice)
    • Use this when you have lots of bugs covering the exact same underlying issue
      • Pick the oldest bug
      • Replace parts of the bug summary with <random>
      • Use this only for common patterns you notice
      • There are some risks involved here, especially if we’re not entirely sure about the underlying problem. Any mistake could hide other Raptor regressions.
      • Making this too generic can increase the failure rate for what seems to be a common culprit. Code sheriffs will then have more reasons to turn our tests off.
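
The platform pattern from the first strategy can be sketched as a quick count. This assumes you have already collected one platform name per failure (for example, by reading them off the intermittent failures view); the function name `suspicious_platforms` and the 0.9 default are hypothetical and simply mirror the 90% example above.

```python
from collections import Counter
from typing import Iterable, List


def suspicious_platforms(failure_platforms: Iterable[str],
                         dominant_share: float = 0.9) -> List[str]:
    """Return the minority platforms when one platform dominates the failures.

    A non-empty result is only a hint that those failures may have been
    misattributed; it is not proof.
    """
    counts = Counter(failure_platforms)
    total = sum(counts.values())
    if total == 0:
        return []
    dominant, dominant_count = counts.most_common(1)[0]
    if dominant_count / total < dominant_share:
        # No single platform dominates, so nothing stands out.
        return []
    return sorted(platform for platform in counts if platform != dominant)
```

Any platform it returns still needs a look at the actual logs before you ask a Code sheriff to reclassify the failures.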

How to handle intermittent bugs which cover a crash of Firefox?

First, make sure the bug doesn't stay in the Testing component but gets moved to a product that covers Firefox, because crashes relate to problems in Firefox itself and not in the test harness. To do that, check the crashing thread and find the reported crash frame as listed in the summary of the bug. From there, find the first frame that is part of our code. Also check for the following:

  • For a header file (.h) you can most likely continue to the next frame
  • If it is inter-process communication (IPC) related, remember that various components make use of it, so check higher in the stack to see which code calls into the IPC code.
  • For allocation issues (like OOM) also find the appropriate caller

If not done yet, also add a comment with the link to the exact crash location. Make sure to keep the changeset id in the URL.
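
The frame checks above can be sketched as a simple filter over the crashing thread's stack. This is a heuristic under stated assumptions, not an established tool: frames are represented as source paths with the crashing frame first, and the `ipc/` and `memory/` path prefixes are assumed markers for IPC and allocation code.

```python
from typing import Optional, Sequence

# Assumed markers; adjust them to what the stack in front of you shows.
SKIP_SUFFIXES = (".h",)                # header files: continue to the next frame
SKIP_PREFIXES = ("ipc/", "memory/")    # IPC and allocation code: look at the caller


def first_candidate_frame(frames: Sequence[str]) -> Optional[str]:
    """Return the first frame that is not a header, IPC, or allocation frame.

    `frames` lists source paths from the crashing frame (index 0) up the stack.
    """
    for path in frames:
        if path.endswith(SKIP_SUFFIXES):
            continue
        if path.startswith(SKIP_PREFIXES):
            continue
        return path
    return None
```

The result is only a starting point for choosing a component; the real cause may still sit further up the stack.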

  • Figure out the right component
    • Don’t rush and assume that the first frame of the crashing thread is the culprit, especially if its corresponding source code points to a header (*.h) file.
    • If the first frame is indeed not the culprit, move on to the next frame in the logs.
    • Note: most often, this is not a trivial task. Even if you end up in another source file, it’s still very likely that the problem happens a bit further up the stack. If you get blocked, request an engineer’s assistance and learn from their process.
  • How do you know which engineer to ask for assistance?
    • Look over the source file you got stuck at and figure out its component (use Mercurial’s blame feature). With that component, you can identify the team that likely knows more about the problem. Contact the team and ask someone there to assist you.
    • You can also search for the associated file name in searchfox, and find the corresponding Bugzilla component within the nearest moz.build file.
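
With a local checkout, the moz.build lookup can be sketched as a walk up the directory tree. This is a simplification: a moz.build file can set different BUG_COMPONENT values for different `Files(...)` patterns, and this sketch just returns the first one it finds, so treat the result as a hint. The function name `nearest_bug_component` is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Matches lines such as: BUG_COMPONENT = ("Product", "Component")
BUG_COMPONENT_RE = re.compile(
    r"""BUG_COMPONENT\s*=\s*\(\s*["']([^"']+)["']\s*,\s*["']([^"']+)["']\s*\)"""
)


def nearest_bug_component(source_file: Path,
                          repo_root: Path) -> Optional[Tuple[str, str]]:
    """Find (product, component) in the closest moz.build above source_file."""
    directory = source_file.resolve().parent
    root = repo_root.resolve()
    while True:
        candidate = directory / "moz.build"
        if candidate.is_file():
            match = BUG_COMPONENT_RE.search(candidate.read_text(encoding="utf-8"))
            if match:
                return match.group(1), match.group(2)
        # Stop at the repository root (or the filesystem root as a safety net).
        if directory == root or directory == directory.parent:
            return None
        directory = directory.parent
```

If the nearest moz.build has no BUG_COMPONENT, the walk continues into parent directories, which matches the idea of looking for the nearest file that declares one.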

FAQ

Can I have many bugs assigned to me?

Yes, it's possible to be actively working on several bugs at a time. The definition of "actively working" is loose; you can use your intuition.

When should I unassign myself from a bug that I have started to work on?

As mentioned above, the definition of "actively working" on a bug is loose, so you can use your intuition. If you notice that you won't be making any progress on the bug this week, that probably means that your attention is focused on other work or something else is blocking progress. In that case, unassign yourself and reset the priority to its original value.