Mobile/Testing/04 24 13


Previous Action Items

  • Gbrown to follow up with the Tree Sheriffs to get robocop tests unhidden again, now that the strategic test disabling appears to be done
    • Follow-on: once the pandas are re-wired, we'll push a try job re-enabling those tests to see whether they were causing reboots by driving up CPU activity and triggering a power spike.
    • -> Panda rc is no longer hidden
    • -> Disabled rc tests still cause failures if enabled
  • Jake and Kim will have all the pandas upgraded with new power infrastructure by Monday
  • Dan will let us know at the next meeting where we stand w.r.t. the amount of work estimated to replace tegras with pandas running 2.3.x.
    • Follow-on: once we know that, Joduinn and I (ctalbert) will need to talk with Karen and Blassey about their projected timelines for EOL'ing 2.2 support.

Status reports

Dev team

  • Found a cause of the "2400 seconds without output" failures (bug 663657)

Rel Eng

  • (kmoir) Brought down masters to facilitate chassis maintenance. Mozpool/mozharness work for Android pandas.

IT

  • Still working on a higher-density chassis; just waiting for the prototype chassis to be fabricated.
  • bug 860028 - Replacing the 5V supply wire and adjusting the power supply output in the panda chassis in scl1 - COMPLETED

A Team

General

  • tegra failure rate: [7.00%]
    • tpn, r4, m1
  • panda failure rate: [14.02%]
    • m2, rc1, rc2, j3, talos-s
  • I am seeing very little change in the frequency of these bugs:
    • bug 822321 - Intermittent Panda "Could not connect; sleeping for 5 seconds. reconnecting socket"...
      • tegra M1, panda rc1, rc2 <- top failure listed above
    • bug 663657 - Intermittent Android "command timed out: 2400 seconds without output, attempting to kill"
      • panda m2, rc2 <- top failure listed above
    • bug 807230 - Intermittent DMError: Automation Error: Timeout in command {ls,ps,isdir,mkdir}, ...
      • doesn't happen in Talos, but is evenly distributed across reftest/mochitest/robocop
  • the above bugs should have been reduced with the wiring change.
  • investigating "rogue" pandas
    • during the smoketests to validate the wiring change, we saw about 10% of the pandas being problematic. Average panda failure rates were 1-5%, but these "rogue" pandas were at 7-15%.
    • running just those pandas standalone yielded the same results as running them with all the other pandas
    • total smoketest failure rate was 4.5%; with the ~10% of rogue pandas excluded it was 3.1%.
    • How can we detect these?
    • proposal (a rough sketch follows at the end of this section):
      • the panda has run at least 20 jobs in the last 48 hours
      • the panda has had >=2 failures in the last 48 hours
      • safeguard: if we flag >15% of the pool, alert somebody in case there is an infra outage or a few bad builds
      • remediate: pull the panda, reflash it, reseat the SD card
      • correction: if a panda is "remediated" 3 times in 30 days, change the SD card
      • dead: if we hit the correction stage 3 times for a given panda, throw away the board
  • collecting network traffic using wireshark might help us to distinguish between connectivity issues due to reboots and other possible connectivity problems
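
A minimal Python sketch of the "rogue" panda detection proposal above, assuming we can query recent job results per panda. The function, the (panda_id, passed) job tuples, and the pool_size argument are illustrative assumptions rather than part of mozpool or any existing tool; the thresholds are just the numbers from the proposal.

    from collections import defaultdict

    # Thresholds taken from the proposal above; still open for discussion.
    MIN_JOBS = 20               # only judge pandas that ran enough recent jobs
    MIN_FAILURES = 2            # >= 2 failures in the window flags a panda
    POOL_ALERT_FRACTION = 0.15  # > 15% of the pool flagged -> suspect infra or bad builds

    def flag_rogue_pandas(jobs_last_48h, pool_size):
        """jobs_last_48h: iterable of (panda_id, passed) tuples.

        Returns the flagged panda ids, or None when so many are flagged that a
        human should look for an infra outage or a run of bad builds instead."""
        totals = defaultdict(int)
        failures = defaultdict(int)
        for panda_id, passed in jobs_last_48h:
            totals[panda_id] += 1
            if not passed:
                failures[panda_id] += 1
        flagged = [p for p in totals
                   if totals[p] >= MIN_JOBS and failures[p] >= MIN_FAILURES]
        if pool_size and float(len(flagged)) / pool_size > POOL_ALERT_FRACTION:
            return None  # safeguard: escalate to a person instead of auto-remediating
        return flagged

The same per-panda bookkeeping could also feed the remediate/correction/dead escalation steps by recording how many times each panda has been remediated in the last 30 days.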

Android 2.3.5

  • Current status is at: bug 859766.
  • Largest issue seems to be timeouts, possibly due to losing focus
    • a patch for this, aimed at B2G, landed recently; I will retest and see if things have improved
  • Need to discuss prioritization / timelines with respect to other tasks.
    • estimate 3 months of work to stand up 2.3.5 on pandas, with another quarter or so of bug fixes / maintenance

x86 automation

  • I am running through the mochitests to get a rough idea of how stable the emulator is
    • I do see some timeouts and occasional process crashes. I'm planning to rerun some of this on an actual phone to help determine whether this is an emulator issue or a product stability issue

Autophone

  • [bc] Adding one additional Samsung GS II and two additional GS III phones.
  • [bc] bug 862456 Security Review for Phonedash
  • [bc] Testing throbber start performance with the original Fennec launch code vs. mozbase's launchFennec, with and without the -W parameter to am.
  • [bc] Planning to investigate using standard deviation to gate retests in an attempt to reduce jitter (sketched below).
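
A rough sketch of the standard-deviation gate mentioned in the last item, assuming we have the throbber-start replicates for a run in hand; the function name and the 10% cutoff are placeholders, not Autophone's actual code.

    import math

    def needs_retest(measurements, max_relative_stddev=0.10):
        """Return True when the replicates look too noisy and the run should be redone.

        Gates on the coefficient of variation (stddev / mean); the 10% cutoff is a
        placeholder that would need tuning against real throbber-start data."""
        if len(measurements) < 2:
            return True  # not enough data to judge the noise
        mean = sum(measurements) / float(len(measurements))
        variance = sum((m - mean) ** 2 for m in measurements) / (len(measurements) - 1)
        return mean > 0 and math.sqrt(variance) / mean > max_relative_stddev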

Eideticker

Round Table

  • should we disable tests that are hard to fix and known to cause a lot of failures?
    • specifically webgl!
  • TBPL starring sometimes posts both a process-crash bug and a timeout/connectivity bug, even though all the tests have completed
    • should we fix this?
    • should we detect that the harness has completed and then report only shutdown failures? (see the sketch after this list)
    • other ideas?
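
One way the harness-completion idea above could work, sketched here: scan the log for the harness's end-of-run marker and, when it is present, report later crashes or connectivity errors as shutdown-only failures. The marker strings and labels are assumptions for illustration, not what the TBPL log parser actually emits.

    # Hypothetical classifier; the marker strings are examples, not the real parser's.
    HARNESS_END_MARKERS = ("SUITE-END", "Running tests: end.")

    def classify_failing_log(log_lines):
        """Label a failing log as a 'test failure' or a 'shutdown failure'.

        If the harness printed its end-of-run marker, the tests themselves
        completed, so any later crash or timeout is only a shutdown problem."""
        harness_finished = any(
            marker in line for line in log_lines for marker in HARNESS_END_MARKERS)
        return "shutdown failure" if harness_finished else "test failure"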

Action Items

  • (jmaher) explain your round table items
  • (ctalbert) get kim a known good build
  • (kim) run tests over the weekend
  • (ctalbert) to email bad news
  • (ateam) split out webgl from mochitest-1
  • (wlach) to add stock and chrome to the new test
  • (blassey) follow up with Karen to get the 2.2 end-of-life plan