Auto-tools/Projects/Signal From Noise/Meetings/2013-04-11
Please view live action play by play notes on [etherpad]
Previous Action Items
- [jmaher] - add an email alias through which we can all receive emails from datazilla while we investigate this.
- [jmaher/kyle] - start work on refining automated regression detection
- [jeads] - initial v1 of the timeline/push-based history graphs for overall job status and page status.
- BenWa has proposed adding per-test thresholds to the tree; anything outside the threshold would turn orange and require a perf sheriff to adjust the value. (carry over)
- we should still consider looking at the hardware pool
- Consider splitting hardware pool
- Consider running talos on a few builds on a set of machines over and over to see what noise metrics we get (see how much noise is based on differences in machines versus in the build and install of the product)
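The repeated-run experiment in the last item could be summarized roughly as follows. This is a minimal sketch, not an agreed method: the function name, the machine names, and the choice of population variance as the noise metric are all assumptions made for illustration.

```python
from statistics import mean, pvariance

def split_noise(runs_by_machine):
    """Split noise for one build into within-machine and between-machine parts.

    runs_by_machine: dict mapping a (hypothetical) machine name to the list of
    repeated talos results that machine produced for the same build.
    """
    # Within-machine noise: average variance of the repeated runs on each machine.
    within = mean([pvariance(values) for values in runs_by_machine.values()])
    # Between-machine noise: variance of the per-machine mean results.
    between = pvariance([mean(values) for values in runs_by_machine.values()])
    return within, between

# Made-up numbers, only to show the shape of the input.
within, between = split_noise({"machine-a": [10, 12], "machine-b": [20, 22]})
```

A between-machine figure much larger than the within-machine figure would suggest the hardware pool, rather than the build and install, is the main source of noise.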
Overview of tp5o and next steps
- displaying the history for the page over time
- example of current UI: https://datazilla.mozilla.org/talos/summary/Mozilla-Inbound-Non-PGO/91b7d9a8c226?product=Firefox&branch_version=22.0a1
- example of useful data (timecourses example that jeads is talking about): http://people.mozilla.com/~jeads/summary.html#
- need to cache data in real time
- how realistic is it to have the data in real time before the buildbot job ends (see below)
What the sheriffs need to determine the pass/fail nature of the talos performance tests
- First, how much can we (sheriffs) determine at job result time? If buildbot itself reported success/failure, that would make things much easier (it works with the existing workflow).
- Jeads/Jmaher: Where you post the data structure, the database would need to be queried and would come back with the information as a response to your post. That's not unreasonable, but it won't scale to the large number of talos machines in the automation.
- The issue here is that we don't want to trigger index performance data queries at the same time as you're doing the insert of data.
- Really, this comes down to the way the current tbpl is architected. In Treeherder, we can post the results back to Treeherder asynchronously to the proper job guid and thus provide the same work flow.
- The thing to think about is that it is two different processes -- there is the ingestion process to take in the data, a separate calculation process to calculate the metrics and we can control that load.
- Ctalbert idea for current tbpl: can we make buildbot do a second step that queries whether its results are ready before it finishes posting the result?
- what about pulse to send results to the buildbot database
- [edmorley] - let's just send an email to dev.tree-management from datazilla when it processes the results - we can initially send to just sheriffs & once reliable, switch over to dev.tree-management. Bonus of this approach is that we can adapt the email push notifications to push to treeherder later on with minimal wasted effort (vs working on pollers as part of the talos run, or re-architecting TBPL, which is soon to be EOL). We can then switch testsuites over one by one from the graphs.m.o regression email script (https://hg.mozilla.org/graphs/file/tip/server/analysis/analyze_talos.py) to the new datazilla emails.
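The separation discussed above (ingestion on one side, metric calculation on the other, with results looked up later by job guid) can be sketched in memory as below. This is only an illustration under stated assumptions: the function names are hypothetical, nothing here is datazilla's or Treeherder's actual API, and a plain mean stands in for whatever metric is really computed.

```python
import queue

pending = queue.Queue()   # raw results waiting for metric calculation
metrics_by_guid = {}      # filled in later by the calculation step

def ingest(job_guid, values):
    """Accept a posted result and return at once; no metric queries run here."""
    pending.put((job_guid, values))

def calculate_pending(max_items):
    """Separate step: process at most max_items, so the load stays controllable."""
    done = 0
    while done < max_items and not pending.empty():
        job_guid, values = pending.get()
        # Placeholder metric (assumption): the mean of the posted values.
        metrics_by_guid[job_guid] = sum(values) / len(values)
        done += 1
    return done

def poll(job_guid):
    """What a second buildbot step or a notifier could ask: is the result ready?"""
    return metrics_by_guid.get(job_guid)  # None until calculation has run
```

Because `ingest` never touches the metrics, inserts and performance queries do not compete, and the notification (email now, a post to Treeherder later) can be triggered from the calculation step.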