EngineeringProductivity/Projects/ActiveData

767 bytes added, 22:31, 17 February 2016
change intro, add dependencies and users
= Overview =
ActiveData is a collection of about 8 billion records (Feb 2016) covering unit tests, Buildbot jobs, performance data, and Mercurial. This collection is publicly available and can be queried directly, similar to any database. ActiveData is built on top of ElasticSearch, a fast, distributed, redundant document store. ActiveData provides the benefits of familiar and succinct SQL by translating SQL-like queries to ElasticSearch queries.
== Problem ==
In order to improve our testing infrastructure we require data on how that infrastructure is performing. That information can be extracted from the raw logs, but that requires downloading samples, parsing the data, and inserting it into a database (or worse, writing queries in an imperative language, like Python). When we are done with an analysis, we have effectively built an ETL pipeline that does not scale and is too specific to be reused elsewhere. The next project does this work all over again.
 
== Solution ==
ActiveData will serve as a reusable ETL pipeline, annotating the test results with as much relevant data as possible. It also provides a query service to explore and aggregate the data, so minimal setup is required to access it.
 
= Design =
ActiveData attempts to provide the benefits of an available database to the public, except larger and faster.
== Limitations ==
The unittest data is limited to those test suites that generate structured logs. Currently (Feb 2016) the following do NOT have structured logs, and are NOT in ActiveData:
* jsreftest
* crashtest
* cppunittest
* any of the JS-based Gaia suites (e.g. Gij)
To check whether a structured log is being generated, click a job in Treeherder. Under the "Job details" pane at the bottom, look for a line similar to:
<blockquote> ''artifact uploaded: <suite>_raw.log'' </blockquote>
ActiveData is not meant to replace an application database. Applications often track significantly more data related to good interface design, process sequences, complex relations, and object life cycles.
ActiveData's simple model makes it difficult to track object life cycles and impossible to model many-to-many relations.
Data is not live, and definitely does not track "pending jobs" like Treeherder or TaskCluster do. Test results may take a day or more to be indexed.
= Dependencies / Who will use this =
== Dependencies ==
 
ActiveData's ETL pipeline ingests data from a variety of sources:
 
* Buildbot
* Mozharness
* Structured Logs
* TaskCluster (end of Q1 2016)
* PerfHerder
* hg.mozilla.org
 
== Users ==
 
ActiveData's primary goal is to support dashboards that give Mozilla useful perspectives into the large amount of data:
 
* Individual unit test results
* Buildbot test times
* Firefox compile times
* Recently added, removed, and disabled tests
* Buildbot wait times
= Let's Use It! =
The service listens at http://activedata.allizom.org/query and accepts queries in [https://github.com/klahnakoski/ActiveData/blob/dev/docs/Qb_Tutorial.md Qb format].
 curl -XPOST -d "{\"from\":\"unittest\"}" http://activedata.allizom.org/query
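
For automated clients, the same request can be made from a script. The following is a minimal sketch in Python using only the standard library. The endpoint and the <code>{"from": "unittest"}</code> query come from the curl example above; the function names <code>build_query_request</code> and <code>run_query</code> are hypothetical, and the sketch assumes the service replies with a JSON document.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
ACTIVEDATA_URL = "http://activedata.allizom.org/query"


def build_query_request(table, url=ACTIVEDATA_URL):
    """Build (but do not send) a POST request equivalent to the curl call.

    Hypothetical helper: the query body is just {"from": <table>}.
    """
    body = json.dumps({"from": table}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def run_query(table, timeout=30):
    """Send the query and decode the reply.

    Assumption: the service answers with a JSON document.
    """
    request = build_query_request(table)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling <code>run_query("unittest")</code> sends the request. Keeping request construction separate from sending makes the payload easy to inspect before any network traffic occurs.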
== The Query Tool ==
The ActiveData service is intended for use by automated clients, not humans. The [http://activedata.allizom.org/tools/query.html Query Tool] is a minimal web page for humans to do some exploration and to test the phrasing of queries.
* [http://activedata.allizom.org/tools/query.html ActiveData QueryTool]
== Documentation ==
* [https://github.com/klahnakoski/ActiveData/blob/master/docs/Qb_Tutorial.md Simple tutorial]
* [https://github.com/klahnakoski/ActiveData/blob/master/docs/Unittest%20Schema.md Unit test results schema]
= Code =