Revision as of 22:03, 28 August 2013

Publicize ElasticSearch Cluster with Bugzilla Data

Objective

Have a publicly accessible ElasticSearch cluster containing all historical data on non-sensitive bugs. The hope is that this data will attract community interest and help uncover interesting trends.

History

Martin Best started the Bugzilla Anthropology Project, which created a need for dashboarding the vast repository of information contained in Bugzilla. Metrics has set up an ElasticSearch cluster with all the historical metadata on the bugs in Bugzilla. This includes data on security bugs. Mozilla now has views of this data [1].

Existing Code

We would like to build incrementally on the existing ETL [2]. Unfortunately, the complexity of the installation may reduce the number of interested community members. There are six languages and tools involved (counting the manual steps):

Manual Process - Running a few command-line functions to set up the index, and redirecting the alias pointer when done
bash - For the main ETL loop: grouping the bugs into groups of 10K
Spoon - A visual programming tool to connect ETL code snippets, data sources, and data sinks
SQL - To pull data from Bugzilla database directly
Javascript - Used to convert the fine-grained delta objects into bug version snapshot records
Java - Used to push the bug version records into the ElasticSearch cluster

New Design

The plan is to rewrite all code in Python.
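
To make the scope of the rewrite concrete, here is a minimal sketch of how the two core steps of the existing ETL (the 10K batching loop and the conversion of deltas into bug version snapshots) might look in Python. It is an illustration only, not the project's code: the function names, the record layout (`modified_ts`, `changes`), and the choice to replay deltas oldest-first are all assumptions.

```python
from copy import deepcopy

BATCH_SIZE = 10000  # the existing bash loop works in groups of 10K bugs


def batch_ranges(max_bug_id, size=BATCH_SIZE):
    """Yield half-open (start, end) bug-id ranges covering 0..max_bug_id."""
    for start in range(0, max_bug_id + 1, size):
        yield start, min(start + size, max_bug_id + 1)


def snapshots(created, deltas):
    """Turn one bug's creation record plus its deltas into full snapshots.

    Hypothetical layout: `created` is the bug as first filed, and each delta
    is {"modified_ts": <int>, "changes": {<field>: <new value>}}.
    Returns one complete record per version, oldest first.
    """
    state = deepcopy(created)
    versions = [deepcopy(state)]
    for delta in sorted(deltas, key=lambda d: d["modified_ts"]):
        state.update(delta["changes"])            # apply the changed fields
        state["modified_ts"] = delta["modified_ts"]
        versions.append(deepcopy(state))          # freeze this version
    return versions
```

For example, `list(batch_ranges(25000))` gives three ranges, the last one ending at 25001, and a bug with two deltas produces three snapshots. Because both functions are plain Python, intermediate states can be inspected in a debugger.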

Justification

The manual process only exists because the outer loop of the ETL is in bash, where it is difficult to perform a dynamic state analysis to fully automate the process. A full Python version will make it easier to probe the existing index state and set up indexes and aliases automatically.
Kettle (Spoon) is a 700 MB piece of software that is specialized for ETL jobs. Many of its features are not used in this project. The record-level functions provided by this tool must be learnt to fully interpret the other languages' code.
The heavy lifting is done by the main Javascript routine. This can be converted, line by line, to Python.
The SQL can be imported, without change, into the Python version.
The code to push rows to ES is significantly simpler in Python than in Java, and easier to maintain.
The current design has no debugging facilities; the Python version will allow stepping through code and easily inspecting intermediate states.
The Python version works better with source control; Spoon uses single large XML files which are not split by functional group (but admittedly are well behaved with source control).
The Python version makes it possible to add tests, with a familiar test framework, and verify correctness.
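
As a rough illustration of the points above about automating the alias redirect and pushing rows to ES, the sketch below builds the two request bodies involved as pure functions, which also makes them easy to test. It assumes the standard ElasticSearch `_aliases` and `_bulk` request formats as they existed at the time (including the `_type` field, which later ElasticSearch versions dropped). The function names, index names, and document ids are hypothetical, and no HTTP calls are made.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build the body for POST /_aliases that repoints `alias` in one request.

    `old_index` may be None when the alias does not exist yet.
    """
    actions = []
    if old_index is not None:
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


def bulk_body(index, doc_type, records):
    """Build a newline-delimited body for POST /_bulk.

    `records` is an iterable of (doc_id, source_dict) pairs. Each document
    becomes an action line followed by its source line.
    """
    lines = []
    for doc_id, source in records:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""


def test_alias_swap_removes_old_before_adding_new():
    # Runnable under a familiar framework such as pytest or unittest.
    body = alias_swap_actions("bugs", "bugs_v1", "bugs_v2")
    assert body["actions"][0] == {"remove": {"index": "bugs_v1", "alias": "bugs"}}
    assert body["actions"][1] == {"add": {"index": "bugs_v2", "alias": "bugs"}}
```

Keeping request construction separate from the network call is a design choice made for this sketch: the state-probing logic and the payloads can then be verified without a running cluster.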


Auto-tools/Projects/PublicES: Difference between revisions