* Private bug data leaking into public cluster
* ElasticSearch was not meant for direct public access, proxy added (https://bugzilla.mozilla.org/show_bug.cgi?id=879833)
= Update (January 9th, 2014) =
Over the past couple of months, security reviews have been completed and the suggested enhancements have been implemented. I also have some stories to tell about my trials and tribulations programming in Python 2.7.
== ETL Highlights ==
=== Alias Analysis ===
All bugs have a carbon copy (CC) list of users who are mailed when the bug changes. The historical record of this list is kept as a list of added and removed email addresses, with timestamps of course. An issue arises when a user changes their email address: the next change in the historical record will refer to the new email address, not the old one, looking something like this:
{|
|'''Time'''||'''Removed'''||'''Added'''||'''Resulting CC List'''
|-
|Jan 2nd|| ||klahnakoski@mozilla.com||klahnakoski@mozilla.com
|-
|Jan 3rd|| ||mcote@mozilla.com||klahnakoski@mozilla.com, mcote@mozilla.com
|-
|Jan 4th||kyle@lahnakoski.com|| ||mcote@mozilla.com
|}
As humans, we know what happened here: Kyle (me) changed his email address (somewhere between Jan 2nd and Jan 4th), and then removed himself from the CC list for the bug. The ETL script has no such domain knowledge, and simply sees an inconsistency. A naive rebuilding of the CC list history would have to assume '''kyle@lahnakoski.com''' was in the CC list since the beginning (which is a legitimate situation, but uncommon, for the first snapshot of a bug). In aggregate, with all these mismatches, the naive rebuilding of the historical record concluded that many bugs started with long CC lists that were eventually pared down over time to what currently exists. This pattern is quite the opposite of reality, where a bug usually starts with few people and the list grows.
I implemented an alias analysis that uses the inconsistency in the history: specifically, '''klahnakoski@mozilla.com''' was added to the CC list (+1) but does not exist in the current bug state (+0). '''mcote@mozilla.com''' was added (+1) and exists (+1), so the logic is consistent. We must conclude that the removal of '''kyle@lahnakoski.com''' (-1) matches the addition of '''klahnakoski@mozilla.com''' (+1), to zero net effect (0). Writing k1 for '''klahnakoski@mozilla.com''', m for '''mcote@mozilla.com''', and k2 for '''kyle@lahnakoski.com''', we are really solving a simple set of equations:
 + k1 + m - k2 == m
 => k1 - k2 == 0
 => k1 == k2
More complex cases simply involve more equations and more unknowns, and we solve the resulting system of equations. Luckily for us, this is not as hard as algebra class because we have far more equations than variables to solve for: finding a solution does not require all of the equations, and this helps us with the problem of corrupt history.
Finally, the history is slightly corrupted, which can affect the solutions to these equations. To mitigate this problem, the solutions go through a voting stage to help determine the truth. For this alias analysis, I found that at least three systems of equations yielding a solution, '''and''' that same solution being found twice as often as any other, is good enough to conclude that one email address matches another.
With alias analysis done, the CC list histories look like they should: a small CC list at the start of a bug's life, growing over time. The resulting alias mapping is also used to match review flags (which is another story).
[https://github.com/klahnakoski/Bugzilla-ETL/blob/711810f08951a731dc543c10a0973fc34ed17c6b/bzETL/alias_analysis.py Alias Analysis Code]
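The idea can be sketched in a few lines of Python. This is a simplified illustration, not the linked code: it only handles bugs where exactly one added address and one removed address are left unexplained (the single-equation case above), and the function names, the input layout, and the exact reading of the voting thresholds are assumptions made for the example.

```python
from collections import Counter, defaultdict


def unmatched(history, current_cc):
    """Return (added_but_missing, removed_but_never_added) for one bug.

    history is a list of (removed, added) pairs of email lists, oldest first
    (a hypothetical layout chosen for this sketch).
    """
    net = Counter()
    for removed, added in history:
        for email in added:
            net[email] += 1
        for email in removed:
            net[email] -= 1
    current = set(current_cc)
    # The net effect of the history should equal presence in the current list.
    surplus = {e for e, n in net.items() if n - (e in current) > 0}
    deficit = {e for e, n in net.items() if n - (e in current) < 0}
    return surplus, deficit


def find_aliases(bugs, min_votes=3, ratio=2):
    """Vote across bugs; keep a pairing seen at least min_votes times and
    at least ratio times as often as the runner-up (assumed thresholds)."""
    votes = defaultdict(Counter)
    for history, current_cc in bugs:
        surplus, deficit = unmatched(history, current_cc)
        if len(surplus) == 1 and len(deficit) == 1:
            # One equation with one pair of unknowns: k1 == k2
            (old,), (new,) = surplus, deficit
            votes[old][new] += 1
    aliases = {}
    for old, candidates in votes.items():
        ranked = candidates.most_common(2)
        best, count = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if count >= min_votes and count >= ratio * runner_up:
            aliases[old] = best
    return aliases
```

Bugs with several unexplained addresses would need the full system of equations; this sketch simply skips them and relies on the simpler bugs to supply the votes.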
=== Proving Correctness ===
Bugzilla contains some bugs that should not be made public. These include bugs with specific security concerns, but also infrastructure-specific details and other sensitive items. It is important that these do not leak. Writing unit and functional tests is not enough because they can only test the known unknowns. The unknown unknowns are inevitable in any code of reasonable complexity, and you cannot test for those explicitly. Instead, I want to perform the easier task of testing against invariants:
# Private bugs should not be in public cluster
# Private comments should not be in public cluster
# Private attachments should not be in public cluster
The queries that check these invariants are more computationally expensive because they scan the whole datastore, but they are simple enough to be proven correct. Furthermore, these queries can be run outside the main ETL program, allowing us to phase out the expensive checks once we are confident the ETL code is correct.
[https://github.com/klahnakoski/Bugzilla-ETL/blob/88b8d0249cdb4aca1884bd30bed9d11977fc98a9/tests/resources/python/look_for_leaks.py Code to Look for Leaks]
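As a rough illustration of how small such an invariant check can be, here is a sketch that works on in-memory data rather than on the real clusters. It is not the linked script: the field names (<code>bug_id</code>, <code>comment_id</code>, <code>attach_id</code>) and the dictionary layout are hypothetical, chosen only for the example.

```python
def find_leaks(public_docs, private_ids, id_field):
    """Return sorted ids of public documents whose id is known to be private."""
    private_ids = set(private_ids)
    return sorted(doc[id_field] for doc in public_docs if doc[id_field] in private_ids)


def check_invariants(public, private):
    """Check all three invariants; return {kind: [leaked ids]} for violations only.

    public maps a kind to the documents found in the public cluster;
    private maps a kind to the ids known to be private (hypothetical layout).
    """
    id_fields = {"bugs": "bug_id", "comments": "comment_id", "attachments": "attach_id"}
    leaks = {}
    for kind, id_field in id_fields.items():
        leaked = find_leaks(public.get(kind, []), private.get(kind, []), id_field)
        if leaked:
            leaks[kind] = leaked
    return leaks
```

An empty result means all three invariants hold for the data that was scanned. The check is a plain set intersection, which is what makes it easy to reason about independently of the ETL logic.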
== Python Tribulations ==
These are just some complications I have found with Python.
=== Timezones ===
Timezone issues are hard to debug. I wish all code '''everywhere''' dealt with time in GMT (not UTC with its leap seconds; sorry, I am throwing scientists under the bus here, but [http://stackoverflow.com/questions/11279992/math-behind-google-leap-second-smear-formula even Google knows UTC is bad]). The main issue is that I am never really certain how the local environment is interpreting my time data during conversion:
# [http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_unix-timestamp MySQL unix_timestamp() interprets its parameter in the environment timezone]
# [http://stackoverflow.com/questions/18812638/get-timezone-used-by-datetime-datetime-fromtimestamp Python fromtimestamp() converts its parameter to local time unless a timezone is given]
# [http://ecma-international.org/ecma-262/5.1/#sec-15.9.1.8 JavaScript's Date is standardized to '''ignore''' historical changes to Daylight Saving Time]
The real cost was the time it took to find the correct functions: those that assume GMT/UTC, or those that properly account for what the timezones were in the past.
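One way to sidestep the local environment in Python is to avoid the conversion functions that consult it at all. The sketch below is only an illustration of that approach, not code from the ETL: the helper names are made up, and it treats naive <code>datetime</code> values as GMT/UTC by convention.

```python
import calendar
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # naive datetime, treated as GMT/UTC by convention


def utc_from_unix(seconds):
    """Convert a Unix timestamp to a naive GMT/UTC datetime.

    Plain arithmetic on the epoch never consults the local timezone.
    """
    return EPOCH + timedelta(seconds=seconds)


def unix_from_utc(moment):
    """Convert a naive GMT/UTC datetime to a Unix timestamp.

    calendar.timegm() is the UTC counterpart of time.mktime(),
    which would interpret the value in local time instead.
    """
    return calendar.timegm(moment.timetuple())
```

Both helpers should give the same answer on any machine regardless of its timezone setting, which is the property that the functions listed above do not have.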
=== Unicode in Python 2.7 ===
Unicode, character encodings, and Python 2.7 may be harder to debug than timezones, if only because your terminal performs its own conversion to display the data structures you are viewing. Two lines make internationalization easier in Python 2.7:
: <code># encoding: utf-8</code> at the top of every file,
: <code>from __future__ import unicode_literals</code> as the first import in every file
This makes strings unicode everywhere (well, not calls to <code>__getattribute__</code>, but at least they are ascii).
The latter line allows me to avoid explicit unicode literals (u"") all through my code, and it works well with the first, so you do not end up with <code>u"é" != "é".decode("latin1").encode("utf8")</code>
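The kind of mismatch being avoided can be reproduced in a few lines. This sketch uses explicit <code>u""</code> prefixes so that it does not depend on the <code>unicode_literals</code> import and behaves the same on Python 2.7 and Python 3; the variable and function names are only for illustration.

```python
# encoding: utf-8

text = u"é"                              # unicode text: one character
utf8_bytes = text.encode("utf8")         # two bytes: 0xC3 0xA9
mojibake = utf8_bytes.decode("latin1")   # wrong codec: two characters, not one


def round_trips(value, codec):
    """True when encoding then decoding with the same codec returns the original text."""
    return value.encode(codec).decode(codec) == value
```

Decoding with the codec that was used for encoding always gets the original text back; the trouble starts when the two ends of a conversion disagree about the codec, which is easy to do when some literals are bytes and others are unicode.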
=== PyPI and setup.py ===
Python packaging has not matured to the point where it can handle resource files gracefully. You can mix your resource files with your code, and they get put into a predictable place, but put them anywhere else and each Python implementation has its own idea of where your resource files will end up, making it impossible to know where to find them after installation.
The first, and simplest, issue to solve with packaging is to provide a set of pointers to predefined directories. My background is with C# and .NET, and they did a good job of making these variables accessible to applications: