This project took more effort than expected. Here are some of the complications that slowed down development. Please keep in mind that I had only a couple of months of Python experience before starting this conversion, so feel free to take pleasure in my ignorance:
=== ETL Issues ===
* Python and Javascript property access are different enough to cause a multitude of bugs when performing a naive conversion: for example, converting Javascript <tt>if (!a.b){ ... }</tt> to Python <tt>if not a["b"]: ...</tt> can raise a <tt>KeyError</tt> on a missing key, or simply take the wrong branch when the value is an empty collection (falsey in Python, truthy in Javascript); see the first sketch after this list.
* Alias analysis is error-prone: the email address used by a user can change, and there is no record of those changes. The bug activity table has recorded changes for emails that "apparently" do not exist. Well, they do exist, but they are aliased. The old ETL used reviews to do some matching; the new version uses the CC lists, which carry more information. The underlying problem is fundamental corruption in the history, caused by (possibly) direct poking of the database. This corruption must be mitigated with fuzzy logic (see the second sketch after this list).
* It took a while to build up a library of tests that could be used to verify future changes. More tests => more test code => more bugs in test code => more bugs found in production code => more tests. Sometimes it seemed endless.
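To make the property-access pitfall concrete, here is a minimal runnable sketch; the names <tt>a</tt> and <tt>b</tt> are illustrative, not taken from the actual ETL code:

<pre>
# Javascript: if (!a.b) { ... }
# A missing property is just undefined (falsey), so the branch
# is taken without any error.

a = {}

# The naive Python conversion raises KeyError on a missing key:
try:
    if not a["b"]:
        print("missing or empty")
except KeyError:
    print("KeyError: 'b'")

# dict.get() avoids the exception but introduces a new hazard:
# an empty list is falsey in Python, while an empty Array is
# truthy in Javascript, so the two programs take different paths.
a = {"b": []}
if not a.get("b"):
    print("Python takes this branch; Javascript would not")
</pre>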
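The alias mitigation lives in the ETL's own heuristics; as a loose illustration of "fuzzy logic" over email addresses, here is a sketch using the standard library's <tt>difflib</tt> (the helper name and threshold are assumptions, not the project's actual rules):

<pre>
from difflib import SequenceMatcher

def probably_same_user(email_a, email_b, threshold=0.8):
    # Illustrative heuristic only: treat two addresses as aliases
    # when their local parts are sufficiently similar. The real
    # ETL combines signals such as reviews and CC lists instead.
    local_a = email_a.split("@")[0].lower()
    local_b = email_b.split("@")[0].lower()
    return SequenceMatcher(None, local_a, local_b).ratio() >= threshold

print(probably_same_user("jdoe@mozilla.org", "jdoe@example.com"))  # True
print(probably_same_user("jdoe@mozilla.org", "ann@example.com"))   # False
</pre>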
=== Python Issues ===
* Python is slow. Python's speed comes from the C libraries it uses; spending time in the Python interpreter is a bad idea. For example, looping over the characters of every string to check for invalid Unicode turned a slow program into an unusable one. The solution was to find a builtin library that did the work for me (or raised an exception when the check failed); see the first sketch after this list. This ETL program also performs significant data structure transformations that can only be done in Python, so the other part of the solution was to move to the PyPy interpreter.
* PyPy does not work well with C libraries. The C libraries had to be removed in favor of pure Python versions of the same. This was not too hard, except when it came to JSON libraries.
* JSON generation is slow: The built-in JSON emitter used generators to convert data structures to a JSON string, but the PyPy optimizer is terrible at analyzing generator code. Furthermore, the JSON libraries available to CPython are incredibly fast (<tt>ujson</tt> is faster by almost two orders of magnitude!). This made the PyPy version appear inferior despite the speedup in the ETL portion of the code. Part of the solution was to use PyPy's own JSON emitter, but also to realize that PyPy's default JSON emitter (no pretty printing, no sub-classing, etc.) runs at <tt>ujson</tt> speeds. The fastest solution I have found so far is to copy the data structure (with sets, Decimals, and other special types) into one made of simple dicts, lists, and floats, and pass that to the default PyPy JSON emitter (see the second sketch after this list)[https://github.com/klahnakoski/pyLibrary/blob/61928e3c9b01b823d666bafcc68b90ab2e4199e3/tests/util/test_json_speed.py].
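For the Unicode check mentioned under "Python is slow", the general shape of the builtin-library approach looks like this; a minimal sketch assuming UTF-8 validation via the C-implemented codec, not the project's exact call:

<pre>
def is_valid_utf8(raw_bytes):
    # Let the builtin (C-implemented) codec scan the bytes; it
    # raises UnicodeDecodeError on an invalid sequence, which is
    # far faster than inspecting characters one at a time in the
    # Python interpreter.
    try:
        raw_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"hello"))      # True
print(is_valid_utf8(b"\xff\xfe!"))  # False
</pre>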
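The "copy to simple types first" strategy for JSON speed amounts to a recursive scrub before serialization; a minimal sketch with the stock <tt>json</tt> module (the function name <tt>scrub</tt> is illustrative; the linked test file holds the real benchmark):

<pre>
import json
from decimal import Decimal

def scrub(value):
    # Recursively copy a structure containing special types into
    # plain dicts, lists, and floats so the default (fast) JSON
    # emitter can serialize it without sub-classing or hooks.
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [scrub(v) for v in value]
    if isinstance(value, Decimal):
        return float(value)
    return value

data = {"score": Decimal("1.5"), "tags": {"etl", "json"}}
print(json.dumps(scrub(data)))
# e.g. {"score": 1.5, "tags": ["etl", "json"]}
</pre>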