* Python and Javascript property access are different enough to cause a multitude of bugs under naive conversion: For example, converting Javascript <tt>if (!a.b){ ... }</tt> to Python <tt>if not a["b"]: ...</tt> can raise key exceptions, or simply take the wrong path when dealing with empty sets (see the property-access sketch after this list).
* Python is slow. Python speed comes from the C libraries it uses; spending time in the Python interpreter is a bad idea. For example, going through the characters in every string to check for invalid Unicode turned a slow program into an unusable one. The fix was to find a builtin library that did the work for me (or raised an exception when the check failed); see the Unicode-check sketch after this list. This ETL program has significant data structure transformations that can only be done in Python, so the solution was to move to the PyPy interpreter.
* It took a while to build up a library of tests that could be used to verify future changes. More tests => more test code => more bugs in test code => more bugs found in production code => more tests. Sometimes it seemed endless.
* PyPy does not work well with C libraries. The C libraries had to be removed in favor of pure Python versions of the same. This was not too hard, except when it came to the JSON libraries.
* JSON generation is slow: The built-in JSON emitter used generators to convert data structures to a JSON string, but the PyPy optimizer is terrible at analyzing generator code. Furthermore, the JSON libraries available to CPython are incredibly fast (ujson is almost two orders of magnitude faster!), which made the PyPy version appear inferior despite the speedup in the ETL portion of the code. Part of the solution was to use PyPy's own JSON emitter, and to realize that PyPy's default JSON emitter (no pretty printing, no sub-classing, etc.) runs at ujson speeds. The fastest solution I have found so far is to copy the data structure (with sets, Decimals, and other special types) to one of simple dicts, lists and floats, and pass that to the default PyPy JSON emitter (see the JSON-copy sketch after this list)[https://github.com/klahnakoski/pyLibrary/blob/61928e3c9b01b823d666bafcc68b90ab2e4199e3/tests/util/test_json_speed.py].
* Python has old and non-intuitive routine names (<tt>strftime</tt>, <tt>mktime</tt>, <tt>randrange</tt>, etc.); these take time to find, and to confirm there isn't a newer library that should be used instead. I opted to add a facade over all of them to re-envowel their names, and to isolate myself from the risk of using the wrong lib (or having it behave in unexpected ways); see the facade sketch after this list.
* Python2.7 strings are confusing: str() can be either ASCII or UTF8 encoded, but there is no typing to indicate which encoding is used. There are also unicode() strings, which look like strings until you try to compare them: <tt>"é" != u"é"</tt> (see the string-comparison sketch after this list).
* Multithreading was necessary so we could handle multiple network requests at one time while keeping the code easy to read. Python's threading library is still immature in that it has no higher-level threading constructs to deal with common use cases in an environment that raises exceptions (see the worker-thread sketch after this list).
* Python2.7 has no exception chaining, so I added it (see the exception-chaining sketch after this list).
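
The following is a minimal sketch of the property-access pitfall described above; the dict contents and the <tt>.get()</tt> workaround are illustrative, not code from the actual converter.

<pre>
# Javascript:  if (!a.b) { ... }   missing, null, 0 and "" are falsy; [] and {} are truthy
# Python:      if not a["b"]: ...  raises KeyError when "b" is missing; [] and {} are falsy

a = {"c": 1}                 # hypothetical record with no "b" property

try:
    if not a["b"]:           # naive conversion of the Javascript test
        print "take the 'missing' path"
except KeyError:
    print "the naive conversion crashes instead of branching"

# one defensive translation: treat missing keys as falsy, like Javascript does
if not a.get("b"):           # .get() returns None for a missing key
    print "take the 'missing' path, as the Javascript intended"

b = {"b": []}
if not b["b"]:               # Python: an empty list is falsy, so this branch runs
    print "Javascript would NOT take this path: ![] is false there"
</pre>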
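A rough sketch of the Unicode-check lesson: let a C-backed builtin (here the <tt>utf8</tt> codec) scan the characters instead of an explicit Python loop. The function name is my own; the real ETL code may do this differently.

<pre>
def is_valid_utf8(raw):
    # raw is a Python2.7 str (bytes); the C-backed codec scans every
    # character far faster than a character-by-character Python loop
    try:
        raw.decode("utf8")
        return True
    except UnicodeDecodeError:
        return False

print is_valid_utf8("hello")       # True
print is_valid_utf8("\xc3\x28")    # False: 0xC3 starts a 2-byte sequence, 0x28 is not a continuation byte
</pre>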
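A sketch of the JSON-copy idea: convert special types down to plain dicts, lists and floats, then hand the result to the stock emitter with no custom encoder. The <tt>scrub()</tt> helper below is illustrative; the measured implementation lives in pyLibrary (linked above).

<pre>
import json
from decimal import Decimal

def scrub(value):
    # copy sets, Decimals, etc. down to plain dicts, lists and floats so the
    # default (fast) JSON emitter accepts them without a custom encoder
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [scrub(v) for v in value]
    if isinstance(value, Decimal):
        return float(value)
    return value

data = {"ids": {1, 2, 3}, "cost": Decimal("4.20")}
print json.dumps(scrub(data))      # no pretty printing, no sub-classing, just speed
</pre>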
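An illustrative facade in the spirit described above; the module and function names are mine, not the actual pyLibrary names.

<pre>
# dates_and_random.py - a thin facade over the standard library, so the rest
# of the code never has to remember strftime/mktime/randrange
import time
import random
from datetime import datetime

def unix_timestamp(value):
    # datetime -> seconds since epoch (wraps time.mktime)
    return time.mktime(value.timetuple())

def format_date(value, template="%Y-%m-%d %H:%M:%S"):
    # datetime -> string (wraps datetime.strftime)
    return value.strftime(template)

def random_integer(max_value):
    # random int in [0, max_value) (wraps random.randrange)
    return random.randrange(max_value)

print format_date(datetime(2014, 1, 1))
print unix_timestamp(datetime(2014, 1, 1))
print random_integer(10)
</pre>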
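A small demonstration of the str/unicode surprise, assuming the source file is saved as UTF8:

<pre>
# -*- coding: utf-8 -*-
a = "é"      # str: two UTF8-encoded bytes, '\xc3\xa9'
b = u"é"     # unicode: one code point, u'\xe9'

print a == b                     # False (CPython also emits a UnicodeWarning)
print a.decode("utf8") == b      # True, once the encoding is made explicit
</pre>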
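One example of the kind of higher-level threading construct that is missing: a worker thread that remembers its exception so the caller can re-raise it. This is a sketch, not the threading wrapper actually used in the ETL code.

<pre>
import threading

class Worker(threading.Thread):
    # a minimal higher-level thread: it remembers any exception so the
    # parent can re-raise it on join(), instead of losing it to stderr
    def __init__(self, target, *args, **kwargs):
        threading.Thread.__init__(self)
        self.target, self.args, self.kwargs = target, args, kwargs
        self.error = None

    def run(self):
        try:
            self.target(*self.args, **self.kwargs)
        except Exception, e:
            self.error = e

    def join(self, timeout=None):
        threading.Thread.join(self, timeout)
        if self.error is not None:
            raise self.error

def fetch(url):
    raise IOError("network request to %s failed" % url)

w = Worker(fetch, "http://example.com")
w.start()
w.join()     # the IOError surfaces here, in the calling thread
</pre>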
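A hand-rolled approximation of exception chaining for Python2.7; the <tt>Except</tt> class below is illustrative only, not the pyLibrary implementation.

<pre>
import traceback

class Except(Exception):
    # carry the original cause and its traceback, since Python2.7
    # has no "raise ... from ..." syntax
    def __init__(self, message, cause=None):
        Exception.__init__(self, message)
        self.cause = cause
        self.trace = traceback.format_exc() if cause else None

def read_config(filename):
    try:
        return open(filename).read()
    except Exception, e:
        raise Except("can not read config " + filename, cause=e)

try:
    read_config("does_not_exist.ini")
except Except, e:
    print e
    print "caused by:", repr(e.cause)
    print e.trace
</pre>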