This page describes our planning and thinking about the direction the Raindrop/BackEnd should take.
Better leverage of CouchDB
There are several ways in which CouchDB could help us.
Better 'externals' support
Currently our API is implemented using a CouchDB 'external'. It would be great if:
- CouchDB allowed more than one request at a time. Currently, requests are "queued" - so if our front-end issues 2 API requests simultaneously, one must wait until the other is complete before starting. It appears as though one of the requests takes twice as long as it actually does. This is reflected in a couchdb bug.
- CouchDB provided enough information for the external process to connect to the database "hosting" the external. While the database name can be deduced, the address and port on which the database accepts connections don't seem to be available. This is being tracked in this bug.
Formal Schema Definitions
We hope to come up with formal schema definitions and store them directly in couch documents - kind of 'meta-documents'. Apart from defining the names and types of fields, other interesting meta-data could be stored, such as:
- The list of 'rd_megaview_expandable' fields (this field is described in Raindrop/Megaview#Value_expansion)
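To make the idea concrete, here is a minimal sketch of what such a meta-document and a check against it might look like. Everything in it is an assumption made for illustration (the `schema_name` key, the `fields` layout, the type names and the `validate` helper); only `rd_megaview_expandable` comes from the existing design.

```python
# Hypothetical schema "meta-document". Only 'rd_megaview_expandable' is an
# existing Raindrop name; every other key and value is invented for this sketch.
EXAMPLE_SCHEMA_DOC = {
    "schema_name": "example.message",
    "fields": {
        "from": {"type": "list"},
        "subject": {"type": "string"},
        "timestamp": {"type": "number"},
    },
    "rd_megaview_expandable": ["from"],
}

# Map the hypothetical type names onto simple Python checks.
_TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "list": lambda v: isinstance(v, list),
}


def validate(doc, schema):
    """Return a list of problems found in doc; an empty list means it conforms."""
    problems = []
    for name, spec in schema["fields"].items():
        if name not in doc:
            continue  # in this sketch every field is optional
        if not _TYPE_CHECKS[spec["type"]](doc[name]):
            problems.append("field %r is not a %s" % (name, spec["type"]))
    return problems
```

Because the schema is itself just a JSON-compatible dictionary, it could be stored in and fetched from the database like any other document.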
Many sub-processes/languages model
One possible strategy would be to take a leaf out of CouchDB's book and adopt a similar execution model. This model may look something like:
- Each extension gets its own OS process
- Documents which meet the extension's criteria are sent to this sub-process's stdin, encoded as JSON.
- A driver script in the sub-process calls the extension function, gathers the results it produces, and sends them back to Raindrop via stdout.
- Raindrop saves these new documents and runs any new extensions.
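The sub-process side of this model could be sketched roughly as follows. This is only an illustration under stated assumptions: the one-JSON-document-per-line framing, the `run_driver` function and the `example_extension` generator are all hypothetical and are not part of Raindrop or CouchDB.

```python
import io
import json
import sys


def example_extension(doc):
    """Hypothetical extension: emit one new document per input document."""
    yield {"source_id": doc.get("_id"),
           "subject_length": len(doc.get("subject", ""))}


def run_driver(extension, infile=sys.stdin, outfile=sys.stdout):
    """Read one JSON document per line, run the extension, write results back.

    The line-based framing is an assumption of this sketch, not a defined
    Raindrop protocol.
    """
    for line in infile:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between documents
        doc = json.loads(line)
        results = list(extension(doc))  # gather everything the extension produced
        outfile.write(json.dumps(results) + "\n")
        outfile.flush()  # the parent process is waiting on this line


# Exercise the driver in-process with in-memory streams instead of real pipes.
fake_stdin = io.StringIO('{"_id": "a", "subject": "hello"}\n')
fake_stdout = io.StringIO()
run_driver(example_extension, fake_stdin, fake_stdout)
print(fake_stdout.getvalue(), end="")
```

In a real sub-process the driver would simply be called as `run_driver(some_extension)` so that it reads from the process's own stdin and writes to its stdout.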
Such a scheme has some potential drawbacks though:
- We currently have over 20 back-end extensions. Do we really want one OS process per extension? Attempting to reuse an extension may limit the concurrency we can get (i.e., how many extensions we can have executing at once, particularly when these extensions are actually blocked waiting for CouchDB). There may be a middle ground.
- Sending JSON representations over stdin/stdout adds another layer of overhead and will almost certainly be slower than passing objects around in memory. Given our current performance characteristics, this may not be acceptable.
Twisted is an amazing Python library for working with asynchronous IO and other operations. It is particularly suited to applications which need to maintain a huge number of connections with the outside world. However, it does have a number of drawbacks which are worth discussing.
This section details the pros and cons of using Twisted, with the aim of formulating a plan for the back-end architecture.
- Debugging Twisted applications is more difficult than debugging synchronous programs: the flow of execution isn't linear, making it very hard to "step" over an asynchronous function. The learning curve for experienced programmers, even experienced Python programmers, can be steep.
- Diagnostics are hard. For example, we have seen Twisted get into a fatal loop complaining about too many file descriptors in select, even though there is no reason our application should have such a large number of asynchronous requests outstanding. Twisted doesn't seem to offer much help in determining what is going on, other than a fairly coarse 'debug' flag on deferred instances. Worse, once Twisted gets into this state it remains in an infinite loop, spewing error messages to the log until forcibly terminated.
- If you need to use non-Twisted APIs, you are forced to use the Twisted thread-pool to simulate an asynchronous operation. Currently, none of the libraries we use, with the exception of paisley, are Twisted-aware. Further, this tends to limit the choice of libraries: paisley, for example, lacks some very useful features found in other libraries, and because we stuck with paisley for its Twisted support, we needed to re-implement some features we would have got for free elsewhere.
- The few Twisted-aware components we use tend to be buggy and poorly supported. The paisley library appears to have died, and the IMAP support in Twisted has bugs which are many years old and appear to be well understood, but remain unfixed (eg, bug 1443). While it might be possible to switch to a non-Twisted IMAP library that works better, this tends to defeat the point of using Twisted in the first place.
- Raindrop doesn't accept incoming connections from the outside world. Thus, Raindrop doesn't have the "massive number of connections" scalability concern which Twisted addresses; in our model, CouchDB, which is implemented in Erlang, looks after that.
- It seems difficult to implement 'background tasks' in twisted. It isn't clear how to use twisted to do what would otherwise be done using a low-priority thread.
- On the positive side, Twisted makes it easy to have multiple things going on in parallel without managing threads manually. Raindrop does leverage this:
  - When executing extensions. However, this roadmap calls for moving extensions outside the Raindrop process anyway.
  - When maintaining up to 6 connections to an IMAP server. Even that implementation, however, uses a queue approach that would be very simple to re-implement using threads.
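As a rough indication of how small a thread-based replacement for the queue approach could be, here is a sketch using only the standard library. The `fetch_all` function and its `fetch` callable are hypothetical stand-ins; in practice each worker would own one IMAP connection, which this sketch does not attempt to model.

```python
import queue
import threading

MAX_CONNECTIONS = 6  # the connection limit mentioned above


def fetch_all(requests, fetch, max_workers=MAX_CONNECTIONS):
    """Run fetch(request) for every request using at most max_workers threads.

    Error handling is deliberately left out: if fetch raises, that worker
    thread ends and the failing request has no entry in the result.
    """
    work = queue.Queue()
    for item in requests:
        work.put(item)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do; let the thread exit
            value = fetch(item)
            with lock:
                results[item] = value

    threads = [threading.Thread(target=worker) for _ in range(max_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A call such as `fetch_all(folder_names, fetch_folder)` (both names hypothetical) blocks until every request has been processed, which keeps the calling code entirely synchronous.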
While Twisted is a great library for applications with specific requirements, Raindrop doesn't seem to be such an application. The barrier to entry for potential new contributors is high, and some of the available libraries don't seem to have the same quality as their non-Twisted counterparts. We explicitly do not want to expose an asynchronous API to our extension model.