Raindrop/Strawman

From MozillaWiki

Whereas

  • the CouchDB version of Raindrop scales very poorly, for a variety of reasons that we understand more or less well.

and

  • even if we could get an order-of-magnitude increase in the performance of CouchDB in our architecture, it would still be unaffordable to host.

and

  • our data model abstractions currently leak too much into the HTTP APIs

and

  • our "spread out" data model makes it too hard for newcomers to understand and work with Raindrop

It is resolved that

  • We will do a Raindrop reset.

This reset prioritizes:

  • APIs that make it easier to build front-ends
  • an architecture that takes hosted scaling into consideration
  • using existing battle-tested technologies when possible

After talking to a bunch of people, I'm proposing the following strawman:

1) We stop using CouchDB as a queue, and use an actual queue instead. Specifically, we use a message queue (RabbitMQ has consensus). This would enable:

  • understanding the performance cost of message fetching, and allocating that work to specific processing units (processes, nodes, etc.)
  • better horizontal scaling
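
The producer/worker split in item 1 can be sketched as follows. This is only an illustration under stated assumptions: Python's standard-library `queue.Queue` stands in for RabbitMQ so the sketch runs without a broker, and the names `worker`, `run`, and the `{"id": ...}` message shape are hypothetical, not part of Raindrop.

```python
import json
import queue
import threading


def worker(inbox, results):
    """Consume JSON-encoded messages until a None sentinel arrives."""
    while True:
        body = inbox.get()
        if body is None:  # sentinel: no more work for this worker
            inbox.task_done()
            break
        message = json.loads(body)
        # Hypothetical processing step: just record the message id.
        results.append(message["id"])
        inbox.task_done()


def run(message_ids, num_workers=2):
    """Fetcher side: enqueue messages, then let the workers drain the queue."""
    inbox = queue.Queue()  # stand-in for a broker queue (assumption)
    results = []
    workers = [
        threading.Thread(target=worker, args=(inbox, results))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()
    for mid in message_ids:
        inbox.put(json.dumps({"id": mid}))
    for _ in workers:
        inbox.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
    return sorted(results)
```

Because fetching and processing only meet at the queue, the number of workers (here `num_workers`) can be changed without touching the fetcher, which is the horizontal-scaling property the proposal is after.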

2) We define a clear HTTP API for use by inflow & other front-ends.

3) We use a blob store to keep raw messages and JSON-normalized messages. MogileFS is a candidate for the hosted version, but we'd probably want a trivial Python equivalent for localhost development.
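
A "trivial Python equivalent" for local development might look like the sketch below. It is an assumption-laden illustration, not a MogileFS-compatible client: the class name `LocalBlobStore`, the content-addressed (SHA-256) key scheme, and the directory layout are all hypothetical choices.

```python
import hashlib
import os
import tempfile


class LocalBlobStore:
    """Filesystem-backed blob store for localhost development (hypothetical)."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        # Shard by the first two hex characters to keep directories small.
        return os.path.join(self.root, key[:2], key)

    def put(self, data):
        """Store bytes and return their key (the SHA-256 hex digest)."""
        key = hashlib.sha256(data).hexdigest()
        path = self._path(key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        # Write to a temporary file, then rename, so readers never see
        # a partially written blob.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)
        return key

    def get(self, key):
        """Return the bytes stored under key."""
        with open(self._path(key), "rb") as f:
            return f.read()
```

Keying blobs by content hash means storing the same raw message twice is harmless, and the returned key is what a database row would reference instead of holding the message body itself.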

4) We optimize the pipeline to do all processing of messages in memory, only writing the final processed objects to disk at the end of processing a message, to save massively on DB use.
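
The idea in item 4 can be sketched as a chain of in-memory steps followed by a single write. Everything here is hypothetical and deliberately simplified: the step functions, the `STEPS` list, and the `write` callback (standing in for whatever persistence layer is chosen) are assumptions for illustration, and the header parsing is far cruder than real mail parsing.

```python
import json


def normalize(raw):
    """Split a raw message into a header dict and a body (simplified)."""
    headers, _, body = raw.partition("\n\n")
    fields = {}
    for line in headers.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip().lower()] = value.strip()
    return {"headers": fields, "body": body}


def add_summary(doc):
    # Hypothetical processing step: first 40 characters of the body.
    doc["summary"] = doc["body"].strip()[:40]
    return doc


def add_sender(doc):
    # Hypothetical processing step: pull the sender out of the headers.
    doc["sender"] = doc["headers"].get("from", "")
    return doc


STEPS = [add_summary, add_sender]


def process_message(raw, write):
    """Run every step in memory, then persist the final object once."""
    doc = normalize(raw)
    for step in STEPS:
        doc = step(doc)  # intermediate results never touch the database
    write(json.dumps(doc, sort_keys=True))  # the only write per message
    return doc
```

However many steps are added to `STEPS`, each message still costs exactly one call to `write`, which is where the savings on database use would come from.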

5) We use a mature ORM (specifically SQLAlchemy) as the Raindrop equivalent of Gloda.
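
To make item 5 concrete, here is a minimal sketch of what an ORM-backed message index could look like. It assumes SQLAlchemy 1.4 or later and an in-memory SQLite database; the `Message` model, its columns, and the `messages_from` helper are hypothetical and not a proposed schema.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker  # SQLAlchemy 1.4+

Base = declarative_base()


class Message(Base):
    """Hypothetical index row; the message body lives in the blob store."""

    __tablename__ = "messages"

    id = Column(Integer, primary_key=True)
    sender = Column(String, nullable=False)
    subject = Column(String, nullable=False)
    blob_key = Column(String, nullable=False)  # key into the blob store


# In-memory SQLite keeps the sketch self-contained (assumption).
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)


def messages_from(session, sender):
    """Return all indexed messages from a given sender, oldest first."""
    return (
        session.query(Message)
        .filter_by(sender=sender)
        .order_by(Message.id)
        .all()
    )
```

A front-end-facing API would then query through helpers like `messages_from` rather than exposing the storage layout, which speaks to the concern above about data model abstractions leaking into the HTTP APIs.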

[Figure: Raindrop reset large.png]