Socorro:Server2011


Intro

This document outlines a plan for the future of the Socorro Server codebase. We're currently pushing the 1.7 codebase forward and slowly integrating the features of the failed 1.8 line to produce the 1.9 system. The idea is to create a system with the advanced features required to process a high volume of crashes without ignoring the needs of external adopters of Socorro with smaller volumes.

The Evolving Ecosystem

The original Socorro Server of 2008 through 2010 consisted of three major parts: the Collector, the Monitor and the Processor, along with a crowd of supporting cron routines to generate aggregates and reports. The Monitor and Processor depended on the PostgreSQL database as their intermediate communication channel. While this system was able to scale to a point, it was decided that, in order to achieve one hundred percent processing volume, it would be reimplemented to be compatible with the map/reduce framework Hadoop. The act of processing a crash was to become a map/reduce job, eliminating the need for the Monitor and Processor.

Socorro 1.8

While reworking the Processor as a map/reduce job was the goal, compromises during design and implementation led to falling short of it. Instead, in version 1.8, we ended up with a system that kept the processors but eliminated the Monitor. The database was sacked as the coordinating authority between the components in favor of RESTful communication directly between the components. This didn't scale, which was the primary reason for falling back to the older Processor/Monitor/database communication scheme.

Socorro 1.8 is full of features that do not deserve to be shelved:

  • Collector - mod_wsgi version
  • Processor
    • general refactoring
    • stackwalk_server - the reusable minidump_stackwalk replacement
    • streamProcessor - an engine for reading stackwalk's pipe dump output
    • signatureUtilities - a module that encapsulates signature generation
    • processor services - the RESTful control api
      • the performance stats system
      • the introspection system
      • priority job insertion
  • Registrar - the central registration hub for processors
    • the stats reporting system
  • Socorro Web Services (middleware)
    • refactoring

Socorro 1.7.x

After the demise of 1.8, the Socorro 1.7 line moved forward on its own. Some enhancements made here should be perpetuated into future versions.

  • iterator/worker framework
  • unit test cleanup

Additional Features

  • Multiprocess support
  • CrashStorageSystem modularization - runtime selectable storage
  • Database implementation of the CrashStorageSystem
  • Runtime-settable crash source modules for processors
  • logging refactoring - move toward the standard logging module

Socorro Future

Running map/reduce jobs in Hadoop using the streaming interface is as simple as reading from standard in and writing to standard out. While we no longer plan on actually implementing crash processing as an M/R job, it serves as a good minimal use case. I call this the single shot collector: a single process with no dependencies that can be run from the command line.
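
To make that concrete, here is a minimal sketch of the single shot pattern. The process_one function is a stand-in for real crash processing, not actual Socorro code.

 # Read one ooid per line from stdin, do one unit of work per ooid,
 # write a result line to stdout.
 import sys
 
 def process_one(ooid):
     return "processed %s" % ooid  # stand-in for fetch/process/store
 
 def main():
     for line in sys.stdin:
         ooid = line.strip()
         if ooid:
             sys.stdout.write(process_one(ooid) + "\n")
 
 if __name__ == "__main__":
     main()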

This single unit becomes the base for more and more complicated constructs. The code that reads from stdin and writes to stdout is encapsulated in a runtime loadable module: the CrashStorage class hierarchy. Derivatives of CrashStorage include an HBase class, a local file system class, our legacy NFS storage and a database storage class.

There is a repeating pattern in Socorro Server applications. Some iterator feeds crash ooids to a swarm of workers. The workers load the crash associated with each ooid, manipulate it and write it out somewhere. This pattern has been captured in the iterator/worker pattern class used in the 1.7.6 crashMover app. In the case of that app, the iterator reads new crash data from the collector storage and then just turns around and saves it to HBase.

The existing processor uses this pattern, too. An iterator feeds ooids representing both standard and priority jobs to the threaded workers. They process the jobs and save the results to both the database and HBase. A future processor should be refactored to use the iterator/worker framework as well as the modular crash storage classes.

Storage System Modularization

The 1.7.6 CrashStorageSystem classes encapsulate HBase, local file system and NFS storage. Each is implemented as a derived class of a common base. The api of the base class evolved rather than having been engineered up front, so the interface is somewhat scattered and could use some refactoring. The processors were written without an encapsulation of the database storage. This is an omission that should be fixed, and the CrashStorageSystem is the obvious solution.

Storage System API

These are the main tasks that the storage class is called upon to do; a sketch of the implied base class follows the list. A given subclass may also implement some additional special functions.

  • save_raw - take a meta json and a dump and save them
  • save_processed - save the processed dump in a json form
  • get_meta - retrieve the original meta json
  • get_raw_dump - retrieve the original binary dump
  • get_processed - retrieve the processed dump
  • remove - delete the meta json, binary dump and processed dump
  • uuidInStorage - return true if an ooid exists within the store
  • newUuids - return an iterator of any new ooids added to the store where a processed dump has not been saved.
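
To make the contract concrete, here is a sketch of a base class implied by the list above. The method signatures are assumptions, not copies from the 1.7.6 source.

 # Sketch of the CrashStorageSystem base class implied by the api list.
 class CrashStorageSystem(object):
     def save_raw(self, ooid, json_meta, dump):
         raise NotImplementedError
     def save_processed(self, ooid, processed_json):
         raise NotImplementedError
     def get_meta(self, ooid):
         raise NotImplementedError
     def get_raw_dump(self, ooid):
         raise NotImplementedError
     def get_processed(self, ooid):
         raise NotImplementedError
     def remove(self, ooid):
         raise NotImplementedError
     def uuidInStorage(self, ooid):
         raise NotImplementedError
     def newUuids(self):
         raise NotImplementedError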

The Storage System Derived Classes

CrashStorageSystemForHBase

This class is Socorro's interface with HBase. It wraps the Thrift code that lives within HbaseClient to provide the standard api. This Thrift code is destined to be replaced with something more robust.

CollectorCrashStorageSystemForHBase

This is a specialized version of the crash storage for HBase. The only difference is that it implements a fallback storage mechanism, on disk, for use when HBase is unavailable.
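
The fallback behavior could be expressed roughly as below, building on the base class sketch above. Composing two subordinate stores is an assumption about structure, not a description of the real code.

 class CollectorCrashStorageSystemForHBase(CrashStorageSystem):
     def __init__(self, hbase_storage, fallback_storage):
         self.hbase = hbase_storage
         self.fallback = fallback_storage
 
     def save_raw(self, ooid, json_meta, dump):
         # try HBase first; on any failure, save to the on-disk fallback
         try:
             self.hbase.save_raw(ooid, json_meta, dump)
         except Exception:
             self.fallback.save_raw(ooid, json_meta, dump)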

CrashStorageSystemForLocalFS

This version of crash storage, introduced in 1.7.6, stores crash data directly in a local file system. Like CollectorCrashStorageSystemForHBase, it implements a fallback storage for when the primary storage system is full or unavailable. Both file systems are implemented using the JsonDumpStorage scheme from an early 2009 implementation of Socorro.

This class does not implement the full CrashStorageSystem api. Since it is used by the Collector to store json metadata and raw crashes, but not processed crashes, it omits processed crash storage. Any attempt to use a processed crash api call will result in a "not implemented" exception being raised.

CrashStorageSystemForNFS

This version of crash storage has never been used in production code. It is an implementation of an earlier version of the Socorro Server storage techniques within the CrashStorageSystem framework. It uses two file systems, one for primary storage and one for deferred storage. Both file systems are implemented using the JsonDumpStorage scheme from an early 2009 implementation of Socorro. This class was written as a "just in case" fallback if the original attempt at the migration to HBase failed. Like the CrashStorageSystemForLocalFS, it is used primarily for storage of crashes from the collector before processing. There is no implementation of processed crash storage.

CrashStorageSystemForPostgres

This class doesn't yet exist. There is no reason that the CrashStorageSystem api couldn't be adapted to encapsulate the database. Saves of raw and processed data would be lossy, since the database is selective about which parts of a crash it stores.
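
A first cut might look something like this sketch, using a DB-API connection. The reports table and its columns are illustrative only; the lossiness shows in that just a few fields of the processed json are persisted.

 class CrashStorageSystemForPostgres(CrashStorageSystem):
     def __init__(self, connection):
         self.connection = connection
 
     def save_processed(self, ooid, processed_json):
         # lossy by design: only selected fields are written
         cursor = self.connection.cursor()
         cursor.execute(
             "insert into reports (uuid, signature, date_processed)"
             " values (%s, %s, %s)",
             (ooid,
              processed_json.get('signature'),
              processed_json.get('date_processed')))
         self.connection.commit()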

DualCrashStorageSystem

This class doesn't exist yet. The processor is an example of a system that needs to store crash data in two crash storage repositories at the same time. This class would hold two instances of other CrashStorage classes and pass each api call through to both of its subordinate instances, in sequence: first one, then the other. Theoretically, the saves could be done on their own threads, but the value of that is unproven.

The processor would be the primary client for this class. Postgres and HBase CrashStorage instances would be encapsulated in a DualCrashStorage instance as the processed crash data sink.
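
The pass-through behavior is simple enough to sketch; only two api methods are shown, and the constructor shape is an assumption.

 class DualCrashStorageSystem(CrashStorageSystem):
     def __init__(self, first, second):
         self.first = first
         self.second = second
 
     def save_raw(self, ooid, json_meta, dump):
         # pass through in sequence: first one, then the other
         self.first.save_raw(ooid, json_meta, dump)
         self.second.save_raw(ooid, json_meta, dump)
 
     def save_processed(self, ooid, processed_json):
         self.first.save_processed(ooid, processed_json)
         self.second.save_processed(ooid, processed_json)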

Configurable Crash Storage

With the CrashStorage classes, the framework exists to allow any Socorro Server app that uses crash storage to save data in any of these formats. With some additional features in the ConfigurationManager, any app could specify its storage technique at run time. This is a primary feature that will allow other projects to continue to use Socorro.

ConfigurationManager Namespaces

The ConfigurationManager needs some changes to be able to dynamically load a class in the right context. Take, for example, the crashMover app, which copies from one source to one destination. The configuration file needs to specify the source and destination CrashStorage classes and the parameters required by those classes. However, if the app copies from one local file system to another, the source and destination classes are the same, which means there would be name collisions among the configuration parameters for the two local file systems.

If the configuration file were to add namespaces, there could be, for example, "source.root" and "destination.root". The ConfigurationManager could then present to each CrashStorage constructor the proper initialization dictionary for whichever CrashStorage instance it was initializing:

 source = ConfigurationManager.namespace('source')
 
 source.storageClass = ConfigurationManager.Option()
 source.storageClass.doc = 'fully qualified class name for source storage'
 source.storageClass.default = 'socorro.crashStorage.CrashStorageSystemForLocalFS'
 source.storageClass.fromStringConverter = ConfigurationManager.classLoader
 
 source.root = ConfigurationManager.Option()
 source.root.doc = 'a path to a local file system'
 source.root.default = '/some/path/for/storage'
 
 # ...
 
 sourceCrashStorage = config.source.storageClass(config.source)
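
A destination namespace would be declared the same way, so an app like crashMover could instantiate both stores without name collisions. This continuation is illustrative:

 destination = ConfigurationManager.namespace('destination')
 
 # ... options mirroring those of the source namespace ...
 
 destinationCrashStorage = config.destination.storageClass(config.destination)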

Iterator / Worker Model

This is the recurring theme of Socorro Server apps. An iterator of some sort feeds ooids to a set of workers that then do something with those ooids. This has been reduced to a simple framework in the IteratorWorkerFramework class. The iterator and the task that the workers perform are pluggable. The crashMover app already uses this framework. The Processor could and should also be converted to use this class.
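
A minimal, self-contained sketch of the pattern follows, using the threading and Queue modules of the Python 2 era codebase; the real IteratorWorkerFramework differs in detail, and the names here are illustrative.

 import Queue
 import threading
 
 def run_iterator_workers(ooid_iterator, task, number_of_threads=4):
     jobs = Queue.Queue()
 
     def worker():
         while True:
             ooid = jobs.get()
             if ooid is None:  # sentinel: no more work
                 return
             task(ooid)
 
     threads = [threading.Thread(target=worker)
                for _ in range(number_of_threads)]
     for t in threads:
         t.start()
     for ooid in ooid_iterator:
         jobs.put(ooid)
     for _ in threads:
         jobs.put(None)  # one sentinel per worker
     for t in threads:
         t.join()

In crashMover terms, the iterator would yield new ooids from the collector's storage and the task would copy each crash to HBase.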

Parallel Processing

The framework uses threads for the workers. There is no reason that this couldn't be converted to use the multiprocessing module instead, as sketched below.
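
A sketch of that conversion, assuming the per-crash task is a picklable top-level function:

 import multiprocessing
 
 def process_ooid(ooid):
     return ooid  # stand-in for the real per-crash task
 
 def run_with_processes(ooid_iterator, number_of_processes=4):
     # multiprocessing requires the task to be defined at module top
     # level so it can be pickled and sent to the worker processes
     pool = multiprocessing.Pool(number_of_processes)
     try:
         pool.map(process_ooid, ooid_iterator)
     finally:
         pool.close()
         pool.join()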

Implementation Order

  • bring middleware refactoring forward
  • ConfigurationManager namespaces
  • bring the 1.8 Processor forward
    • mod the input iterator (RESTful ooid push) to use monitor's crash queuing
    • mod the output to use the DualCrashStorage: HBase & Postgres
  • bring the 1.8 Registrar forward
  • replace Monitor with (RESTful pull from HBase?)
  • refactor IteratorWorker to use multiprocessing
  • refactor Processor
    • use IteratorWorker
    • refactor logging