Socorro:HBase

General HBase information

Required Reading

Useful Links

Notes

  • Socorro's Python layer talks to HBase through the Thrift API. Every node in the cluster runs a Thrift server, and all of them sit behind a single VIP (virtual IP) that only production servers can access.
  • Column families
    • Example: A common column family in Socorro is "ids:", and a common column in that family is "ids:ooid" (family "ids", qualifier "ooid"). Another column in the same family is "ids:hang".
    • The table schema enumerates its column families. Each column family carries settings for compression, the number of value versions retained, and caching.
    • A column family can store tens of thousands of values with different column qualifier names.
    • Retrieving data from multiple column families requires at least one block access (disk or memory) per column family. Accessing multiple columns in the same family requires only one block access.
    • If you specify just the column family name when retrieving data, the values for all columns in that column family will be returned.
    • If a record has no value for one of the columns you query for, no "null" is returned; the returned row simply has no entry for that column.
  • Manipulating a row
    • All manipulations are performed using a rowkey.
    • Setting a column to a value creates the row if it doesn't exist, or updates the column if it already exists.
    • Deleting a non-existent row or column is a no-op.
    • Counter column increments are atomic and very fast. StumbleUpon has some counters that they increment hundreds of times per second.
  • Tables are always ordered by their rowkeys
    • Scanning a range of a table based on a rowkey prefix or a start and end range is fast.
    • Retrieving a row by its key is fast.
    • Searching for a row requires a rowkey structure that you can easily do a range scan on, or a reverse index table.
    • A full scan on a table that contains billions of items is slow (although, unlike in an RDBMS, it isn't likely to cause performance problems).
    • If you continually insert rows with similar rowkey prefixes, all of those writes land on a single RegionServer. In excess, this creates an unpleasant hotspot.
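The row and rowkey behavior described in these notes can be illustrated with a small in-memory model. This is only a sketch for building intuition, not Socorro's code and not the HBase or Thrift API: the class `TinyTable`, its method names, and the rowkeys used in the usage example are all hypothetical.

```python
import bisect


class TinyTable:
    """Hypothetical in-memory stand-in for an HBase table (not the real API)."""

    def __init__(self):
        self._keys = []  # rowkeys, always kept sorted
        self._rows = {}  # rowkey -> {"family:qualifier": value}

    def put(self, rowkey, column, value):
        # Setting a column creates the row if needed, otherwise updates it.
        if rowkey not in self._rows:
            bisect.insort(self._keys, rowkey)
            self._rows[rowkey] = {}
        self._rows[rowkey][column] = value

    def get(self, rowkey, columns=None):
        # Columns without a stored value are absent from the result (no "null").
        row = self._rows.get(rowkey, {})
        if columns is None:
            return dict(row)

        def wanted(name):
            # "ids:" selects the whole family; "ids:ooid" selects one column.
            return any(
                name == c or (c.endswith(":") and name.startswith(c))
                for c in columns
            )

        return {name: value for name, value in row.items() if wanted(name)}

    def delete(self, rowkey, column=None):
        # Deleting a non-existent row or column is a no-op.
        row = self._rows.get(rowkey)
        if row is None:
            return
        if column is None:
            row.clear()
        else:
            row.pop(column, None)
        if not row:
            # A row with no cells left no longer exists.
            del self._rows[rowkey]
            del self._keys[bisect.bisect_left(self._keys, rowkey)]

    def scan_prefix(self, prefix):
        # Because rowkeys are sorted, a prefix scan is a seek plus a short walk.
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key, dict(self._rows[key])
```

For example, after `put("a1", "ids:ooid", "x")` and `put("a2", "ids:hang", "h")`, calling `get("a1", ["ids:ooid", "ids:hang"])` returns only the `ids:ooid` entry, and `scan_prefix("a")` visits `a1` then `a2` without touching rows under other prefixes.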

Socorro HBase Schema

DRAFT
The content of this page is a work in progress intended for review.

Please help improve the draft!

Ask questions or make suggestions in the discussion
or add your suggestions directly to this page.

Table crash_reports

{NAME => 'crash_reports',
FAMILIES => [{NAME => 'flags', COMPRESSION => 'NONE', VERSIONS => '1',
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}, {NAME => 'ids', VERSIONS => '1',
COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'meta_data',
COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647',
BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'processed_data', VERSIONS => '1', COMPRESSION => 'LZO',
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}, {NAME => 'raw_data', COMPRESSION => 'LZO',
VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'timestamps',
VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647',
BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

Index Tables

Index crash_reports_index_hang_id

{NAME => 'crash_reports_index_hang_id', 
FAMILIES => [{NAME => 'ids', COMPRESSION => 'LZO', VERSIONS => '1', 
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', 
BLOCKCACHE => 'true'}]}

Index crash_reports_index_hang_id_submitted_time

{NAME => 'crash_reports_index_hang_id_submitted_time', 
FAMILIES => [{NAME => 'ids', COMPRESSION => 'LZO', VERSIONS => '1', 
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', 
BLOCKCACHE => 'true'}]}

Index crash_reports_index_legacy_submitted_time

{NAME => 'crash_reports_index_legacy_submitted_time', 
FAMILIES => [{NAME => 'ids', COMPRESSION => 'LZO', VERSIONS => '1', 
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', 
BLOCKCACHE => 'true'}]}

Index crash_reports_index_legacy_unprocessed_flag

{NAME => 'crash_reports_index_legacy_unprocessed_flag', 
FAMILIES => [{NAME => 'ids', COMPRESSION => 'NONE', VERSIONS => '1', 
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', 
BLOCKCACHE => 'true'}, {NAME => 'processor_state', VERSIONS => '5', 
COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', 
IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

Index crash_reports_index_signature_ooid

{NAME => 'crash_reports_index_signature_ooid', 
FAMILIES => [{NAME => 'ids', COMPRESSION => 'LZO', VERSIONS => '1', 
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', 
BLOCKCACHE => 'true'}]}

Index crash_reports_index_submitted_time

{NAME => 'crash_reports_index_submitted_time', 
FAMILIES => [{NAME => 'ids', COMPRESSION => 'LZO', VERSIONS => '1', 
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', 
BLOCKCACHE => 'true'}]}

Index crash_reports_index_unprocessed_flag

{NAME => 'crash_reports_index_unprocessed_flag', 
FAMILIES => [{NAME => 'ids', VERSIONS => '1', COMPRESSION => 'NONE', 
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', 
BLOCKCACHE => 'true'}, {NAME => 'processor_state', COMPRESSION => 'NONE', 
VERSIONS => '5', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', 
BLOCKCACHE => 'true'}]}

Table metrics

Records containing aggregate metrics for varying time intervals:

  • yyyy - yearly
  • yyyy-mm - monthly
  • yyyy-mm-dd - daily
  • yyyy-mm-ddThh - hourly
  • yyyy-mm-ddThh:mm - per minute
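As a rough illustration of these rowkey formats, the sketch below derives all five interval rowkeys from one timestamp. The function and constant names are hypothetical, and the page does not say which time zone the keys use, so the example just formats whatever datetime it is given.

```python
from datetime import datetime

# strftime patterns matching the five rowkey formats listed above.
# The level names used as dict keys are illustrative only.
METRIC_ROWKEY_FORMATS = {
    "yearly": "%Y",
    "monthly": "%Y-%m",
    "daily": "%Y-%m-%d",
    "hourly": "%Y-%m-%dT%H",
    "per_minute": "%Y-%m-%dT%H:%M",
}


def metric_rowkeys(when):
    """Return the metrics-table rowkey for each interval containing `when`."""
    return {
        level: when.strftime(pattern)
        for level, pattern in METRIC_ROWKEY_FORMATS.items()
    }


keys = metric_rowkeys(datetime(2010, 6, 7, 8, 9))
# keys["hourly"] is "2010-06-07T08"; keys["per_minute"] is "2010-06-07T08:09"
```

Each coarser rowkey is a string prefix of the finer ones, so a prefix scan on a daily key would also cover that day's hourly and per-minute rows.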

Special records:

  • crash_report_queues - contains metrics about the current state of the processing queues


Over time, we will expire finer grained records and roll them up into the next higher level. We should have lots of room to grow and we can figure out what sort of expiration we wish to set on these. The only argument for a strict expiration is that there are certain metrics which will only exist in the higher level stats (for instance, we wouldn't be able to generate hourly or per minute ADU numbers so those metrics would only exist in the daily records).

There are two column families in this table, 'counters:' and 'timestamps:'. Below is a list of currently planned metrics. If a counter is specified at a more precise time level, expect it to be aggregated up into the next higher level.

  • yyyy-mm-ddThh:mm
    • counters:submitted_crash_reports
    • counters:submitted_crash_reports_legacy_throttle_0 -- ACCEPT
    • counters:submitted_crash_reports_legacy_throttle_1 -- DEFER
    • counters:submitted_crash_reports_legacy_throttle_2 -- DISCARD
    • counters:submitted_crash_report_hang_pairs
    • counters:submitted_oop_plugin_crash_reports (similar columns for future oop crash types)
    • counters:crash_report_processing_errors
    • unprocessed_crash_report_queue_size (A metric for the oldest item in the queue would also be handy, but it isn't a "counter" per se, so it should have its own column family.)
  • yyyy-mm-ddThh
    • unique_crash_signatures (NOTE: This number would likely be recalculated via a MapReduce job for higher time levels to avoid having to store every level of unique count during processing time)
  • yyyy-mm-dd
    • firefox_active_installations (similar columns for other supported products?)
  • crash_report_queues
    • counters:inserts_unprocessed
    • counters:deletes_unprocessed
    • counters:inserts_unprocessed_priority
    • counters:deletes_unprocessed_priority
    • counters:inserts_unprocessed_legacy
    • counters:deletes_unprocessed_legacy
    • counters:inserts_processed_priority
    • counters:deletes_processed_priority
    • counters:inserts_processed_legacy
    • counters:deletes_processed_legacy
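The aggregation of fine-grained counters into the next higher level could look roughly like the sketch below. It is an in-memory illustration only: `roll_up` and `LEVEL_LENGTHS` are hypothetical names, rows are plain dicts rather than HBase rows, and the real job (how it is scheduled, and whether the finer rows are then expired) is not specified on this page.

```python
from collections import Counter, defaultdict

# Rowkey length at each level, coarsest first:
# yyyy, yyyy-mm, yyyy-mm-dd, yyyy-mm-ddThh, yyyy-mm-ddThh:mm
LEVEL_LENGTHS = [4, 7, 10, 13, 16]


def roll_up(rows):
    """Sum counters from rows at one time level into the next coarser level.

    `rows` maps rowkey -> {column: count}. Rowkeys that do not match a time
    level (such as special records) raise ValueError.
    """
    rolled = defaultdict(Counter)
    for rowkey, counters in rows.items():
        level = LEVEL_LENGTHS.index(len(rowkey))
        if level == 0:
            raise ValueError("yearly rows are already the coarsest level")
        # The parent rowkey is simply the prefix of the finer rowkey.
        parent = rowkey[:LEVEL_LENGTHS[level - 1]]
        rolled[parent].update(counters)  # Counter.update adds counts
    return {key: dict(counts) for key, counts in rolled.items()}


per_minute = {
    "2010-06-07T08:09": {"counters:submitted_crash_reports": 3},
    "2010-06-07T08:10": {"counters:submitted_crash_reports": 4},
}
hourly = roll_up(per_minute)
# hourly == {"2010-06-07T08": {"counters:submitted_crash_reports": 7}}
```

Plain summation only suits additive counters. As noted above, unique_crash_signatures would need to be recalculated for higher levels rather than summed.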

Table crash_report_signatures

The rowkeys of this table will consist solely of the signature calculated by the Python crash report processor. There are two special values:

  • ##empty## - The generated signature was an empty string.
  • ##null## - The processor failed and there was no generated signature.
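A minimal sketch of how a signature might be mapped to its rowkey, including the two special values. The function name is hypothetical, and representing "the processor failed" as Python's `None` is an assumption made for the example.

```python
EMPTY_SIGNATURE_ROWKEY = "##empty##"
NULL_SIGNATURE_ROWKEY = "##null##"


def signature_rowkey(signature):
    """Map a processor signature to a crash_report_signatures rowkey.

    Assumption: None stands for "the processor failed and produced no
    signature"; the page does not say how that case is represented.
    """
    if signature is None:
        return NULL_SIGNATURE_ROWKEY
    if signature == "":
        return EMPTY_SIGNATURE_ROWKEY
    return signature
```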

There is currently a single column family in this table, 'counters:'. Below is a list of currently planned metrics. Counters for all time levels will be incremented at the same time because it is the most efficient implementation. Additionally, there will be logic that will reach out to the metrics table and increment the unique_crash_signatures record for each time interval if this is the first time a crash report with this signature was seen for that interval.

  • hourly_yyyy-mm-ddThh - Columns for current + 48 previous
  • daily_yyyy-mm-dd - Columns for current + 30 previous
  • monthly_yyyy-mm - Columns for current + 2 previous (NOTE: My thoughts are calendar months are less useful for trend comparison due to # of days differences and signature patterns are likely not very relevant over longer time periods due to new application versions. Disagree?)
  • yearly_yyyy - Columns for current + 2 previous years
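The increment logic described above might be sketched as follows, using nested dicts in place of the two HBase tables. Everything here is illustrative: `record_signature` and `new_table` are hypothetical names, placing `unique_crash_signatures` under the 'counters:' family is an assumption, and "first time seen" is detected by the counter reaching 1, on the assumption that an atomic increment hands back the new value.

```python
from collections import defaultdict
from datetime import datetime

# Column-name prefixes and interval formats from the list above.
INTERVAL_FORMATS = {
    "hourly": "%Y-%m-%dT%H",
    "daily": "%Y-%m-%d",
    "monthly": "%Y-%m",
    "yearly": "%Y",
}


def new_table():
    """Hypothetical stand-in for a table: rowkey -> column -> counter."""
    return defaultdict(lambda: defaultdict(int))


def record_signature(signature_table, metrics_table, signature, when):
    """Increment the signature's counters for every time level at once."""
    row = signature_table[signature]
    for level, pattern in INTERVAL_FORMATS.items():
        interval = when.strftime(pattern)
        column = f"counters:{level}_{interval}"
        row[column] += 1
        new_value = row[column]
        if new_value == 1:
            # First report with this signature in this interval, so bump
            # the unique-signature count on the matching metrics row.
            metrics_table[interval]["counters:unique_crash_signatures"] += 1


signatures, metrics = new_table(), new_table()
record_signature(signatures, metrics, "foo", datetime(2010, 6, 7, 8, 9))
record_signature(signatures, metrics, "foo", datetime(2010, 6, 7, 8, 30))
record_signature(signatures, metrics, "bar", datetime(2010, 6, 7, 9, 0))
```

After these three calls, the `foo` row holds 2 in `counters:hourly_2010-06-07T08`, while the metrics row `2010-06-07` counts 2 unique signatures for the day.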

Open Questions

  • Are the suggested expirations of the signature metrics sufficient?
  • Are there any time levels we should drop entirely due to lack of potential uses?
  • We should have no problems storing between 1 and 1,000 columns in one of these column families. As such, we could also plan for having metrics regarding the number of CRs per product, product+version, OS, etc. We need to be reasonable, but we shouldn't leave anything important out.