=General HBase information=
==Required Reading==
* [http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture HBase architecture]
* [http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable Understanding HBase and BigTable]
* [http://labs.google.com/papers/bigtable.html Google's BigTable Whitepaper]
==Useful Links==
* [http://wiki.apache.org/hadoop/Hbase/ Apache HBase Wiki]
*; irc.freenode.net #hbase : Very friendly channel with lots of knowledgeable people
* [http://hadoop.apache.org/hbase/mailing_lists.html HBase mailing list]
* ... ? (Add some)
==Notes==
* Socorro uses the [http://wiki.apache.org/hadoop/Hbase/ThriftApi Thrift API] to allow the Python layer to interact with HBase. Every node in the cluster runs a Thrift server and they are all part of a VIP that only production servers can access (see the Python sketch after this list).
* Column families
** Example: A common column family Socorro uses is "ids:" and a common column qualifier in that family is "ids:ooid". Another column is "ids:hang".
** The table schema enumerates the column families that are part of it. The column family contains metadata about compression, number of value versions retained, and caching.
** A column family can store tens of thousands of values with different column qualifier names.
** Retrieving data from multiple column families requires at least one block access (disk or memory) per column family. Accessing multiple columns in the same family requires only one block access.
** If you specify just the column family name when retrieving data, the values for all columns in that column family will be returned.
** If a record does not contain a value for a particular column in a set of columns you query for, there is no "null", there just isn't an entry for that column in the returned row.
* Manipulating a row
** All manipulations are performed using a rowkey.
** Setting a column to a value will create the row if it doesn't exist or update the column if it already existed.
** Deleting a non-existent row or column is a no-op.
** Counter column increments are atomic and very fast.  StumbleUpon has some counters that they increment hundreds of times per second.
* Tables are always ordered by their rowkeys
** Scanning a range of a table based on a rowkey prefix or a start and end range is fast.
** Retrieving a row by its key is fast.
** Searching for a row requires a rowkey structure that you can easily do a range scan on, or a reverse index table.
** A full scan on a table that contains billions of items is slow (although, unlike an RDBMS, it isn't likely to cause performance problems).
** If you are continually inserting rows that have similar rowkey prefixes, you are beating up on a single RegionServer.  In excess, it is unpleasant.
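
The operations described in these notes can be sketched in a few lines of Python. Socorro itself talks to the generated Thrift bindings directly; the sketch below instead uses the third-party happybase library (which also speaks the HBase Thrift protocol) purely to keep the example short, and the host name, rowkeys, and dates are placeholders rather than production values.
<pre>
import happybase

# Connect to one of the Thrift servers (every node runs one; this host name is a placeholder).
connection = happybase.Connection('hbase-thrift.example.internal', port=9090)
metrics = connection.table('metrics')

# Counter columns are incremented atomically, so many processors can hit the
# same cell concurrently, thousands of times per second.
metrics.counter_inc(b'2010-06-01T14:05', b'counters:submitted_crash_reports')

# Tables are ordered by rowkey, so rowkey-prefix scans are cheap.  This pulls
# every hourly and per-minute row for one day.
for rowkey, columns in metrics.scan(row_prefix=b'2010-06-01T',
                                    columns=[b'counters']):
    print(rowkey, columns)

# Retrieving a single row by key is fast; columns with no value simply do not
# appear in the returned dict (there is no "null").
row = metrics.row(b'2010-06-01', columns=[b'counters:submitted_crash_reports'])
</pre>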


=Socorro HBase Schema=
Metrics are generated during the life-cycle of crash reports. While this schema will feel very odd to people more familiar with a traditional SQL approach, it is believed to perform well in HBase, where a single counter column can easily be atomically incremented up to thousands of times per second.

If you can think of additional useful counters or metrics, please add them to this page with a date and your name so we know who to contact for details (and blame... I mean give credit).

Currently, we have two tables mapped out which store metrics important to Socorro:

==Table ''metrics''==
The rowkeys in this table will consist of variable-length timestamps.
===records containing aggregate metrics for varying time intervals===
;yyyy : yearly
;yyyy-mm : monthly
;yyyy-mm-dd : daily
;yyyy-mm-ddThh : hourly
;<nowiki>yyyy-mm-ddThh:mm</nowiki> : per minute (I can't think of a use for anything finer than per minute)
===special records===
;crash_report_queue : contains metrics about the current state of the processing queue

The usage of the ''metrics'' table involves requesting the desired range of timestamp rows at whatever level you wish (daily, hourly, etc.) and then generating a chart or report based on the counter values found in each row.
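
To make the variable-length rowkeys concrete, here is a small Python sketch (the helper name is made up for this page); deriving all five key lengths from one timestamp lets a processor increment the same counter column at every level with one loop.
<pre>
from datetime import datetime

def metrics_rowkeys(ts):
    """Return the metrics-table rowkeys for one event, coarsest to finest."""
    return [
        ts.strftime('%Y'),               # yyyy             (yearly)
        ts.strftime('%Y-%m'),            # yyyy-mm          (monthly)
        ts.strftime('%Y-%m-%d'),         # yyyy-mm-dd       (daily)
        ts.strftime('%Y-%m-%dT%H'),      # yyyy-mm-ddThh    (hourly)
        ts.strftime('%Y-%m-%dT%H:%M'),   # yyyy-mm-ddThh:mm (per minute)
    ]

print(metrics_rowkeys(datetime(2010, 6, 1, 14, 5)))
# ['2010', '2010-06', '2010-06-01', '2010-06-01T14', '2010-06-01T14:05']
</pre>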


Over time, we will expire finer-grained records and roll them up into the next higher level. We should have plenty of room to grow, and we can figure out what sort of expiration we wish to set on these. The only argument for a strict expiration is that certain metrics will only exist in the higher-level stats (for instance, we wouldn't be able to generate hourly or per-minute ADU numbers, so those metrics would only exist in the daily records).


There are two column families in this table, 'counters:' and 'timestamps:'. Below is a list of currently planned metrics (a usage sketch follows the list). If a counter is specified at a more precise time level, expect it to be aggregated up into the next higher level.
* yyyy-mm-ddThh:mm
** counters:submitted_crash_reports
** counters:submitted_crash_reports_legacy_throttle_1 -- DEFER
** counters:submitted_crash_reports_legacy_throttle_2 -- DISCARD
** counters:submitted_crash_report_hang_pairs
** counters:submitted_oop_plugin_crash_reports (similar columns for future oop crash types)
** counters:crash_report_processing_errors
* yyyy-mm-ddThh
* yyyy-mm-dd
** firefox_active_installations (similar columns for other supported products?)
* crash_report_queue
** counters:current_unprocessed_size
** counters:current_legacy_unprocessed_size
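
As a usage sketch: pull one day of hourly rows with a rowkey range scan and turn a single counter into a report. happybase is again used only for brevity, and the host, date, and choice of counter are illustrative assumptions.
<pre>
import struct
import happybase

connection = happybase.Connection('hbase-thrift.example.internal')
metrics = connection.table('metrics')

# Hourly submission volume for 2010-06-01.  The rowkey range covers the 24
# hourly rows; per-minute rows fall in the same range, so filter by key length.
report = {}
for rowkey, columns in metrics.scan(row_start=b'2010-06-01T00',
                                    row_stop=b'2010-06-01T24',
                                    columns=[b'counters:submitted_crash_reports']):
    if len(rowkey) != len(b'2010-06-01T00'):
        continue
    raw = columns.get(b'counters:submitted_crash_reports')
    # Counter cells hold an 8-byte big-endian long, not an ASCII number.
    report[rowkey.decode('ascii')] = struct.unpack('>q', raw)[0] if raw else 0

for hour in sorted(report):
    print(hour, report[hour])
</pre>
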
==Table ''crash_report_signatures''==
The rowkeys of this table will consist solely of the signature calculated by the Python crash report processor. There are two special values:
*; ##empty## : The generated signature was an empty string.
*; ##null## : The processor failed and there was no generated signature.


The usage of the ''crash_report_signatures'' table involves looking up a particular signature and then iterating through a filtered list of the desired 'counters:' columns to generate a chart or report.

There is currently a single column family in this table, 'counters:'. Below is a list of currently planned metrics (an increment sketch follows the list). Counters for all time levels will be incremented at the same time because it is the most efficient implementation. Additionally, there will be logic that will reach out to the ''metrics'' table and increment the unique_crash_signatures record for each time interval if this is the first time a crash report with this signature was seen for that interval.
* hourly_yyyy-mm-ddThh - Columns for current + 48 previous
* daily_yyyy-mm-dd - Columns for current + 30 previous
* monthly_yyyy-mm - Columns for current + 2 previous (NOTE: My thoughts are that calendar months are less useful for trend comparison due to differences in the number of days, and signature patterns are likely not very relevant over longer time periods due to new application versions. Disagree?)
* yearly_yyyy - Columns for current + 2 previous years
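
The increment side described above can be sketched as follows (hypothetical helper name, happybase again used only for brevity; Socorro's real processor code may differ in detail). The post-increment value of each counter tells us whether this is the first crash report with this signature for that interval, which is when the ''metrics'' table's unique_crash_signatures record gets bumped.
<pre>
from datetime import datetime

import happybase

connection = happybase.Connection('hbase-thrift.example.internal')
signatures = connection.table('crash_report_signatures')
metrics = connection.table('metrics')

def record_crash_signature(signature, ts):
    """Hypothetical helper: count one processed crash report against its signature."""
    # Special rowkeys from the schema above.
    if signature is None:
        rowkey = '##null##'      # the processor failed to generate a signature
    elif signature == '':
        rowkey = '##empty##'     # the generated signature was an empty string
    else:
        rowkey = signature

    # Increment every time level in one pass; each column name carries its own
    # timestamp, e.g. counters:hourly_2010-06-01T14.
    levels = {
        'hourly_':  ts.strftime('%Y-%m-%dT%H'),
        'daily_':   ts.strftime('%Y-%m-%d'),
        'monthly_': ts.strftime('%Y-%m'),
        'yearly_':  ts.strftime('%Y'),
    }
    for prefix, stamp in levels.items():
        column = ('counters:%s%s' % (prefix, stamp)).encode('ascii')
        new_value = signatures.counter_inc(rowkey.encode('utf-8'), column)
        if new_value == 1:
            # First report with this signature in this interval: bump the
            # aggregate in the metrics table for the matching timestamp row.
            metrics.counter_inc(stamp.encode('ascii'),
                                b'counters:unique_crash_signatures')

record_crash_signature('example::Signature()', datetime.utcnow())
</pre>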


===Open Questions===
* Format of timestamps in column names
** Do we prefer using the T separator between date and time? (I found this being done in the submitted_timestamp value of the CR metadata so I figured it might be preferred.)
** What about a colon between hour and minute?
** Using the W prefix for yyyy-week seems necessary to prevent confusing the first week of the year with January of that year. That said, the column already has the prefix 'weekly' in it. Thoughts?
* Are the suggested expirations of the signature metrics sufficient?
* Are there any time levels we should drop entirely due to lack of potential uses?
* We should have no problems storing between 1 and 1,000 columns in one of these column families.  As such, we could also plan for having metrics regarding the number of CRs per product, product+version, OS, etc.  We need to be reasonable, but we shouldn't leave anything important out.