Breakpad/Design/Database

< Breakpad‎ | Design

Breakpad Database

ER draft: breakpad-er.jpg

Note that many of these look like 1-to-1 relationships but it means that per-report it has one value. Across many reports, values vary but only slightly. There would be enormous amounts of duplication for the 1-to-1 relationships in the diagram, which was why they were not kept as attributes of the report entity.

Notes

  • Need bug numbers (will store in separate table after M1)
    • Relate to crash signature -- but what is a crash signature?
    • If there is a function name, then it's the function name
    • If not, it's the source line
    • If not it's the module name/address
    • bsmedberg
  • Need to store extra frames in a blob (store > 10 frames in dumps table)
  • Need extensions and DLLs stored somewhere (will store in separate table after M1)
    • ext id, version number
    • dll's -- filename, version, base addr, end addr, whether it was the main module of process -- varies by platform
  • Report fields to add
    • Time since last crash
    • Time since install

New Questions

  • Stack hash to uniquely identify frames and their association to a report
  • Whether or not to throw frame information into text field -- are we doing to have to do detailed reporting across frames
  • Subsequent partitioning of frames table once you get past reports
  • How many frames on average do we have per report?
    • About 10 frames per reports, then store the rest in a blob somewhere for the rest of the frames
  • What are the attributes for frame -- what type of data? What is the avg size of frames? Look for arbitrarily large info.
  • What kind of field is CPU? How much do these vary?

Not sure about:

  • How much we need to break apart the frames table?
    • What you have here looks about right, imho. - Crowder
  • How normalized do we want things?
    • Err on the side of more rather than less. - Crowder
  • For id->name pairings is it worth pulling the values out since data will be duplicated heavily for things like product, platform, os, etc.? To reduce the size of the main reports table I think it's necessary but I'm open to being convinced otherwise.
    • For things where you can save time and space, normalize here, otherwise don't. - Crowder
  • Do comment and reason need to be separated from the report table as well?
  • Need some additional tables not talked about
    • Crash groupings by address, module, function?
    • For each of these groupings -- bugs open, bugs closed table
    • Stored bugzilla queries in each of these groupings pointing to a query that brings up all related bugs
    • A table (or set of tables) recording all the DLLs associated and captured with each report (perhaps DLL ID/version, then normalize DLL names to elsewhere) Crowder

PostgreSQL 8.2 or higher should be used because it offers things that MySQL does not have:

  • Partitioning if we need to scale the database in the future, which will likely happen if we plan on keeping more than 3 months worth of data. For example, we could partition by date and have the db create a new child table for each new month as time goes on, then pop off outdated partitions once they become too old but keep them in another db for since-beginning-of-time queries. Popping the old tables off the "stack" could be done using a backend script/procedure/cron -- whatever.
  • Arguably better performance for larger data sets (read Google if you want to, but most of this lies in partitioning and bitmask joins)
  • More precise datatypes (adhering to SQL standards is a plus)
  • Long-standing support for stored procedures
  • Views for databases that are heavily normalized (helps in simple reporting!)

Designing the schema is an important task and shouldn't be taken lightly. In general, there are typically two routes to take:

  • Normalize, come up with a snowflake-like schema, abstract up-front but better long-term
  • Leave denormalized, just swallow inefficiencies in the short term and pay the price later on
    • As mentioned above, I vote for the former. - Crowder

How we can have both:

  • Normalize properly, generating a properly abstracted set of tables with partitions
  • Use views to pull together common queries used for reporting

The current schema has:

  • individual minidump records
  • product/OS/buildid information
  • control information used to "register" an incoming minidump and monitor/track/control its progress though all the stages from being transfered from the client until it is fully collected and processed on the server and available for reporting and analysis.

A major concern about over-normalization is data insertion and the problems it presents for the processor, which should probably be transactional and may have to do lookups and some db magic in order to successfully insert new records.

  • Should be able to cache repetitive stuff like OS in whatever tool is talking to the DB to save it some work, if necessary - Crowder

The lookups would be on stuff like product name, build, platform, os, locale, url, etc. I am interested in getting all this stuff out of the main table mainly because there is going to be a ridiculous amount of duplication.

I think there are two options here:

  • Just dynamically add non-existent data if it hasn't been previously recorded, which could end up creating a bunch of messed up records depending on what we get back from clients
  • Ignore stuff we haven't already accounted for, which ensures ACID but could eliminate things we "forgot" to prefill

I am not sure if this is the same problem as below -- I'd like to talk more about this in the meeting (this was on this page from before):

  • Using the same database and schema for the minidump data and also for control over the processing of the minidumps has created problems in the past. when coruption is introduced, database performance problems crop up, or maintenance is required to manage the collection of minidumps, the processing of incoming minidumps is hindered and we start into a downward spiral of compounding problems. We should do some thinking to see if there are ways to improve on the existing design.

Does Address need to be separated out? How unique is it in our existing crash reports?

We know we'll need a separate table to store the list of extensions installed per crash report.

We also may want to keep track of which DLLs belong to which products in a separate table. This will allow us to identify frankenfoxes, as well as specific plugins.