CrashKill/WorkWeek2012/Socorro

From MozillaWiki

Socorro Review

Things on the work queue

  • Rapid betas
  • Opening up the API
    • REST API - we can write our own tools against it
    • This is almost done - limited to crash analysis
    • Does not depend on the Elasticsearch work
    • Gives us an API to use, so the front-end rewrite can proceed
    • Targeted for Q3
  • ConfigMan
    • Socorro consists of a bunch of standalone and server apps
    • Other companies are interested in using Socorro
    • Some people who adopt Socorro don't want to use HBase
    • Wrote a system that unifies all the different config methods in Python
    • A given system can be configured to use a specific Python class
    • You can specify which back-end components you want to use. Makes it more flexible for others, e.g. people who get about 10 crashes a day.
    • In development for a year - will deliver a processor and monitor that use this system within 2 weeks. Run the old and new processors in parallel for a while
    • All apps using ConfigMan within 6 months
  • Crontabber
    • Socorro has lots of cron jobs that run
    • ADUs from metrics, matching up bugs and signatures
    • Had a few problems - failures
    • Self-healing - knows when it last worked. Understands how many times it's failed.
    • Shouldn't need manual intervention
    • Building on ConfigMan
    • The cron jobs themselves are just a list of names/scripts
    • Motivation - write cron jobs that depend on other cron jobs
    • It enables us to write cron jobs to generate our own correlation reports
  • JSON mini dumps
    • Most of the code is written
    • Use of this will be enabled by ConfigMan
  • Symbol storage
    • Everything stored as flat files
    • 2 terabytes of data storage
  • Queuing
    • Don't have a proper queuing system to queue stuff for processing
    • Right now we use a hack
    • Invested a bunch of time investigating how to do this better
    • Most queues are designed to hold smaller items than crash dumps
    • Considering: Hazelcast (cluster queue); Metrics uses Bagheera
    • Services is building a queue to support notifications in Firefox, with Socorro's requirements in mind.
    • Need boxes to set it up on
    • Probably move to this queuing system by the end of the year.
    • This will all be ConfigMan-able.
  • B2G
  • WebRT
  • Better admin UI
  • New reports
    • Correlation reports
    • Explosiveness reports
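
The Crontabber notes above describe jobs that depend on other jobs and that remember when they last worked and how often they have failed. A minimal sketch of that idea follows. It is not the actual Crontabber code: `Job`, `run_all`, and the `run_id` counter are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Job:
    """Hypothetical job record; not Crontabber's real data model."""
    name: str
    action: Callable[[], None]
    depends_on: List[str] = field(default_factory=list)
    last_success: Optional[int] = None  # id of the last run that succeeded
    error_count: int = 0                # consecutive failures so far


def run_all(jobs: Dict[str, Job], run_id: int) -> List[str]:
    """Run each job once, dependencies first. A job is skipped when any of
    its dependencies did not succeed in this run. Returns the names of the
    jobs that succeeded, in execution order."""
    succeeded: List[str] = []
    visited: Set[str] = set()

    def visit(name: str) -> None:
        if name in visited:
            return
        visited.add(name)
        job = jobs[name]
        for dep in job.depends_on:
            visit(dep)
        if any(jobs[dep].last_success != run_id for dep in job.depends_on):
            return  # dependency failed or was skipped; try again next run
        try:
            job.action()
        except Exception:
            job.error_count += 1  # remember the failure, no manual step needed
        else:
            job.last_success = run_id
            job.error_count = 0
            succeeded.append(name)

    for name in jobs:
        visit(name)
    return succeeded
```

Because a failed job only updates its own counters, the next scheduled run retries it and its dependents without anyone stepping in.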

Rapid Betas

  • Do we want the old reporting? Probably yes - we will likely still need the old-style reports.
  • Daily crashes - crashes by crash date by product, version and time window
  • Only count crashes for 7 days after the build
  • DB and UI work in progress; middleware still to do.
  • 3 main reports/views
    • by build id (1, 3, 7 day)
    • by crash date
    • by beta
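
The by-build-id report with 1, 3 and 7 day windows could be computed roughly as below. This is only a sketch: the function name, the input shapes, and the reading of "N days after the build" as a crash age of 0 to N-1 whole days are all assumptions, not Socorro's implementation.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple

WINDOWS = (1, 3, 7)  # day windows named in the notes


def crashes_by_build(
    crashes: Iterable[Tuple[str, date]],
    build_dates: Dict[str, date],
) -> Dict[int, Counter]:
    """For each window, count crashes per build id that happened within
    that many days after the build date. Crashes older than the largest
    window (7 days) are not counted at all."""
    counts: Dict[int, Counter] = {w: Counter() for w in WINDOWS}
    for build_id, crash_date in crashes:
        age = (crash_date - build_dates[build_id]).days
        for window in WINDOWS:
            if 0 <= age < window:  # assumption: day 0 is the build day
                counts[window][build_id] += 1
    return counts
```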

Signature generation

  • Using signatures for bucketing
  • When minidump_stackwalk runs, we get the pipe dump (pipe-delimited output).
  • Each line of the stack dump has: line number, address, module name, function name
  • Sometimes some of this info is missing - use what we have
  • Signature is limited to 250 characters
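
As a rough illustration of the steps above, the sketch below builds a signature from pipe dump lines, falling back to whatever fields are present and capping the result at 250 characters. The field order (`thread|frame|module|function|file|line|offset`), the fallback order, the `" | "` separator, and the default depth of 3 frames are assumptions made for this example; the real Socorro rules are more involved.

```python
from typing import List

MAX_SIGNATURE_LENGTH = 250  # limit stated in the notes


def frame_label(line: str) -> str:
    """Pick the best available label for one pipe-delimited frame line.
    Assumed field order: thread|frame|module|function|file|line|offset."""
    fields = (line.split("|") + [""] * 7)[:7]  # pad short lines
    _thread, _frame, module, function, source, lineno, offset = fields
    if function:
        return function
    if source and lineno:
        return f"{source}#{lineno}"
    if module:
        return f"{module}@{offset}" if offset else module
    return f"@{offset}" if offset else "(unknown)"


def make_signature(frame_lines: List[str], depth: int = 3) -> str:
    """Join the labels of the top `depth` frames and truncate the result."""
    labels = [frame_label(line) for line in frame_lines[:depth]]
    return " | ".join(labels)[:MAX_SIGNATURE_LENGTH]
```

Crashes can then be bucketed by using the returned string as a dictionary key.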

Elasticsearch

  • Should be much faster - first implementation will be behind existing UI
  • Will be doing a whole new UI
  • Full text indexing so can search on anything
  • Worked on some interfaces for the search UI - interactive search UI
  • Save searches by copying URL
  • Still blocked on IT for several things. We have boxes but can't get them to work correctly
  • PostgreSQL and Elasticsearch will both be available for some time, so we can switch back if something fails
  • Quite a few open bugs tagged with "search" are all blocked on Elasticsearch
  • Want to be able to search on different levels of stack frames
  • Want to be able to group stacks together and search for the top 2-3 signatures appearing together
  • OOM allocation size - want to look at all the crashes in each release, select those where the OOM allocation size is below a certain amount, and show a histogram by allocation size. Can then compare the ones with variable vs. constant allocation size. Ben will file a bug for that.
  • Flash crashes - always off the main thread. Want to know what is happening on the main thread: create a separate signature for the main thread and bucket these crashes by it.
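
The OOM allocation size request amounts to a filter followed by a histogram. A small sketch of that computation, independent of any search back-end, is shown below; `oom_histogram`, its parameters, and the fixed-width bucketing are hypothetical choices for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable


def oom_histogram(sizes: Iterable[int], max_size: int, bucket: int) -> Dict[int, int]:
    """Count allocation sizes below `max_size`, grouped into buckets that
    are `bucket` bytes wide. Keys are the lower bound of each bucket."""
    hist = Counter((size // bucket) * bucket for size in sizes if 0 <= size < max_size)
    return dict(sorted(hist.items()))
```

A release where nearly all crashes land in one bucket suggests a constant allocation size, while a wide spread suggests a variable one.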