CrashKill/WorkWeek2012/Socorro
From MozillaWiki
Socorro Review
Things on the work queue
- Rapid betas
- Opening up the API
- A REST API lets us write our own tools
- Almost done; limited to crash analysis
- Does not depend on the Elasticsearch work
- Gives us an API to use, so the front-end rewrite can proceed
- Targeted for Q3
- Configman
- Socorro consists of a bunch of standalone and server apps
- Other companies interested in using Socorro
- Some people who adopt Socorro don't want to use HBase
- Wrote a system that unifies all the different config methods in Python
- A given system can be configured to use a specific Python class
- You can specify which back-end components you want to use. This makes it more flexible for others, i.e. people who have around 10 crashes a day.
- In development for a year. Plan to deliver a processor and monitor that use this system within 2 weeks, then run the old and new processors in parallel for a while
- All apps using Configman within 6 months
- Crontabber
- Socorro runs lots of cron jobs
- Examples: ADUs from Metrics, matching up bugs and signatures
- Have had a few problems with failures
- Self-healing: knows when a job last worked and how many times it has failed
- Shouldn't need manual intervention
- Built on Configman
- The cron jobs themselves are just a list of names/scripts
- Motivation: write cron jobs that depend on other cron jobs
- Enables us to write cron jobs that generate our own correlation reports
- JSON minidumps
- Most of the code is written
- Use of this will be enabled by Configman
- Symbol storage
- Everything stored as flat files
- 2 terabytes of data storage
- Queuing
- Don't have a proper queuing system for crashes awaiting processing
- Right now we use a hack
- Spent a lot of time investigating how to do this better
- Most queues are designed to store smaller items than crashes
- Considering Hazelcast - Metrics uses Bagheera - cluster queue
- Services is building a queue to support notifications in Firefox, with Socorro's requirements in mind.
- Need boxes to set it up on
- Will probably move to this queuing system by the end of the year.
- This will all be configurable via Configman.
- B2G
- WebRT
- Better admin UI
- New reports
- Correlation reports
- Explosiveness reports
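The Crontabber goals listed above (jobs that depend on other jobs, and bookkeeping of when a job last worked and how often it has failed) can be illustrated with a minimal sketch. This is not Crontabber's actual code or API; `JobState`, `run_pass`, and the job names in the usage note are hypothetical, and the sketch assumes jobs are listed with dependencies first.

```python
from datetime import datetime


class JobState:
    """Bookkeeping for one job: when it last worked and how often it has failed since."""

    def __init__(self):
        self.last_success = None
        self.error_count = 0


def run_pass(jobs, state, now):
    """Attempt every job once, in list order (hypothetical helper).

    jobs  -- list of (name, depends_on, func) tuples, dependencies listed first
    state -- dict mapping job name to JobState, kept between passes
    now   -- datetime recorded as the success time for this pass
    Returns the names of the jobs that succeeded in this pass.
    """
    succeeded = []
    for name, depends_on, func in jobs:
        job_state = state.setdefault(name, JobState())
        # A job only runs when all of its dependencies succeeded in this pass.
        if any(dep not in succeeded for dep in depends_on):
            continue
        try:
            func()
        except Exception:
            # Record the failure; the next pass retries without manual intervention.
            job_state.error_count += 1
            continue
        job_state.last_success = now
        job_state.error_count = 0
        succeeded.append(name)
    return succeeded
```

For example, a hypothetical `correlations` job declared as depending on an `adu` job is skipped in any pass where `adu` fails, and both run on the next pass in which `adu` succeeds.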
Rapid Betas
- Do we want the old reporting? Probably yes; we will likely still need the old-style reports.
- Daily crashes - crashes by crash date by product, version and time window
- Only count crashes for 7 days after the build
- Database and UI work are in progress; the middleware still needs to be done.
- 3 main reports/views
- by build id (1, 3, 7 day)
- by crash date
- by beta
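The "only count crashes for 7 days after the build" rule, with the 1, 3 and 7 day views by build id, can be sketched as follows. This is an illustration, not Socorro's implementation: `count_by_build` and its arguments are hypothetical, and it assumes the window includes the build date itself, so a 7-day window covers crash ages 0 through 6 days.

```python
from collections import Counter
from datetime import date


def count_by_build(crashes, build_dates, window_days=7):
    """Count crashes per build id that fall inside the window after the build.

    crashes     -- iterable of (build_id, crash_date) pairs
    build_dates -- dict mapping build_id to the date the build was made
    window_days -- assumed window length; day 0 is the build date itself
    """
    counts = Counter()
    for build_id, crash_date in crashes:
        built: date = build_dates.get(build_id)
        if built is None:
            continue  # unknown build: nothing to measure the window against
        age_in_days = (crash_date - built).days
        if 0 <= age_in_days < window_days:
            counts[build_id] += 1
    return counts
```

Calling it with `window_days` set to 1, 3 and 7 yields the three by-build-id views from the same crash list.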
Signature generation
- Using signatures for bucketing
- When minidump_stackwalk runs, we get the pipe dump.
- Each line of the stack dump gives the line number, address, module name, and function name
- Sometimes some of that info is missing, so we grab what we have
- Signatures are limited to 250 characters
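The steps above can be sketched as follows. This is a simplified illustration rather than Socorro's actual signature generator: the pipe-delimited field order (`thread|frame|module|function|source file|line number|address`), the `" | "` separator, the `max_frames` cutoff, and the `...` truncation marker are all assumptions made for the example; only the 250-character limit comes from the notes.

```python
MAX_SIGNATURE_LENGTH = 250  # limit stated in the notes


def frame_label(line):
    """Return the best available label for one pipe-delimited stack frame line.

    Assumed field order: thread|frame|module|function|source file|line|address.
    """
    fields = line.rstrip("\n").split("|")
    fields += [""] * (7 - len(fields))  # pad short lines so indexing is safe
    module, function, address = fields[2], fields[3], fields[6]
    if function:
        return function
    # Function name missing: grab whatever identifying info we do have.
    if module and address:
        return f"{module}@{address}"
    if module:
        return module
    return f"@{address}" if address else ""


def make_signature(lines, max_frames=3):
    """Join the labels of the first few usable frames into one signature."""
    labels = []
    for line in lines:
        label = frame_label(line)
        if label:
            labels.append(label)
        if len(labels) == max_frames:
            break
    signature = " | ".join(labels)
    if len(signature) > MAX_SIGNATURE_LENGTH:
        signature = signature[:MAX_SIGNATURE_LENGTH - 3] + "..."
    return signature
```

Crashes that produce the same signature string land in the same bucket, which is why frames with missing information still need some stable label.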
Elasticsearch
- Should be much faster; the first implementation will sit behind the existing UI
- Will be doing a whole new UI
- Full-text indexing, so we can search on anything
- Worked on some interfaces for an interactive search UI
- Save searches by copying the URL
- Still blocked on IT for several things. We have boxes but can't get them to work correctly
- Will keep both Postgres and Elasticsearch available for some time so we can switch back if something fails
- Quite a few open bugs tagged with "search" are all blocked on Elasticsearch
- Want to be able to search on different levels of stack frames
- Want to be able to group stacks together and search for the top 2-3 signatures appearing together
- OOM allocation size: want to look at all the crashes in each release, find those where the OOM allocation size is below a certain amount, and show a histogram by allocation size. Can then compare the ones with variable vs. constant allocation sizes. Ben will file a bug for that.
- Flash crashes are always off the main thread. We want to know what is happening on the main thread: create a separate signature and bucket these crashes by the one on the main thread.
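The OOM allocation size request above amounts to a filtered histogram. A minimal sketch, assuming the crashes for one release are already loaded as dicts; the field name `oom_allocation_size` and the function `oom_histogram` are hypothetical and do not reflect an existing Socorro or Elasticsearch query:

```python
from collections import Counter


def oom_histogram(crashes, max_size, bucket_width):
    """Histogram of OOM allocation sizes below max_size.

    crashes      -- iterable of dicts; 'oom_allocation_size' is a hypothetical field
    max_size     -- only allocations strictly below this many bytes are counted
    bucket_width -- width of each histogram bucket in bytes
    Returns a dict mapping each bucket's lower bound to its crash count.
    """
    histogram = Counter()
    for crash in crashes:
        size = crash.get("oom_allocation_size")
        if size is None or size >= max_size:
            continue  # not an OOM crash, or allocation too large to be of interest
        bucket_start = (size // bucket_width) * bucket_width
        histogram[bucket_start] += 1
    return dict(sorted(histogram.items()))
```

A signature whose crashes all fall into one bucket points to a fixed-size allocation, while a spread across buckets points to a variable-size one.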