Socorro/Speeding Up Postgres Search

The Way It Is Now

Currently, many common Advanced Search (/query) searches return extremely slowly, and some of them time out. The classic case is "All Firefox Crashes in the last 3 weeks"; while the result is paginated, enumerating these crashes takes quite a while.

Our long-term plan for fixing this has been to deploy Elastic Search. However, at this point we still don't have a trustworthy date for bringing ES online. Moreover, many public Socorro users may not want to deploy ES on top of Postgres and HBase. As such, it makes sense for us to try to speed up Postgres search. What follows is a multi-step approach to making Postgres-based search much faster.

Phase I: Quick Fixes

These are all things we could do immediately to speed up search, as opposed to Phase II items which would need to wait on other things.

Indexing

Currently the reports table's indexes are not at all optimized for the searches we actually run on it. For one thing, because every query includes a date range, most indexes should be composite indexes with date_processed as the second column. Looking at the table usage stats shows that most of the existing indexes are not being used.

The first step of this would be to create and capture a set of queries which represent typical searches. The second step would be to manipulate the indexes on the reports table and run timing tests on all of those searches.
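
As a rough illustration of the kind of composite indexes meant here (the exact column choices would come out of the timing tests, and are assumptions at this point):

  -- Illustrative only: exact columns and index names would be chosen from the
  -- captured set of typical searches. This assumes the reports table has
  -- product, signature and date_processed columns.

  -- Composite index for "all crashes for a product in a date range":
  CREATE INDEX CONCURRENTLY reports_product_date
      ON reports (product, date_processed);

  -- Composite index for signature-based searches constrained by date:
  CREATE INDEX CONCURRENTLY reports_signature_date
      ON reports (signature, date_processed);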

  • Personnel Required: Josh, rhelmer or mbrandt
  • Time: 8-12 hours for Josh, 2-4 hours for other staff
  • Gain: Minor speedup on most searches. If combined with Approximate Pagination, though, gains could be up to a 25X speedup on current pathological searches.
  • Drawbacks: development time required, due to the slow pace of testing new indexes on DevDB. This would be sped up by the move to PHX.

Queueing Searches

Lonnen has a proposal to queue search requests and run only a fixed number of them concurrently. This would prevent server saturation due to reloads or screen-scraping, and would also allow us to avoid timeouts on long-running searches.
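
The queueing itself would live in the application layer; purely as an illustration of capping concurrency, here is a sketch of how a search worker could claim one of a fixed number of "slots" using Postgres advisory locks (the function name, lock namespace, and slot count are made up):

  -- Illustration only; the actual proposal queues requests in the application.
  -- A worker that gets NULL back would wait in the queue and retry.
  CREATE OR REPLACE FUNCTION acquire_search_slot(max_slots int DEFAULT 4)
  RETURNS int LANGUAGE plpgsql AS $$
  DECLARE
      slot int;
  BEGIN
      FOR slot IN 1 .. max_slots LOOP
          IF pg_try_advisory_lock(42, slot) THEN
              RETURN slot;   -- got a slot; run the search, then unlock it
          END IF;
      END LOOP;
      RETURN NULL;           -- all slots busy; caller queues or retries
  END;
  $$;

  -- After the search completes, the same session releases its slot:
  --   SELECT pg_advisory_unlock(42, <slot>);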

  • Personnel Required: Lonnen, mbrandt
  • Time: ???
  • Gain: improved user experience, avoidance of bottleneck behavior.
  • Drawbacks: Might increase user-perceived response time for reports which are currently quite fast.

Approximate Pagination

Currently, the queries that time out are mostly not the data-returning queries but the COUNT(*) queries used to determine pagination. The standard way to solve this is to use approximate pagination, where we do not actually run an exhaustive count to determine how many results there are.

The way this would work is:

  1. user submits a search
  2. mware runs an EXPLAIN to estimate the number of rows returned by the query.
  3. if the number is small, mware runs a full COUNT(*)
  4. if the number is large, mware uses the estimate as if it were the real count
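
A minimal sketch of steps 2-4, using the standard trick of parsing the planner's row estimate out of EXPLAIN output (the function name and the threshold are illustrative):

  -- Pull the planner's row estimate for an arbitrary query out of EXPLAIN.
  CREATE OR REPLACE FUNCTION count_estimate(query text)
  RETURNS integer LANGUAGE plpgsql AS $$
  DECLARE
      rec  record;
      rows integer;
  BEGIN
      FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
          rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
          EXIT WHEN rows IS NOT NULL;
      END LOOP;
      RETURN rows;
  END;
  $$;

  -- mware logic, roughly (threshold is illustrative):
  --   est := count_estimate('SELECT * FROM reports WHERE ...');
  --   if est < 5000 then run the exact COUNT(*), else use est for page count.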

The main problem with this approach is that the user could potentially navigate to pages of results which don't exist. In this case, the interface would need to automatically renumber the pages and shift the user to the last actual page. The second problem is that the estimate could be too small, leaving the user unable to navigate to pages of data which do exist. In that case, Socorro would need to check for additional pages by selecting one row from the next page when the user reaches the last page; if that row exists, another page number is added.

Since we do not currently display a count of results to the user, this would not need to be changed.

  • Personnel Required: Josh, Brandon/Lonnen/Adrian (TBD)
  • Time: DB portion: 3-5 hours. mware/UI: ???
  • Gains: would improve performance of reports/list as well as search. Possible elimination of timeouts. Lowered load on DB server.
  • Drawbacks: confusing-to-user pagination behavior.

Phase II: Long-Term Fixes

Elastic Search

See above.

reports_clean

The Socorro Postgres database now contains a normalized, compact version of the Reports table called "reports_clean". On some sample tests, pathological searches which take over a minute against reports complete in less than 5 seconds against reports_clean. Also, since the data in reports_clean is cleaned up and normalized, the code and queries required to search it are much simpler, improving response times and code maintenance. This would allow us to add new search features faster.
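
For illustration, a typical "all Firefox crashes in the last 3 weeks" search against reports_clean might look roughly like the following (the join and column names are assumptions about the normalized schema):

  -- Sketch only; lookup table and column names are assumptions.
  SELECT rc.uuid, rc.date_processed, s.signature
  FROM reports_clean rc
      JOIN signatures s USING (signature_id)
      JOIN product_versions pv USING (product_version_id)
  WHERE pv.product_name = 'Firefox'
    AND rc.date_processed > now() - interval '3 weeks'
  ORDER BY rc.date_processed DESC
  LIMIT 100;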

Further, if reports_clean were the main source for search, as it is now the main source for aggregate reports, it would be possible to retain only 2 weeks of "raw" data (reports, plugins, extensions) in the Postgres socorro database. This would shrink the database by 60% or more, making it much more portable and requiring fewer IT resources. Yes, we'd finally "get rid of" the reports table, using it only as a buffer for newly landed reports.

Therefore, a way to speed up Postgres search by as much as 15X is to move search to reports_clean instead of reports. However, there are a number of obstacles to be overcome first, which is why this is not an immediate solution. The obstacles, and plans to overcome them, are detailed below.

  • People required: Josh, mware developer, UI developer, Lars.
  • Time required: difficult to estimate, see Obstacles
  • Gain: 15X speedup, plus code simplification and storage savings.
  • Drawbacks: See Obstacles below

Obstacle 1: reports_clean lag

Currently reports_clean updating is an hourly batch job, trailing between 2 and 3 hours behind newly received crashes. While this schedule is fine for daily aggregates, it's not adequate for developers wanting to search on just-sent-in crashes. Search needs to be at most a few minutes behind.

There are a couple of reasons we do batch processing of reports_clean, however, and those would need to be addressed in any new processing system:

  1. update_reports_clean updates a number of shared lookup lists which would become a source of locking contention in any parallel, crash-at-a-time processing method;
    1. How do we avoid causing all the processors to grab locks when a new thing (a new version of Flash, a new version of Windows) shows up?
  2. at peak load, processors consume 100% of CPU, so we would need more processors;
  3. batch processing is much more efficient than crash-at-a-time processing;
  4. crashes sometimes arrive well out of date_processed order and must not be missed (the concern has to do with the behavior of materialized views/aggregates and data that arrives late).

Given issue 1, above, having the processors handle generating reports_clean for each crash would be quite challenging; we've been removing points of locking contention from the processors over the last year and would be loath to add any. Thus, an alternative approach is called for:

  1. As each crash is processed and inserted into the Postgres database, the processors, or a trigger, insert the UUID into a queue (using Postgres pgQ).
  2. Every 1000 crashes, or every minute (whichever comes first), a single-threaded batch process fires up and processes the pending crashes out of the queue into reports_clean.
  3. An additional clean-up batch process runs once an hour to look for crashes from the last 6 hours which were somehow missed and adds them to reports_clean.

This approach allows us to keep the benefits of batch processing, while being no more than a few minutes behind on search.
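
A simplified sketch of the queueing step, using a plain queue table and trigger rather than pgQ; all table, trigger, and function names here are assumptions:

  -- Queue of crash UUIDs waiting to be folded into reports_clean.
  CREATE TABLE reports_clean_queue (
      uuid      text NOT NULL,
      queued_at timestamptz NOT NULL DEFAULT now()
  );

  -- Trigger on reports: enqueue each newly inserted crash.
  CREATE OR REPLACE FUNCTION queue_for_reports_clean()
  RETURNS trigger LANGUAGE plpgsql AS $$
  BEGIN
      INSERT INTO reports_clean_queue (uuid) VALUES (NEW.uuid);
      RETURN NEW;
  END;
  $$;

  CREATE TRIGGER reports_clean_queue_trg
      AFTER INSERT ON reports
      FOR EACH ROW EXECUTE PROCEDURE queue_for_reports_clean();

  -- The single-threaded batch job claims up to 1000 queued UUIDs at a time
  -- and hands them to the existing reports_clean update machinery:
  DELETE FROM reports_clean_queue
  WHERE uuid IN (
      SELECT uuid FROM reports_clean_queue
      ORDER BY queued_at
      LIMIT 1000
  )
  RETURNING uuid;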

Testing a bursty instance of adding "a new version of Flash":

  1. Fire up 20 processors that hit Stage
  2. Have them try to add XXXX new crashes with the new thing
  3. See what falls over!

Obstacle 2: exclusion of old releases

Due to the change in metadata structure, reports_clean currently only includes "new" releases, that is, post-rapid-release versions. Crashes for older versions, such as Firefox 3.6.25, are not included in reports_clean.

The only way to solve this is to wait it out. Firefox 3.6 goes EOL in April; at that point, supporting it in Socorro will no longer be a concern.

Obstacle 3: semi-garbage data

Since reports_clean is a normalized table, it has to exclude any crashes which don't fit the established specification of proper crash data. For example, an invalid release channel or beta buildID will cause a crash to be excluded from reports_clean. Currently, such garbage crashes are still visible via Advanced Search, but if we move search to reports_clean, they would not be.

The problem is that sometimes developers and analysts need to search the garbage crashes to determine various things, such as why the crash reports are broken. Also, release engineering and other teams sometimes make arbitrary and unheralded changes to release and/or breakpad data which can cause crashes to be temporarily seen as invalid by the system when they are not.

Dealing with this will require a multi-stage approach:

  1. The first part of dealing with this obstacle would be to keep 2 weeks of raw data in the Postgres database, allowing us to manually search and reprocess garbage crashes for up to 2 weeks.
  2. A second part would be to do regular monitoring of the number of excluded crashes, and send an alert whenever the number spikes, indicating an upstream problem (a sample monitoring query is sketched after this list). A weekly report giving breakdowns of excluded crashes would also be helpful.
  3. Thirdly, we should provide analysts with some method of searching HBase in order to do analysis on garbage crashes.
  4. Optionally, we could provide a simplified search interface for searching the raw reports data in Postgres for the last 2 weeks.
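
A sketch of the monitoring check described in step 2, assuming both tables carry the crash UUID:

  -- Count crashes processed in the last 24 hours that never made it into
  -- reports_clean. Column names are assumptions about the schema.
  SELECT count(*) AS excluded_crashes
  FROM reports r
  WHERE r.date_processed > now() - interval '24 hours'
    AND NOT EXISTS (
        SELECT 1 FROM reports_clean rc
        WHERE rc.uuid = r.uuid
    );
  -- An alert would fire whenever excluded_crashes spikes well above its
  -- normal baseline.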

Obstacle 4: additional search fields

Once reports_clean becomes our canonical source of data, we need to include in it all fields and dependent information which might need to be searched. While the majority of existing search fields are already represented in reports_clean, there are a few which are not and would need to be dealt with. This will use up some of the storage space we are saving by truncating the raw data tables.

  • Branch: branches are deprecated, so this field does not need to be supported; it would be removed from the interface.
  • Plugins: reports_clean would need a normalized child table, plugins_clean, maintained alongside reports_clean (a sketch follows below).
  • Extensions: as with plugins, above.

We would also need to add some additional indexes to support some of the search fields.
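
A hypothetical sketch of what the plugins_clean child table might look like; the table, columns, and index are assumptions, not a design:

  -- Hypothetical sketch only; names and columns are assumptions.
  -- One row per plugin involved in a crash, keyed back to reports_clean
  -- by uuid and maintained by the same batch job.
  CREATE TABLE plugins_clean (
      uuid           text NOT NULL,
      date_processed timestamptz NOT NULL,
      plugin_id      int NOT NULL REFERENCES plugins (id),
      plugin_version text
  );

  CREATE INDEX plugins_clean_plugin_date
      ON plugins_clean (plugin_id, date_processed);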

PostgreSQL 9.2

PostgreSQL 9.2 will support index-only scans. This will speed up many searches against the Postgres database by up to 10X without any changes to application code.

Since 9.2 is not even in beta yet, this strategy cannot be utilized until September 2012 at the earliest.