Discussion of marking (and eventually removing) duplicate reports in Socorro:
Things we care about
- present reports with and without duplicate counts
- show a duplicate flag
- show how many probable duplicates a report has
- filter out duplicates and remove them once we're confident
Things we don't care about
- Selecting an "original" report which is the very first in clock time (as opposed to seconds or minutes later)
- Having duplicates flagged immediately on hitting the database
- Having more than one duplicate-finding algorithm at the same time
- Duplicates stored in a separate partitioned table, report_duplicates
- Duplicates gathered by an asynchronous process which runs a few minutes after the reports come in (how?)
- Duplicate count stored ... where? In the reports table?
- All matviews have two versions, matview and matview_dedup.
- Requires radical changes to the reports table.
- Duplicate reports don't get stored, we just increment the dup_count column.
- Or maybe we just don't display or count them?
- Only one set of matviews.
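
To make the separate-table idea more concrete, here is a minimal sketch of what an asynchronous duplicate pass over a batch of recently arrived reports might look like. Everything specific in it is an assumption for illustration: these notes do not say how duplicates are detected, so the matching key (signature plus a hypothetical client_key), the five-minute window, and the names Report, find_duplicates, dup_counts and signature_counts are all made up. The mapping it returns stands in for rows of report_duplicates, and the dedup flag stands in for the matview versus matview_dedup distinction.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Report:
    uuid: str
    signature: str
    client_key: str      # hypothetical: whatever ends up identifying the sender
    received: datetime


def find_duplicates(reports, window=timedelta(minutes=5)):
    """Return {duplicate_uuid: original_uuid} for one batch of reports.

    Assumed rule (not from the notes): two reports are probable duplicates
    when they share signature and client_key and the later one arrived
    within `window` of the report currently treated as the original.
    """
    groups = defaultdict(list)
    for r in reports:
        groups[(r.signature, r.client_key)].append(r)

    duplicate_of = {}
    for group in groups.values():
        group.sort(key=lambda r: r.received)
        original = group[0]
        for r in group[1:]:
            if r.received - original.received <= window:
                duplicate_of[r.uuid] = original.uuid
            else:
                original = r  # too far apart: treat as a new original
    return duplicate_of


def dup_counts(duplicate_of):
    """Count probable duplicates per original report."""
    counts = defaultdict(int)
    for original_uuid in duplicate_of.values():
        counts[original_uuid] += 1
    return dict(counts)


def signature_counts(reports, duplicate_of, dedup=False):
    """Per-signature totals, with or without flagged duplicates."""
    counts = defaultdict(int)
    for r in reports:
        if dedup and r.uuid in duplicate_of:
            continue
        counts[r.signature] += 1
    return dict(counts)
```

Because the pass only records which reports are duplicates and of what, the reports themselves stay in place: the same data can be presented with or without duplicates, and a different matching rule can be swapped in later without touching stored reports.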
Stuff to think about
- When we go to full-throttle processing, the number of probable duplicates will go up.
- If we don't store processed duplicates, then we can't recover them later.