Report duplicates

From MozillaWiki

Discussion for marking (and eventually removing) duplicate reports from Socorro:

bug 579136

Things we care about

  • present reports with and without duplicate counts
  • show a duplicate flag
  • show how many probable duplicates a report has
  • filter out duplicates and remove them once we're confident
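
To make the "with and without duplicate counts" requirement concrete, here is a minimal sketch in Python. The field names `signature` and `is_duplicate` are assumptions for illustration only; the actual Socorro schema may name or store the duplicate flag differently.

```python
from collections import Counter


def signature_counts(reports, include_duplicates=True):
    """Count reports per signature, optionally skipping flagged duplicates.

    Each report is a dict with hypothetical keys 'signature' and
    'is_duplicate'; neither name is taken from the real schema.
    """
    return Counter(
        r["signature"]
        for r in reports
        if include_duplicates or not r["is_duplicate"]
    )


sample = [
    {"signature": "js::GC", "is_duplicate": False},
    {"signature": "js::GC", "is_duplicate": True},
    {"signature": "nsFoo::Bar", "is_duplicate": False},
]

with_dups = signature_counts(sample)
without_dups = signature_counts(sample, include_duplicates=False)
```

The same flag drives both views, so filtering duplicates out is a presentation choice until we are confident enough to remove them.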

Things we don't care about

  • Selecting an "original" report which is the very first in clock time (as opposed to seconds or minutes later)
  • Having duplicates flagged immediately on hitting the database
  • Having more than one duplicate-finding algorithm at the same time

Short Term

  • Duplicates stored in a separate partitioned table, report_duplicates
  • Duplicates gathered by an asynchronous process which runs a few minutes after the reports come in (how?)
  • Duplicate count stored ... where? In the reports table?
  • All matviews have two versions: matview and matview_dedup.
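
One possible shape for the asynchronous pass is sketched below in Python. It is only an illustration under stated assumptions: the `Report` class, the `fingerprint` field (whatever combination of columns ends up defining a probable duplicate), and the five-minute settle delay are all hypothetical, and the output rows merely stand in for what might be written to report_duplicates.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Report:
    uuid: str
    fingerprint: tuple      # hypothetical: fields that define a duplicate
    received_at: datetime


def find_duplicates(reports, now, settle=timedelta(minutes=5)):
    """Group settled reports by fingerprint and mark all but one as duplicates.

    Returns (duplicate_rows, dup_counts), where duplicate_rows is a list of
    (duplicate_uuid, original_uuid) pairs and dup_counts maps each original
    uuid to its number of probable duplicates.
    """
    groups = defaultdict(list)
    for r in reports:
        # Skip reports that arrived too recently; they wait for the next run.
        if now - r.received_at >= settle:
            groups[r.fingerprint].append(r)

    duplicate_rows = []
    dup_counts = {}
    for group in groups.values():
        # The "original" is simply the first one seen, not the earliest
        # in clock time, which the notes above say we don't care about.
        original, *dups = group
        dup_counts[original.uuid] = len(dups)
        duplicate_rows.extend((d.uuid, original.uuid) for d in dups)
    return duplicate_rows, dup_counts


now = datetime(2010, 8, 1, 12, 0)
reports = [
    Report("a", ("sig1", "build1"), datetime(2010, 8, 1, 11, 50)),
    Report("b", ("sig1", "build1"), datetime(2010, 8, 1, 11, 51)),
    Report("c", ("sig2", "build1"), datetime(2010, 8, 1, 11, 52)),
    Report("d", ("sig1", "build1"), datetime(2010, 8, 1, 11, 59)),  # too recent
]
rows, counts = find_duplicates(reports, now)
```

In this sketch the duplicate count lives alongside the original's uuid, which leaves open whether it is ultimately stored in the reports table or elsewhere.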

Long Term

  • Requires radical changes to the reports table.
  • Duplicate reports don't get stored; we just increment the dup_count column.
    • Or maybe we just don't display or count them?
  • Only one set of matviews.
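
The long-term idea of incrementing dup_count instead of storing the duplicate can be sketched as follows. This is an in-memory Python illustration, not a proposal for the actual table layout: `DedupingStore`, `ingest`, and the `fingerprint` key are hypothetical names.

```python
class DedupingStore:
    """Keeps one row per fingerprint and counts later arrivals as duplicates."""

    def __init__(self):
        # fingerprint -> {"uuid": ..., "dup_count": ...}
        self.rows = {}

    def ingest(self, uuid, fingerprint):
        """Store a new report, or bump dup_count if one already matches.

        Returns True if the report was stored, False if it was counted
        as a duplicate and discarded.
        """
        row = self.rows.get(fingerprint)
        if row is None:
            self.rows[fingerprint] = {"uuid": uuid, "dup_count": 0}
            return True
        row["dup_count"] += 1  # the duplicate itself is not kept
        return False


store = DedupingStore()
stored_first = store.ingest("a", ("sig1", "build1"))
stored_second = store.ingest("b", ("sig1", "build1"))
stored_third = store.ingest("c", ("sig2", "build1"))
```

Note that report "b" is gone after ingestion: only the counter records that it existed, so a discarded duplicate cannot be recovered later.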

Stuff to think about

  • When we go to full-throttle processing, the number of probable duplicates will go up.
  • If we don't store processed duplicates, then we can't recover them later.