User:Jesse/NewFrecency

From MozillaWiki
Jump to: navigation, search

Frecency is a measure that combines frequency and recency.

This page describes a frecency measure based on exponential decay, and a way to store and update this measure efficiently.

Problems with the old algorithm

https://developer.mozilla.org/en/The_Places_frecency_algorithm

The use of buckets and sampling cause the following problems:

  • Questionable predictions
    • Non-monotonic: A series of low-value visits can make the computed frecency decrease. (In order to estimate the total value of all visits, the algorithm samples the visit types of the last 10 visits.)
    • Discontinuous: There is no reason to believe something special happens 4 days after a visit.
  • Performance
    • It requires periodic recomputation.
      • Idle-daily recomputation already uses exponential decay (implemented in bug 476299)
      • On-visit recomputation does not use exponential decay.
  • Memory, privacy
    • The algorithm requires storing storing details about multiple visits.

Proposed new definition

See also http://mathb.in/708, which is the same idea, but perhaps a simpler way of going about it. See also also http://mathb.in/713, which describes how to do a frecency update without bignums.

The new definition is based on exponential decay. It simplifies the algorithm by removing the need for buckets, sampling, and recomputation. It improves the predictions by making them monotonic and continuous.

  • Decay rate constant
    • λ ≡ (ln 2) / (30 days)
  • Current value of a visit
    • current_visit_value ≡ visit_type_points * e^(-λ * visit_age)
  • URL's frecency score
    • s ≡ ∑ current_visit_value for all visits

Both visit scores and total frecency scores decay with a half-life of 1 month. If typing a URL is worth 2 points, then a typed visit a month ago is worth 1 point.

That takes care of the buckets and sampling. It could be implemented by storing the URL frecency score, and having an idle-daily job that multiplies all scores by (e ^ (-(ln 2) / 30)) = 0.977.

Efficient computation

But with an additional trick, no recomputation is necessary. The trick is to store in the database something with units of date.

  • URL's time until s=1
    • t = (ln s) / λ
    • s = e^(λt)

With this trick, we can just store now() + t in the database, and we're done.

  • No recomputation is necessary. Decay happens implicitly as t is stored and computed relative to different now() dates.
  • Sorting by (current) frecency score is equivalent to sorting by reverse date (on which the frecency will be 1).
  • Adding a visit just requires converting the date to a score, adding to the score, and converting the score back into date.
  • It is no longer necessary to store the visit. If want want, we can just update the last-visit-date and frecency-one-date of the URL without adding another row to the database.

What's kept from the current algorithm

  • Visit type bonuses
  • The half-life of 30 days is approximately taken from the current algorithm.

Possible disadvantages of the new algorithm

  • Requires floating point math
  • Doesn't emulate the "slowing decay" of the current bucket weights.
    • Could be emulated by having two exponential decays with different rates, if desired.

Implementation