User:Jesse/NewFrecency
== Problems with the current algorithm ==
https://developer.mozilla.org/en/The_Places_frecency_algorithm

The use of buckets and sampling causes the following problems:

* '''Questionable predictions'''
** Non-monotonic: A series of low-value visits can make the computed frecency decrease. (In order to estimate the total value of all visits, the algorithm samples the visit types of the last 10 visits.)
** Discontinuous: There is no reason to believe something special happens 4 days after a visit.
* '''Performance'''
** It requires periodic recomputation, which I believe currently happens during an idle-daily event.
* '''Memory, privacy'''
** The algorithm requires storing details about multiple visits.

== Proposed new definition ==
The new definition is based on exponential decay. It simplifies the algorithm by removing the need for buckets, sampling, and recomputation. It improves the predictions by making them monotonic and continuous.

* ''Decay rate constant''
** '''λ''' ≡ (ln 2) / (30 days)
* ''Current value of a visit''
** '''current_visit_value''' ≡ visit_type_points * e^(-λ * visit_age)
* ''URL's frecency score''
** '''s''' ≡ ∑ current_visit_value ''for all visits''

Both visit scores and total frecency scores decay with a half-life of 1 month. If typing a URL is worth 2 points, then a typed visit a month ago is worth 1 point.
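As a concrete sketch of these definitions (assuming timestamps in seconds; the helper names are mine, not from the Places code):

<pre>
import math
import time

DAY = 86400.0                        # seconds per day
LAMBDA = math.log(2) / (30 * DAY)    # decay rate constant, per second

def current_visit_value(visit_type_points, visit_age_seconds):
    """Value of one visit, decayed exponentially by its age."""
    return visit_type_points * math.exp(-LAMBDA * visit_age_seconds)

def frecency(visits, now=None):
    """s = sum of current_visit_value over (points, visit_time) pairs."""
    now = time.time() if now is None else now
    return sum(current_visit_value(p, now - t) for p, t in visits)

# A typed visit worth 2 points, made 30 days ago, is worth 1 point now:
assert abs(current_visit_value(2.0, 30 * DAY) - 1.0) < 1e-9
</pre>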

That takes care of the buckets and sampling. It could be implemented by storing each URL's frecency score and running an idle-daily job that multiplies all scores by e^(−(ln 2) / 30) ≈ 0.977.
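For illustration, that naive approach might look like this (the in-memory <code>scores</code> mapping is hypothetical; a real implementation would update rows in the Places database):

<pre>
import math

# Daily decay factor: e^(-(ln 2)/30) = 2^(-1/30) ≈ 0.977
DAILY_DECAY = math.exp(-math.log(2) / 30)

def idle_daily_job(scores):
    """Multiply every stored frecency score by the daily decay factor."""
    for url in scores:
        scores[url] *= DAILY_DECAY
</pre>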

But with an additional trick, ''no recomputation is necessary''. The trick is to store something in the database that has the units of a date.

* ''URL's time until s=1''
** '''t''' = (ln s) / λ
** s = e^(λt)

With this trick, we can just store '''now() + t''' in the database, and we're done.
* No recomputation is necessary. Decay just happens.
* Sorting by (current) frecency score is equivalent to sorting by that stored date in reverse (the date on which the frecency will be 1).
* Adding a visit just requires converting the stored date to a score, adding the visit's points to the score, and converting the score back into a date (see the sketch after this list).
* It is no longer necessary to store the visit. If we want, we can just update the last-visit-date and frecency-one-date of the URL without adding another row to the database.
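Here is a sketch of the whole round trip, assuming timestamps in seconds since the epoch (a brand-new URL would simply store <code>score_to_date(points)</code> for its first visit):

<pre>
import math
import time

DAY = 86400.0
LAMBDA = math.log(2) / (30 * DAY)

def score_to_date(s, now=None):
    """The date at which a score of s will have decayed to 1:
    now + t, where t = (ln s) / lambda."""
    now = time.time() if now is None else now
    return now + math.log(s) / LAMBDA

def date_to_score(frecency_one_date, now=None):
    """Recover the current score from the stored date: s = e^(lambda*t)."""
    now = time.time() if now is None else now
    return math.exp(LAMBDA * (frecency_one_date - now))

def add_visit(frecency_one_date, visit_type_points, now=None):
    """Date -> score, add the new visit's points, score -> date."""
    now = time.time() if now is None else now
    s = date_to_score(frecency_one_date, now) + visit_type_points
    return score_to_date(s, now)
</pre>

Note that ln s is negative when s < 1, so the stored date can lie in the past; the conversions still work.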

== What's kept from the current algorithm ==
* Visit type bonuses
* The half-life of 30 days roughly matches the decay rate of the current algorithm.

== Possible disadvantages of the new algorithm ==
* Requires floating point math
* Doesn't emulate the "slowing decay" of the current bucket weights.
** Could be emulated by having two exponential decays with different rates, if desired (see the sketch below).
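A sketch of that variant, storing one frecency-one date per decay rate (the 180-day slow half-life is purely illustrative):

<pre>
import math
import time

DAY = 86400.0
LAMBDA_FAST = math.log(2) / (30 * DAY)    # 30-day half-life
LAMBDA_SLOW = math.log(2) / (180 * DAY)   # illustrative slower half-life

def combined_score(fast_date, slow_date, now=None):
    """Sum of two independently decaying components, each stored as its
    own frecency-one date."""
    now = time.time() if now is None else now
    return (math.exp(LAMBDA_FAST * (fast_date - now)) +
            math.exp(LAMBDA_SLOW * (slow_date - now)))
</pre>

With two components, though, sorting by a single stored date no longer reproduces the frecency order.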