User:Sspitzer/GlobalFrecency

From MozillaWiki
Jump to: navigation, search

where to find the builds

https://build.mozilla.org/tryserver-builds/2008-01-10_16:09-sspitzer@mozilla.com-1200010016/

Big picture view of the changes

1) unvisited bookmarks will appear in url bar autocomplete (ac) results. this means if you clear private data, if you have bookmarks, url bar ac still has something to return upon typing.

2) results are sorted by frecency globally, which should increase the quality of the results. for example:

if a highly ranked (by frecency) site was visited more than 4 days ago, unlike before, it will appear in the first set of results. so say you visit http://wiki.mozilla.org/WeeklyUpdates once a week for the past year, but it's been 7 days since your last visit. before, because it was more than 4 days old, it would not be in the first chunk of results.

Additional details

1) Unlike before, unvisited bookmarks should appear in ac results

2) Unvisited livemark items should not appear, nor should "place:" URLs.

3) Upon first run, I have to add the frecency column to the moz_places table, create an index for that column, and make sure that livemark items and "place:" urls get a frecency of 0, and it should not appear in the ac results. Another way for a place to have a 0 frecency is if the url only has "embedded" visits.

5) If I don't know the frecency of a place, the value is -1. This is what I call an "invalid" frecency. If something has an invalid frecency, it will show up in the ac results.

6) the url bar drop down shows "typed" sites, ordered by frecency descending.

7) When inserting a bookmark, we attempt to calculate a frecency for it. This will impact the performance of bookmark import and also fx 2 - > fx 3 migration. (spin off bug coming about how to deal with it.)

8) for how we calculate a frecency for a site, see http://wiki.mozilla.org/User:Mconnor/PlacesFrecency (and option 3). I use the 10 most recent visits, and this is pref controlled, as all are the buckets, weights and bonus values.

9) if we don't have any visits for a site, I make an attempt to estimate the frecency. more on this in a spin off bug (including when we should estimate and when we should not.)

10) After only 5 seconds of idle, I'll attempt to recalculate the top 100 invalid frecencies (ordered by visit count) and the top 100 "old" frecencies (order by frecency descending but the last visit date is older than 4 days). 4 days is not hard coded, it's really older than the first "bucket".

The 5 second number will increase and the 100 number will decrease before I land.

11) after clearing all private data, all remaining places get the frecency of -1, except for livemark items and "place:" urls (which should get frecency 0).

current list of prefs to tune

The names and default values will most likely be changing. I wouldn't recommend tuning these yet. I'll elaborate on what each pref does later on in this document.

// the (maximum) number of the recent visits to sample
// when calculating frecency
pref("browser.frecency.numVisits", 10);

// buckets (in days) for frecency calculation
pref("browser.frecency.firstBucketCuttoff", 4);
pref("browser.frecency.secondBucketCuttoff", 14);
pref("browser.frecency.thirdBucketCuttoff", 31);
pref("browser.frecency.fourthBucketCuttoff", 90);

// weights for buckets for frecency calculations
pref("browser.frecency.firstBucketWeight", 100);
pref("browser.frecency.secondBucketWeight", 70);
pref("browser.frecency.thirdBucketWeight", 50);
pref("browser.frecency.fourthBucketWeight", 30);
pref("browser.frecency.defaultBucketWeight", 10);

// bonus (in percent) for visit transition types for frecency calculations
pref("browser.frecency.embedVisitBonus", 0);
pref("browser.frecency.linkVisitBonus", 120);
pref("browser.frecency.typedVisitBonus", 200);
pref("browser.frecency.bookmarkVisitBonus", 140);
pref("browser.frecency.downloadVisitBonus", 0);
pref("browser.frecency.permRedirectVisitBonus", 0);
pref("browser.frecency.tempRedirectVisitBonus", 0);
pref("browser.frecency.defaultVisitBonus", 0);
// bonus (in percent) for place types for frecency calculations
pref("browser.frecency.unvisitedBookmarkBonus", 140);
pref("browser.frecency.unvisitedTypedBonus", 200);

Main Bug

See https://bugzilla.mozilla.org/show_bug.cgi?id=394038

Spin off bugs

  • bug 412110 - don't bother calculating frecencies for place: urls (they should be 0)
  • bug 411591 - expose frecency as a sort order for place queries
  • bug 411845 - implement browser.frecency.version pref so that we can reset all the frecencies when we change / tune the algorithm
  • bug 412730 - problems reusing autocomplete results due to bookmark titles
  • bug 412734 - bookmarked visited livemark items might not be starred in autocomplete results
  • bug 412736 – in the case of a frecency tie, break it with h.typed and h.visit_count
  • 000000 - make frecency index a multi column index with typed, and visit_count?
  • 000000 - forcibly call the "on idle" timer after migration, import, page removal, etc
  • bug 412217 – treat TRANSITION_DOWNLOAD like we treat TRANSITION_EMBED visits


Additional notes (need to clean these up, log spin off bugs, etc). These are mostly for dietrich so he knows what I've done and why and the known issues.

0)

<dietrich> / something bookmarked and typed will have a higher frecency than <dietrich> / something just typed or just bookmarked. <dietrich> you mean here that something bookmarked/typed longer ago is weighted more than something recently bookmarked/typed?

We might call nsNavHistory::CalculateFrecencyInternal() with a place id of -1, which means the place doesn't exist yet.

One example of this when inserting a bookmark.

1)

unvisited bookmarks are no longer hidden, but unvisited livemark items are. this happens in nsNavBookmarks::InsertBookmark(), where we call UpdateFrecencyAndHiddenForBookmark(), but only if the parent is not a livemark

2)

when you remove a bookmark, should the frecency change for that place? yes, it should, as we gave weight to bookmarks. RemoveItem() calls UpdateBookmarkHashOnRemove() with the place id, so we do this work there. Note, if you remove a folder, we will eventually call RemoveItem(), so that's how this works with removing folders.

3)

spin off bug:

// UpdateBookmarkHashOnRemove() does a sanity check using 
// IsBookmarkedInDatabase(),  so it might not have actually
// removed the bookmark.  should we have a boolean out param
// for if we actually removed it, and use that to decide if we call
// UpdateFrecency() and the rest of this code?

4)

on calculating frecency, if we don't know ahead of time, we need to check that for a place id, if it is bookmarked, but not a livemark. see CalculateFrecency() [not CalculateFrecencyInternal()]

5)

spin off bug:

upon moving a bookmark, update the frecency. for unvisited bookmarks, this bumps the frecency, because we use PR_Now() as the pseudo visit, which is desired, as we're expressing interest. For visited bookmarks, it may age the bookmark and lower frecency. Is that desired? for now, punt on this and for annotations for updating frecencies.

5)

issue: if when autocompleting, we will prefer moz_place title over bookmark title, and so when showing match, we will show that one. because of the fix for bug #407292 – When adding a bookmark with no title, we should use the uri as the title, you might get the uri as the title if it matches the user text, as we prefer it.

6)

spin off bug:

we recalc twice on idle, one for invalid and one for old frecencies. need marco's last_visit_date change to make old work right, or do a subquery until then. right now, we are going to recalc the same places over and over. (if we do a sub query, watch out for performance)

recalculating for invalid (frecency == -1) is for firefox 2 / 3b2 -> b3

recalculating for old frecencies is for the ebay problem (https://bugzilla.mozilla.org/show_bug.cgi?id=394038#c19)

to solve the ebay problem, we should also recalc the moz_places with high frecency and "old" last visit date. (In another patch, marco has added last visit date to moz_places, which will make the query we need more efficient as we don't have to join against moz_historyvisits)

If something has a recent last visit date, then we recently recalculated frecency, so the frecency would be accurate.

If something has a low frecency and an old last visit date, recalculating would only lower the frecency (and low frecency wouldn't be hurting our autocomplete results.)

But if something has a high frecency and an old last visit date, recalculating the frecency would lower the frecency, which would solve the ebay problem.

7)

when updating places with frecency of -1, we call CalculateFrecency(), which will check if the place is a livemark item (only), and if so, use PR_FALSE for is bookmarked. we need to do all this so that upon calculating frecencies for previously hidden livemark items, we don't unhide them. (They should come back with frecency of zero if they were never visited.)

8)

in previous versions of the patch, I had hidden <> 1 in the check for invalid frecencies. we don't need to (nor want to) limit recalculations to non hidden. what we currently do is look for anything with frecency of -1, and recalc (on idle). if frecency goes to > 0, we can unhide it. this will cause old, hidden bookmarks to be unhidden and start showing up in url bar autocomplete, yet keep hidden, unvisited livemark items hidden.

note, we have had issues in the past where unvisited bookmarks are hidden. (see bug #369887)

with a new bookmark, we create a moz_place for it, with hidden = 1. if you were to visit that bookmark, we would set hidden to 0, even if you later cleared all visits.

for new bookmarks, we will be calculating the frecency (and unhiding) when we call InsertBookmark()

9)

I've made prefs for everything, so we can easily tune it. note, for download, and redirect transitions, and 0 (undefined, see bug #xxxxxx) the weight is 0. (will anything bad happen if we go negative? I don't think so.)

10)

when do we recalc frecencies for places I don't revisit? on expiration, but that could be a long time. we also recalc on idle (after first bucket days), ordered by high frecency.

11)

like unvisited livemark items, place: queries should not show up in autocomplete results. using IsQueryURI(), I check if the place url is place:..., and if so, pass in false for isBookmark to CalculateFrecency() so that they stay hidden until visited (or in the case of place: urls, always.) because the frecency is zero (passing in PR_TRUE for isBookmark would result in non-zero frecency), we will also no recalculate frecency on idle. only upon visit.

12) on clear all data, resetting frecency to -1

13) on expiration, I reset frecency to -1

on expiration (EraseVisits()), for every place we deleted visits, visit_count and frecency is reset.

in the code that recalcs frecency, if old frecency is -1, we recalc visit count and update it.

14) on delete visits, see nsNavHistory::RemovePage() and nsNavHistory::RemovePagesFromHost(), set frecency = -1

15)

add frecency <> 0 to ac queries because...

16)

add code so that for all non place: and unvisited children of livemarks we set frecency to 0, so that we can use it after migration, clear all private data, etc

17)

are there any undesirable side effects of clearing hidden after recalculating frecency? I don't think so, as visiting a bm would unhide it, and then you could always clear private data or remove the visit to have the same effect.

18) note, bookmarks also no in open location and pref page autocomplete (all history autocomplete)


19) after v7 schema migration, run fix it method so place: and unvisited livemark items don't show up, as our migration sets frecency to -1

20)

for the code that resets to -1, better to fix existing code skip over place we will reset to 0, or do what I'm doing and query and reset to 0. note, need the fix it code for after migration, so this keeps it simple. need to verify if my query is a perf issue with lots of history. (less concerned after marco's index fix, however.)

21)

for the ac queries, sort by frecency, then typed, then visit_count because we might not have frecency. in the case of 3b2 migration or clear all private data, lots of places have frecency = -1 (until idle), so we will have lots of frecency ties, so use typed and visit_count to break the ties and provide better results.

22)

should we add "AND h.frecency <> 0" to our autocomplete queries?

yes, because take the following scenario:

a) an unvisited livemark items get frecency of 0, since not a bookmark
b) visit that livemark item, frecency of non 0, shows up in ac, no longer hidden since frecency is non 0.
c) clear history, livemark item no longer hidden, would should up in ac until we recalc on idle
d) I added code FixInvalidFrecenciesForExcludedPlaces() so that we will reset place: and unvisited livemark item frecency to 0 after clear history, expire, etc to solve this problem.

what about migration or first run? as long as we call FixInvalidFrecenciesForExcludedPlaces() we are good.

why <> 0 and not > 0. We want invalid frecencies to show up in ac, as those are bookmarks, migrated places, places after clear all private data, etc before we got a chance to recalculate.

23)

from the ac queries, why remove hidden <> 1

unvisited bookmarks from previous versions will be hidden until visited, but we want them to show up in ac. for stuff we really want hidden, set frecency to 0, like place: and unvisited livemark items. (is there more?)

24)

watch out, don't calculate frecency on history import. instead, set to -1 and come back later on idle.

nsNavHistory::AddPageWithVisit(), passing PR_FALSE for aCalculateFrecency to InternalAddNewPage()

25)

improved mDBOldFrecnecy query, explain why to deitrich

26)

explain this:

XXX frecency for 2745 [1] is [3280] XXX recalc for 2745 [2] is 32800 vs 3280 and 164 vs 164

   // not the same logic above, as a single visit could not both
   // a bookmark visit and a typed visit.  but when estimating a frecency
   // for a place that doesn't have any visits, this will make it so
   // something bookmarked and typed will have a higher frecency than
   // something just typed or just bookmarked.
   if (aIsBookmarked)
     bonus = mUnvisitedBookmarkBonus;
   if (aTyped) 
     bonus += mUnvisitedTypedBonus;

   // assume "now" as our ageInDays, so use the first bucket
   // note, when we recalculate "old frecencies" (see mDBOldFrecencies)
   // this frecency value could be off by an order of 
   // (mFirstBucketWeight / mDefaultBucketWeight)
   pointsForSampledVisits = mFirstBucketWeight * ((bonus + 100.0) / 100.0);

27)


log bug about no star if something bookmarked and livemarked bug

> // NOTE: because we are grouping by h.id, b.id and b.parent > // get collapsed, so if something is both a livemark and a bookmark > // we might not show it as a "star" if the parentId we return is > // the one for the livemark item, and not the bookmark item. > // XXX todo: log a spin off bug on this issue.

28) had to move EnsureCurrentSchema() to after table creation, or we'd bomb with new profiles, and I've removed the transaction

29) would like to order autocomplete by frecency DESC, typed DESC visit_count DESC to break ties, but this is slow, even with an index. (spin off bug)

30) comment about why we add typed bonus to bookmark bonus for frecency of

   // not the same logic above, as a single visit could not both
   // a bookmark visit and a typed visit.  but when estimating a frecency
   // for a place that doesn't have any visits, this will make it so
   // something bookmarked and typed will have a higher frecency than
   // something just typed or just bookmarked.
   if (aIsBookmarked)
     bonus = mUnvisitedBookmarkBonus;
   if (aTyped)
     bonus += mUnvisitedTypedBonus;

31)

email drhipp about indexing and SQLite LIKE contain searches or find the doc about how that doesn't work (marco's comments, too)

32)

we use last <n> visits. if all <n> are embed, but earlier visits were not, when we calculate frecency, we could end up with zero, even though the n+1 visit is typed. is this a spin off bug (I think so). we could make sure that if a place as at least 1 non 0,4 visit, we give it a non-zero frecency, or some other idea to prevent this bug.

32) add comment about why we can't trust visit_count (from older versions) (we counted embed!)

todo

1)

<dietrich> / something bookmarked and typed will have a higher frecency than
<dietrich> / something just typed or just bookmarked.
<dietrich> you mean here that something bookmarked/typed longer ago is weighted more than something recently bookmarked/typed?  

You wrote "longer ago", but in this case, these places have no visits. If we had visits, we would not hit this code. We would have returned here:

  if (numSampledVisits) {
     *aFrecency = (PRInt32) NS_ceilf(aVisitCount * pointsForSampledVisits / numSampledVisits);
     return NS_OK;
   }

I am having a hard time explaining this with words, so let me try this. The current code gives us the following:

f(zero visits, bookmarked, typed) > f(zero visits, bookmarked, not typed) > f(zero visits, not bookmarked, typed) = f(zero visits, not bookmarked, not typed) = 0

Where f() is the frecency calculation function.

Here's how you can end up with these four scenarios:

zero visits, bookmarked, typed: type in http://www.google.com and then bookmark it. clear all history, giving it a frecency of -1. when we recalc on idle, we will give it a frecency of 340 (bookmark bonus + typed bonus, see the browser.frecency.unvisited* prefs). We have no visits, but since it is a bookmark, we want a non-zero frecency.

In irc where I wrote: " while working on the answer to your question above, I noticed a small issue in that code, I'll answer it in a spin off bug as soon as I confirm the issue." This is what I was referring to. I think we should be returning 200 here, instead of 340.

The change would be:

Replace:

 // not the same logic above, as a single visit could not both
 // a bookmark visit and a typed visit.  but when estimating a frecency
 // for a place that doesn't have any visits, this will make it so
 // something bookmarked and typed will have a higher frecency than
 // something just typed or just bookmarked.
 if (aIsBookmarked)
   bonus += mUnvisitedBookmarkBonus;
 if (aTyped)
   bonus += mUnvisitedTypedBonus;

with:

 // assuming mUnvisitedTypedBonus > mUnvisitedBookmarkBonus
 // this makes it so an unvisited, typed bookmark frecency > unvisited, untyped bookmark frecency
 if (aTyped)
   bonus = mUnvisitedTypedBonus
 else if (aBookmarked)
   bonus = mUnvisitedBookmarkBonus;

zero visits, bookmarked, not typed: create a new bookmark by right clicking on a url that you've never visited. frecency will be 140 (browser.frecency.unvisitedBookmarkBonus)

zero visits, not bookmarked, typed: type in http://www.google.com, and annotate it but don't bookmark it. clear all private data, giving it a frecency of -1. on idle, we'd give this thing a value of 0 because visit count is 0.

zero visits, not bookmarked, not typed: this will give us a frecency of 0, which is desired.


x) after fx 2 / fx 3b2 migration, force a few idles? do a massive frecency recalc? how long does each take? do a few on a timer, not on idle to improve the first impression

x) limit chunks to 20 or 40, as 20 is max results?

x)

 // XXX
 // on import / fx 2 migration, is the frecency work going to slow us down?
 // find out if we import history before or after bookmarks for fx 2
 // we might want to skip this stuff, as well as the frecency work
 // caused by GetUrlIdFor() which calls InternalAddNewPage()
 // if we do skip this, after import, we will   // need to call FixInvalidFrecenciesForExcludedPlaces()
 // we might need to call it anyways, if items aren't properly annotated
 // as livemarks feeds yet.

less of a deal as typical bookmarks are much fewer than history? not sure.

note, calling FixInvalidFrecenciesForExcludedPlaces() from from the places import / export code means adding it to the nsINavHistoryService interface

x)

what about the other occurrences of hidden <> 1 in nsNavHistory.cpp? see comment #23 about about why that is bogus. use frecency <> 0?

x)

should we get rid of the hidden column?

x)

for calc frecency, never calc 0 unless place: or unvisited livemark item, so do 1.

x)

for the code that resets to -1, better to skip over livemarks, or fix it afterwards like I'm doing right now?

x)

everything gets frecency of -1, until we visit something or recalc on idle. for fx2 migration, when we import history, we can recalc (some?) at the end. for fx3b2 migration, we can recalc some at the end.

and then wait for idle?

on clear all data (and bm import or restore? should we do a quick recalc of bookmarks? what if they have a lot of bookmarks? should part of our work on idle be to recalc bookmarks?

x)

when calc frecency, never calc 0 unless place: or unvisit livemark item, so do 1.

x)

note, not clearing visit count, so what does that mean for bug #394133

in the places organizer and history sidebar ui, visit count is lazily corrected (except for place: and for unvisited children of livemarks, because we use visit count in our query.

note, untrusted visit count is used in other places, like the smart bookmarks menu

x)

for visit count re-calculation, visit_type NOT in (0,4), is that correct?

x)

 // should we add "AND visit_type NOT IN (0,4)"
 // XXX aren't embed's affecting visit count already in ::AddVisit()?
 // I think we have a bug about not letting them do that.
 // XXX note, should we fix AddVisit() to recalc visit count by using
 // CalculateVisitCount()?

x)

 // should we recalc frecency for all non-livemark item, non place: query
 // bookmark items?  What if there are a lot?  Better to let it happen on idle?
 // if we add a frecency > 0 check to autocomplete, nothing will show up
 // we could have visited a livemark item, cleared history, which marked it
 // as non hidden, then, until recalculation on idle, it wills how up
 // in autocomplete.  we could recalc frecency for (or just make 0)
 // all place: uri and livemark items, and then do frecency <> 0
 // and keep hidden <> 1.  this way, we will show everything in ac
 // after clear all data that we've got, except for place: and livemark items

x)

2 -> 3 migration, frecency = -1, but do we have visit counts? force place: and unvisited livemarks to be zero, do one on idle, wait for rest? 3b2 - > 3 migraiton, frecency = -1, but we do have visit counts. force place: and unvisited livemarks to be zero, do on idle, wait for rest. clear all private data, frecency = -1, but we do have visit counts. force place: and unvisited livemarks to be zero, do on idle, wait for rest.

small area: partial expiration, frecency = -1 for a few places, but we do have visit counts. don't think we need to force place and unvisited livemarks, (maybe we do for unvisit livemaks) , do on idle, wait for rest.

small, but could be big: manual delete places, , but we do have visit counts. force place: and livemarks to be zero, do on idle, wait for rest.

..or whereever we end up with frecency = -1

on all of these, fire the idle timer / invalid frecency

don't reset visit count, use that for determining which places to do recalc first, but always need to recompute visitcount before recalc frecency

x)

do forced "on idle" work in "on visit"?

x)

verify we check oldVisitCount and frecency = -1 everywhere we should be.

x)

possible optimization, if not bookmark and no visit count, bail out early from frequency calculation?

x)

rename the isBookmark arg to isNonHiddenBookmark in the CalculateFrecency() methods, as it's really for non place:, non univisited livemark items?

x)

make sure that we aren't doing too much work when a livemark gets updated, due to removing from hash

x)

first time migration delay: from adding column with -1 to moz_places (acceptable, not much we can do)

x)

determine if we really need the index on frecency

x)

first time, all bogus (need to take the top <x> and recalc), on migration and import from fx 2

x)

decide if we should recalc frecency for bookmarks (remaining places?) after clear all data, or just wait until idle

x)

make sure not too much "on idle" cpu usage

x)

determine if we should do more (or less) invalid / old calc on idle

x)

pref for autocomplete chunk size and timeout (AUTOCOMPLETE_SEARCH_CHUNK_SIZE and AUTOCOMPLETE_SEARCH_TIMEOUT?)

x)

make sure we stop searching once we hit max results, or do we not know if we have hit max results in the backend?

x)

should url bar drop down be frecency or visit date? (keep visit date, so I need to fix my code to be the old way, but ask ux if they like by frecency?)

x)

we do need to recalculate frecency on history import?

x)

recalc on bm import / restore?

x)

XXX when removing duplicate ids, should we be resetting recency to -1?

x)

there might be a few more SQL queries we should precompile. we might be re-parsing SQL over and over on every key stroke.