Firefox/Projects/Places async expiration: Difference between revisions

No edit summary
 
(3 intermediate revisions by the same user not shown)
Line 16: Line 16:
== Related Bugs ==  
== Related Bugs ==  


{{bug|516940}}
{{bug|520165}}
{{bug|520165}}
== Feedback discussion ==
Leave your comments or ideas here please:
http://groups.google.com/group/mozilla.dev.apps.firefox/browse_thread/thread/421f6c7566499e4f#


== Team ==
== Team ==
Line 40: Line 46:
* over engineer things
* over engineer things


== Thoughts to be parsed ==
== Things to expire ==
 
Actually:
* visits: older than a date
* pages: without visits and bookmarks
* annotations: older than a date or orphans
* favicons: older than a date or orphans
* inputhistory: degrade and remove useless
 
In future:
* page thumbnails
 
== When to expire ==
 
* On startup: by no means, expiration should not hurt Ts in any way.
* In background: small chunks, adaptive to user habits
* During syncs: yes and no, we can do that to avoid some fsync, but actually it causes code to be hard to follow, more error prone, and makes sync transactions longer and more prone to corruptions. We also plan to add more expiration stuff in future, that will make things worse. Also we hope to remove temp tables and syncs in future but expiration will survive.
* On idle: larger chunks, but only once near idle start, we don't want to rely too much on idle because on mobile devices it could be an unreachable state. Also we don't want to hit idle hard to avoid stressing mobile and laptop batteries.
* On shutdown: avoid this at all if possible, but we have to take in count privacy.
* On pref changes: if the user changes expiration prefs, we should try to satisfy his choices. this could be done increasing the number of entries expired in background till we reach a good point of cleanup, then going back with usual expiration values.
 
== Visits expiration algorythm ==
 
Is current expiration algorythm too complex?
Current expiration has 2 limits, an hard limit (180 days) and a lazy limit (90 days). It also has a maximum number of pages limit.
We always clear any visit older than the hard limit (this works like the old FX2 setting). We clear visits between the 2 limits if we are over maximum number of pages allowed. we never clear visits before minimum limit.
 
in bug 425219 Connor said:
"Our original goal was to have 180 days of history.  We added the 40k visit
limit as a way of limiting overhead.  But for very heavy users, this shortened
the set of available history dramatically, so we added the minimum cap.  I
don't think a straight 90 day cap is ideal, and I don't really get why
"algorithim complexity" is something end users care about.  If it made expiring
history more painful, sure, we should look at that, but as far as I understand,
that's not the case.  The original bugs have lots of debate, but as far as I'm
concerned, this is working well and as intended, so marking WONTFIX.  Users who
really want to expire after N days can set history_expire_days to the same as
history_expire_days_min and be content."
 
But current algo has clear disadvantages:
 
* queries are more complex and slower than they could be, we have to run multiple queries to check if we are over limit, to check visits in a timeframe and so on.
* it's hard for users to set correct values and understand them, for example we have lots of reports saying "i changed my settings to 6 days because my computer can't handle all of that history but it's still slow" (indeed he changed only the lazy limit).
* it's harder for support to suggest changes to those limits.
 
while the advantages:
* if we can't clear all history fast enough we can't be blamed because user knows (maybe) we hold more history than what he set in preferences dialog.
* we don't have to evaluate how users use history, we set 2 limits and try to do more than expected.
* the pref is always satisfied (visits newer than lazy limit are never touched)
 
At that time we did not have lots of data to take decisions based on numbers, today we have Places Stats, so we can actually evaluate numbers better.


* is expiration algorythm too complex? This actually creates problem due to slower queries, we have an hard and a soft limit, plus a limit on number of pages, that is hardly hit with default settings. Could we go back to a "Retain a maximum of XXX pages for at last YYY days" preference? Would that have any drawback? Actually often users don't notice the _at least_ indication in preferences panel, and they think that setting 0 days will get rid of they history. The lazy limit was mostly added to evaluate how much history we could retain with Places, trying to be smarter toward users, but actually is just slowing us down, and confusing users.
We know we can handle far more than 90 days of history, and we know that the 40000 unique pages limit is hardly hit (avg number of pages is 23000). Places also has better performances today (and will get better).
We could actually move back to the simple hard limit, to have advantages:
* simple and fast queries (only 1 limit!)
* simple to understand for anyone


* Should we hook into PlacesDBFlush? actually expiration code hooks into it to run just before a flush, but this makes the code more complex and less readable/self-contained... If we are going to increase number of async expiration tasks and we plan to remove temp tables, could make sense to strip out this code path from the flusher.
Disadvantages:
* the pref could not be satisfied if we impose a page limit (so we expire some visit because we have too many pages)


* Actually we have a bunch of strange notifications, we have onDeleteURI but also onPageExpired. On the other side we don't have onDeleteVisit, why?
Actually the disadvantage could be fixed in two ways:
* discard any pages limit, or put one hard to hit (just-in-case) and not tweakable (we can't handle more than that, sorry!). Based on Places Stats avg number of pages is 23000, users that increased the number of days also increased the pages limit, so it's not really helpful.
* change the pref to "Save a maximum of XXX unique pages for at last YYY days", so it's clear to the user we will expire once one of the two conditions is hit.


* During batches often we don't notify, this was done for perf purposes, but really should be observers that ignore notifications during batches, not us that don't fire them. This makes hard for extensions and third party users to follow what happens. We should ensure to always notify changes (this does not happen on clearHistory since it's pretty clear that all history will be cleared, even if at that point you can't know if an uri has been removed or not, it could have been a bookmark. Should we notify onDeleteURI instead?).
== Who to notify ==


* If possible we should completely avoid or highly reduce expiration at shutdown. We must expire in background, adapting to the situation. Thus making shutdown expiration useless.
We have to fire notifications in a sane way when we expire pages (onDeleteURI vs onPageExpired).
On the other side notifying about each removed visit would be a perf hog (what happens when we expire 20000 visits?)
We should probably notify onDeleteURI when a page is removed, and onDeleteVisits when visits for an uri are expired but page is not expired. currently onPageExpired notification is an hybrid between the two.


* Idle can be a problem, if from one side it's pretty cool to be able to expire large bunches of informations on idle, on the other side we should consider some problematic. First user could be idle because watching a video, we should not hit him with large bunch of disk I/O. Also idle is a particular condition on mobile devices, could be some user will just open the browser, visit a page, and close the browser, in these situations idle won't ever be hit, so we should not rely to much on idle. Luckily in those situations the size of data to be managed is smaller.
Currently we avoid to notify page removals during large batch deletes, if that is not a problem for clearHistory (everything is removed), it's hard for third parties to follow pages, so we could evaluate to replace current notifications with multiple entries notifications (onDeleteURIs (list_of_uris), onDeleteVisits(list_of_uris) to notify just once.


* Expiration is not needed at startup, so we should try to init it later, but being part of toolkit we can't just use browser delayed startup. Maybe just add it as an history observer category, but start expiring later, on a timer.
Observers should ignore or cache notifications during batches, not us not sending them.

Latest revision as of 20:22, 19 October 2009

Summary

Decouple expiration from history service and make it completely async and adaptive to user habits (expire harder for heavy history users, less for casual users).

Current Status

Getting started with design.

Next Steps

  • Take decision on position and form of the code (service? component? module?)
  • Evaluate interactions with other components/services
  • Code it
  • Test it

Related Bugs

bug 516940 bug 520165

Feedback discussion

Leave your comments or ideas here please: http://groups.google.com/group/mozilla.dev.apps.firefox/browse_thread/thread/421f6c7566499e4f#


Team

  • Project Leader: mak
  • Reviewer: mano
  • Consultant: Places Team

Designs

Separate expiration from history service, make it async in background.

Goals/Use Cases

  • Work in background: no ui-locking
  • Avoid idle if possible: mobiles could hardly hit idle, it's unclear how often a mobile dev can use idle. In any case we should not loop in idle to avoid killing batteries (mobile and laptop)
  • simplify bloat history code
  • simplify and add coherency to expiration code
  • ensure notifications consistency

Non Goals

  • over engineer things

Things to expire

Actually:

  • visits: older than a date
  • pages: without visits and bookmarks
  • annotations: older than a date or orphans
  • favicons: older than a date or orphans
  • inputhistory: degrade and remove useless

In future:

  • page thumbnails

When to expire

  • On startup: by no means, expiration should not hurt Ts in any way.
  • In background: small chunks, adaptive to user habits
  • During syncs: yes and no, we can do that to avoid some fsync, but actually it causes code to be hard to follow, more error prone, and makes sync transactions longer and more prone to corruptions. We also plan to add more expiration stuff in future, that will make things worse. Also we hope to remove temp tables and syncs in future but expiration will survive.
  • On idle: larger chunks, but only once near idle start, we don't want to rely too much on idle because on mobile devices it could be an unreachable state. Also we don't want to hit idle hard to avoid stressing mobile and laptop batteries.
  • On shutdown: avoid this at all if possible, but we have to take in count privacy.
  • On pref changes: if the user changes expiration prefs, we should try to satisfy his choices. this could be done increasing the number of entries expired in background till we reach a good point of cleanup, then going back with usual expiration values.

Visits expiration algorythm

Is current expiration algorythm too complex? Current expiration has 2 limits, an hard limit (180 days) and a lazy limit (90 days). It also has a maximum number of pages limit. We always clear any visit older than the hard limit (this works like the old FX2 setting). We clear visits between the 2 limits if we are over maximum number of pages allowed. we never clear visits before minimum limit.

in bug 425219 Connor said: "Our original goal was to have 180 days of history. We added the 40k visit limit as a way of limiting overhead. But for very heavy users, this shortened the set of available history dramatically, so we added the minimum cap. I don't think a straight 90 day cap is ideal, and I don't really get why "algorithim complexity" is something end users care about. If it made expiring history more painful, sure, we should look at that, but as far as I understand, that's not the case. The original bugs have lots of debate, but as far as I'm concerned, this is working well and as intended, so marking WONTFIX. Users who really want to expire after N days can set history_expire_days to the same as history_expire_days_min and be content."

But current algo has clear disadvantages:

  • queries are more complex and slower than they could be, we have to run multiple queries to check if we are over limit, to check visits in a timeframe and so on.
  • it's hard for users to set correct values and understand them, for example we have lots of reports saying "i changed my settings to 6 days because my computer can't handle all of that history but it's still slow" (indeed he changed only the lazy limit).
  • it's harder for support to suggest changes to those limits.

while the advantages:

  • if we can't clear all history fast enough we can't be blamed because user knows (maybe) we hold more history than what he set in preferences dialog.
  • we don't have to evaluate how users use history, we set 2 limits and try to do more than expected.
  • the pref is always satisfied (visits newer than lazy limit are never touched)

At that time we did not have lots of data to take decisions based on numbers, today we have Places Stats, so we can actually evaluate numbers better.

We know we can handle far more than 90 days of history, and we know that the 40000 unique pages limit is hardly hit (avg number of pages is 23000). Places also has better performances today (and will get better). We could actually move back to the simple hard limit, to have advantages:

  • simple and fast queries (only 1 limit!)
  • simple to understand for anyone

Disadvantages:

  • the pref could not be satisfied if we impose a page limit (so we expire some visit because we have too many pages)

Actually the disadvantage could be fixed in two ways:

  • discard any pages limit, or put one hard to hit (just-in-case) and not tweakable (we can't handle more than that, sorry!). Based on Places Stats avg number of pages is 23000, users that increased the number of days also increased the pages limit, so it's not really helpful.
  • change the pref to "Save a maximum of XXX unique pages for at last YYY days", so it's clear to the user we will expire once one of the two conditions is hit.

Who to notify

We have to fire notifications in a sane way when we expire pages (onDeleteURI vs onPageExpired). On the other side notifying about each removed visit would be a perf hog (what happens when we expire 20000 visits?) We should probably notify onDeleteURI when a page is removed, and onDeleteVisits when visits for an uri are expired but page is not expired. currently onPageExpired notification is an hybrid between the two.

Currently we avoid to notify page removals during large batch deletes, if that is not a problem for clearHistory (everything is removed), it's hard for third parties to follow pages, so we could evaluate to replace current notifications with multiple entries notifications (onDeleteURIs (list_of_uris), onDeleteVisits(list_of_uris) to notify just once.

Observers should ignore or cache notifications during batches, not us not sending them.