Browser History: Difference between revisions

m
No edit summary
m (→‎Link coloring: grammar)
 
(38 intermediate revisions by 9 users not shown)
Line 1: Line 1:
Calculus I INTRODUCTION
<center>'''Note: The information on this page is likely out of date and is only kept for historical reasons.'''</center>


Calculus, branch of mathematics concerned with rates of change, gradients of curves, maximum and minimum values of functions, and the calculation of lengths, areas, and volumes. It is widely used, especially in science and engineering, wherever continuously varying quantities occur.  
== Introduction ==
The places history system is a redesign of the Firefox global history system using the new SQLite-based mozStorage APIs.  This system provides additional performance, flexibility, and querying capabilities.


II HISTORICAL DEVELOPMENT
See also:
* [[Places]]
* [[Annotations]]
* [[Firefox:2.0_Bookmarks]]


Calculus derives from ancient Greek geometry. Democritus calculated the volumes of pyramids and cones, probably by regarding them as consisting of infinitely many cross-sections of infinitesimal (infinitely small) thickness, and Eudoxus and Archimedes used the “method of exhaustion�?, finding the area of a circle by approximating it arbitrarily closely with inscribed polygons. However, difficulties with irrational numbers and the paradoxes of Zeno prevented a systematic theory developing. In the early 17th century Cavalieri and Torricelli extended the use of infinitesimals, while Descartes and Fermat used algebra for determining areas and tangents (integration and differentiation, in modern terms). Fermat and Barrow knew these two processes were closely related, and Newton (in the 1660s) and Leibniz (in the 1670s) proved the Fundamental Theorem of Calculus, that they are mutually inverse. Newton's discoveries, motivated by his theory of gravitation, preceded Leibniz's, but his delays in publishing them caused bitter priority disputes, and Leibniz's notation was eventually adopted.  
== Places history features ==
The major places history views will be queries over places visited by time. The user will be able to select any date range and see when they visited each page.  Actually showing which pages were visited during a given time period means we will have to store every time the user visits any page, rather than just the last visit time. This fixes some problems with the old history system where items would move around if you visited them again.


The 18th century saw widespread applications of calculus, but imprecise use of infinite and infinitesimal quantities and geometric intuition still caused confusion and controversy about its foundations, the philosopher Berkeley being a notable critic. The 19th-century analysts replaced these vague notions with firm foundations based on finite quantities: Bolzano and Cauchy defined limits and derivatives precisely, Cauchy and Riemann did likewise for integrals, and Dedekind and Weierstrass for real numbers. For instance, it was now understood that differentiable functions are continuous, and continuous functions integrable, but both converses fail. In the 20th century, non-standard analysis belatedly legitimized infinitesimals, while the development of computers increased the applicability of calculus.  
These view should be as organized and as concise as possible.  The old history system includes many redirect pages in the view, and doesn't have things like favicons and good grouping that might make it easier to locate pages of interest.  As a result, favicons for every page will be stored for viewing in history, and we will provide better domain-style grouping and a calendar for picking days rather than a tree view.


III DIFFERENTIAL CALCULUS
The second most user-visible history feature will be keyword searching.  This will be essentially unchanged from the old system, which is too bad because the current system sucks.  Because there is no word index, we do a brute-force search over all titles for the keyword in question.  While sometimes useful, this is not always what you want, resulting in a number of third-party history indexing products.  Long term, it would be nice to index pages you visit.  For the coming release it might be nice to provide a hook for third-party products so they can provide text searching capabilities to the places system.


Differential calculus is concerned with rates of change. Suppose that two variables x and y are related by an equation y = f(x) for some function f, indicating how the value of y depends on the value of x. For instance, x could represent time, and y the distance travelled by some moving object at time x. A small change h in x, from a value x0 to x 0 + h, induces a change k in y from y 0 = f(x0) to y0 + k = f(x0 + h); thus k = f( x0 + h) - f(x0), and the ratio k/h represents the average rate of change of y as x increases from x0 to x0 + h. The graph of the function y = f(x) is a curve in the xy-plane, and k/h is the gradient of the line AB through the points A = (x 0,y0) and B = (x0 + h,y 0 + k) on this curve; this is shown in Figure 1, where h = AC and k = CB, so k/h is the tangent of the angle BAC.  
The last major change will be the ability to define complex queries over the history and bookmark systems. Most users will not be generating such queries, but they can be used to provide a variety of nice features. For example, a query folder can be created for "My favorite pages" which contains the 20 most frequently visited URLs in history.  A query for the most frequently visited pages that are not bookmarked provides a good candidate list for suggesting bookmarks.


If h approaches 0, with x0 fixed, then k/h approaches the instantaneous rate of change of y at x0; geometrically, B approaches A along the graph of y = f(x), and the line AB approaches the tangent AT to the graph at A, so k/h approaches the gradient of the tangent (and hence of the curve) at A. We therefore define the derivative f′(x0) of the function y = f(x) at x0 to be the value (or limit) which k/h approaches as h approaches 0, written:  
== Example ==
This screenshot shows some of these ideas:


This represents both the rate of change of y and the gradient of the graph at A. When x is time and y is distance, for example, the derivative represents instantaneous velocity. Positive, negative, or zero values of f′(x0) respectively indicate that f(x) is increasing, decreasing, or stationary at x 0. The derivative is a new function f′(x) of x, sometimes denoted by dy/dx, df/dx or Df. For example, let y = f(x) = x2, so the graph is a parabola. Then
[[File:History grouping.png]]<br />''Image credit: Brett Wilson, http://maxradi.us/''


so k/h = 2x0 + h, which approaches 2x 0 as h→ 0. Thus the gradient when x = x0 is 2x0, and the derivative of f(x) = x2 is f′(x) = 2 x. Similarly xm has derivative mxm-1 for each fixed m. The derivatives of all commonly occurring functions are known: see the table for some examples.  
All pages have favicons for better visual differentiation and grouping.  Pages are arranged into "sessions" consisting of followed links.  Typing a new URL or following a bookmark gives you a new "session" which appears as a dotted line in this image. This is an actual screenshot, but the ugly redirect pages have been manually removed.  Proper tracking of redirect pages is pending, see [[Browser_History:Redirects]].


Some words of caution are needed here. Firstly, to find the derivative we make h small (positive or negative), but never zero: this would give k/h = 0/0, which is meaningless. Secondly, not every function f has a derivative at each x0, since k/h need not approach a limit as h→ 0. For instance, f(x) = | x| has no derivative at x0 = 0, since k/h is 1 or -1 as h> 0 or h< 0; geometrically, the graph has a corner (and hence no tangent) at A = (0,0). Thirdly, although the notation dy/dx suggests the ratio of two numbers dy and dx (denoting infinitesimal changes in y and x), it is really a single number, the limit of a ratio k/h as both terms approach 0.  
An informal goal is to save and expose enough information for a variety of interesting visualization or suggestion extensions to be created. For example:
* A graph of the user's browsing behavior, showing which pages were visited and when and how the user navigated between them.
* Suggesting links. By looking at what link the user frequently visits from the current page, perhaps some sort of suggestion or prefetching could take place. For some pages, this indication would be very strong. Other AI-like things could potentially be done. For example, many people visit the same set of sites when they open the browser in the morning. It would be nice if something could be done to make this nicer.


Differentiation is the process of calculating derivatives. If a function f is formed by combining two functions u and v, its derivative f′ can be obtained from u and v by simple rules; for instance the derivative of a sum is the sum of their derivatives, that is, if f = u + v (meaning that f(x) = u(x) + v(x) for all x) then f′ = u′ + v′, and a similar rule (u - v)′ = u′ - v′ applies to differences. If a function is multiplied by a constant, then so is its derivative, that is, (cu)′ = cu′ for any constant c. The rules for products and quotients are less obvious: if f = uv then f′ = uv′ + u′v, and if f = u/v then f′ = ( u′v-uv′)/v2 provided v(x) ≠ 0.  
== Bookmark pages and notes ==
Firefox 2 will likely have some kind of bookmark button for easy bookmarking of pages. This button will be illuminated whenever the user is on a bookmarked page, but redirects provide an extra problem. If the bookmark page is moved, or  more commonly, if the user manually sets a bookmark, for example "amazon.com", they will be redirected. This destination page won't be bookmarked, so the button will not be illuminated. The old history system also had problems with redirected bookmarks because the favicons would never get set properly. Therefore, we need to know more about redirects to handle these cases correctly.


Using these rules, quite complicated functions can be differentiated: for instance x2 and x5 have derivatives 2 x and 5x4, so the function 3x2 - 4x5 has derivative (3x2 - 4x 5)′ = (3x2)′ - (4x5)′ = 3.( x2)′ - 4.(x5)′ = 3.(2x) - 4.(5 x4) = 6x - 20x4. More generally, any polynomial f(x) = a0 + a1x + ... + anxn has derivative f′(x) = a1 + 2a2 x + ... + nanxn-1; in particular, constant functions have derivative 0.  
If other annotations are stored with pages, redirects could also pose problems. For example, if I have notes on a page and it is redirected, my notes will not be available to me. There should be a way for services using annotation functionality to see if the current page was ever redirected from somewhere else, and to also check those pages for annotations of interest. More on redirects can be read on [[Browser_History:Redirects]].


If y = u(z) and z = v(x), so that y depends on z and z depends on x, then y = u(v(x)), so y depends on x, written y = f(x) where f is the composition of u and v; the chain rule states that dy/dx = (dy/dz).(dz/dx), or equivalently f′(x) = u′(v(x)). v′(x). For instance, if y = ez where e = 2.718 ... is the exponential constant, and z = ax where a is any constant, then y = eax; now dy/dz = ez (see table) and dz/dx = a, so dy/dx = aeax.  
== Required information ==
It will be necessary to collect and store much more information than is done in the old history system. To extract information about sequences of page views, we need to know exactly which page generated each navigation. Because the same page could be opened in more than one tab/window at once, the URL is not sufficient for this task.


Many problems can be formulated and solved using derivatives. For example, let y be the amount present in a sample of radioactive material at time x. According to theory and observation, the sample decays at a rate proportional to the amount remaining, that is, dy/dx = ay for some negative constant a. To find y in terms of x, we therefore need a function y = f(x) such that dy/dx = ay for all x. The most general such function is y = ce ax where c is a constant. Since e0 = 1 we have y = c when x = 0, so c is the initial amount present (at time x = 0). Since a<0 we have eax→ 0 as x increases, so y→ 0, confirming that the sample gradually decays to nothing. This is an example of exponential decay, shown in figure 3a. If a is a positive constant, we obtain the same solution y = ceax, but as time progresses y now increases rapidly (since eax does when a>0); this is exponential growth, shown in figure 3b and observed in nuclear explosions and certain animal communities, where growth-rate is proportional to population.  
== Database design ==
The history will be stored in two SQL tables&nbsp;— "moz_history" and "moz_historyvisit".


IV INTEGRAL CALCULUS
The "history" table essentially duplicates the functionality of the current [[Mork]] history table: It contains:
* Unique ID (primary key)
* URL
* Title
* User-defined title
* Visit count
* Last visit date
* Host name (see below)
* Hidden (bool)
* Typed (bool)


Integral calculus involves the inverse process to differentiation, called integration. Given a function f, we seek a function F with derivative F′ = f; this is an integral or antiderivative of f, written F(x) = ∫ f(x)dx or simply F = ∫f dx (a notation we will explain later). Tables of derivatives can be used for integration: thus x2 has derivative 2x, so 2x has x2 as an integral. If F is any integral of f, the most general integral of f is F + c, where c is an arbitrary constant called the constant of integration; this is because a constant has derivative 0, so ( F + c)′ = F′ + c′ = f + 0 = f. Thus ∫2xdx = x2 + c, for instance.  
Note that the host name is stored backwards, unlike the current table. This is done so that it can be indexed alphabetically and we can quickly pull out all pages within any "mozilla.org" domain by asking for hostname fields that begin with "gro.allizom." A period is always appended to the reversed hostname.


The basic rules for integrating compound functions resemble those for differentiation. The integral of a sum or difference is the sum or difference of their integrals, and likewise for multiplication by a constant. Thus x = y.2x has integral yx2, and similarly ∫xm dx = xm+1/(m + 1) for any m≠ -1. (We exclude m = -1 to avoid dividing by 0; the natural logarithm ln|x| is an integral of x-1 = 1/x for any x≠ 0.) Integration is generally harder than differentiation, but many of the more familiar functions can be integrated by these and other rules (see the table).  
The second table stores transitions between pages, which is information unavailable now.
* Source *visit* ID
* Destination page ID
* Time
* Transition type


A classic application of integration is to calculate areas. Let A be the area of the region between the graph of a function y = f(x) and the x-axis, for a≤x≤b. For simplicity, assume that f(x) ≥ 0 between a and b. For each x≥a, let L(x) be the area of this region to the left of x, so we need to find A = L(b). First we differentiate L(x). If h is a small change in x, the region below the graph between x and x + h is approximately a rectangle of height f(x) and width h (see figure 4); the corresponding change k = L(x + h) - L(x) in area is therefore approximately f(x)h, so k/h is approximately f(x). As h→ 0 these approximations become more exact, so k/h→ f(x) and hence L′(x) = f(x). Thus L is an integral of f, so if we know any integral F of f then L = F + c for some constant c. Now L(a) = 0 (since the region to the left of x vanishes when x = a), so c = -F(a) and hence L(x) = F(x) - F(a) for all x≥a. In particular, A = L(b) = F(b) - F(a), written
Transition type will hopefully contain info about whether the link was a redirect clicked, opened in new tab/window, typed, just an image on the page, etc.


This is the Fundamental Theorem of Calculus, valid whenever f is continuous between a and b, provided we assign negative areas to any regions below the x-axis, where f(x) < 0. (Continuity means that f(x) → f(x0) as x→x0, so f has an unbroken graph.) For example f(x) = x2 has integral F( x) = x3/3, so
Because the source for each visit is another visit, not just a URL, we can track transitions between specific viewings of pages.


and with this formula it is possible to work out a large number of useful quantities. For example, the volume of a cone of height h and radius r can be found by evaluating the expression ∫p (rx / h)2 dx between the limits x = 0 and x = 1; this is because the radius at a distance x below the apex of the cone is rx / h and the cross-sectional area is p (rx / h)2. The result is p r2 h / 3.  
== Performance ==
The SQLite database system does file-level locking to manage access to the database. This means that for every transaction (even read), the database file must be checked for changes to update the cache, and locked. Most systems can do this quickly, although this operation is clearly not good for performance. Some Linux/Unix users may have their home directories and profiles stored on a network-mounted drive. In these cases, individual database accesses can be relatively expensive.


Here is a definite integral of f; this is a number, whereas the indefinite integral ∫f(x)dx is a function F(x) (more precisely, a set of functions F(x) + c). The symbol ∫ (a 17th-century S) suggests summation of areas f(x)dx of infinitely many rectangles of height f(x) and infinitesimal width dx; more precisely, is the limit of a sum of finitely many rectangular areas, as their widths approach 0.  
Therefore, it is very important that database access be minimized and grouped in transactions. A new query/result format will be used instead of RDF. RDF requires many extra transactions, to retrieve the ID of a URL, then to retrieve the title, then to retrieve the visit date, etc. This makes performance, especially for network users, unacceptably slow. The new query system will retrieve one set of results for one query, meaning that there will be only one database access or the accesses can be internally grouped in a transaction. Initial tests of this system show that is is plenty fast enough over a good network and virtually instant for local users (the majority).


The derivative dy/dx = f′(x) of a function y = f(x) can be differentiated again to obtain a second derivative, denoted by d2y/dx 2, f′′(x) or D2f. If x is time and y is distance travelled, for instance, so that dy/dx is velocity v, then d2y/dx2 = dv/dx is rate of change of velocity, that is, acceleration. By Newton's second law of motion, a body of constant mass m subject to a force F undergoes an acceleration a satisfying F = ma. For example, if the body falls under the gravitational force F = mg (where g is the gravitational field strength) then ma = F = mg implies a = g, so dv/dx = g. Integrating, we have v = gx + c where c is constant; putting x = 0 shows that c is the initial velocity. Integrating dy/dx = v = gx + c, we have y = ygx2 + cx + b where b is constant; putting x = 0 shows that b is the initial value of y.  
=== Link coloring ===
Link coloring is perhaps the most important performance issue, since this directly affects page load time. A naive approach requires querying the database for each link that comes in. Even when the history database is local, asking the OS for a file lock for each link on a page will be very bad for performance.


Higher derivatives f(n)(x) = dn y/dxn = Dnf of f( x) are found by successively differentiating n times. Taylor's Theorem states that if f(x) can be represented as a power series f(x) = a0 + a1x + a2x2 + ... + anx n + ... (where a0,a1, ... are constants), then an = f(n)(0)/n! where 0!=1 and n!= 1 × 2 × 3 × ... × n for all n≥ 1. Most commonly used functions can be represented as power series; for instance if f(x) = ex then f(n)(x) = ex for all n, so f(n)(0) = e0 = 1 and hence:
To solve this problem, URLs will be loaded into an in-memory database when the application starts (or, perhaps, the first time they are required). These links can be sorted so they can be accessed very quickly, and require no OS calls or file locks. This list will only contain URLs, and none of the extra information such as visit dates, visit counts, and titles.


Because these data will reside in memory, it is important to not take too much space. If we succeed in the goal of allowing users to keep large amounts of history, size could be significant. We will, therefore, want to limit the number of URLs that we bring into memory, either allowing, for example, the past 10,000 URLs, or the past 30 days of visits.


A two-level approach may improve perceived size. Most URLs in the table are advertisements and random images from Web sites to which the user will never see a link, and probably wouldn't even care about if they did. Then there are the top-level pages (with hidden=false) that are the top-level URLs that the user actually visited. They are much more likely to care about these top-level links. To save space and speed things up, we could keep the past X URLs in memory, plus the previous Y top-level URLs. It is unlikely that users will even notice that most of their history links are not being used for link coloring.


 
=== Database size ===
V PARTIAL DERIVATIVES
I just imported my [[Mork]] history database file into the equivalent SQLite format. The SQLite file size was one-tenth of the original Mork file. So, even though we will be storing more information (every time a page is visited) overall database size will probably be smaller.
 
Functions of several variables can also have derivatives. Let z = f(x,y), so z depends on x and y. Temporarily holding y constant we can regard z as a function of x, and differentiating this gives the partial derivative ¶z/¶x = ¶f/¶x; similarly, keeping x constant and differentiating with respect to y we obtain ¶z/¶y = ¶f/¶y. For instance, if z = x2 -xy + 3y2 then ¶z/¶x=2x-y and ¶z/¶y= -x+6y. Geometrically, an equation z = f(x,y) defines a surface in three-dimensional space; if the x- and y-axes are horizontal and the z-axis is vertical, then ¶z/¶x and ¶z/¶y represent the gradients of this surface at the point (x,y,z) in the directions of the x- and y-axes. Partial derivatives can also be calculated for functions of more than two variables, by keeping all but one variable temporarily constant; higher partial derivatives can be defined by repeating this operation. Partial derivatives are important in applied mathematics, where functions often depend on several variables such as space and time.
Confirmed users
24

edits