638
edits
| Line 35: | Line 35: | ||
Removing an entry simply replaces its <code>key</code> with some sentinel while leaving the <code>chain</code> intact. | Removing an entry simply replaces its <code>key</code> with some sentinel while leaving the <code>chain</code> intact. | ||
== Theory == | |||
'''Memory.''' Every map data structure must store the keys and values. A map with ''N'' entries must use at least ''NE'' memory, where ''E'' is the size of a key-value pair (we assume the data can’t be compressed). | |||
In the following discussion we neglect the effect of deletions, which can cause both kinds of table to have extra memory that isn't in use at a given moment. | |||
An open addressing hash table has additional overhead depending on the fill factor. <code>OpenTable</code> in particular has a maximum fill factor ''F''=3/4. When it becomes more than 3/4 full, it doubles in size, becoming just over 3/8 full. The amount of overhead in an <code>OpenTable</code> is ''(1/f)NE'', where ''f'' is the actual current fill ratio of the table and ''F/2 < f ≤ F''. | |||
A Close table has three sources of overhead: the entire <code>hashTable</code>, the <code>chain</code> pointers in the <code>dataTable</code>, and the unused portion of the <code>dataTable</code>. In <code>CloseTable</code>, the <code>chain</code> pointer causes each entry to require ''(3/2)E'' memory rather than ''E''. The <code>dataTable</code> doubles in size when it becomes completely full, so its size is ''(3/2)(1/u)NE'' where ''u'' is the proportion of the <code>dataTable</code> that is in use and ''1/2 < u ≤ 1''. The <code>hashTable</code> is always 1/16 the size of the <code>dataTable</code> (1/8 on 64-bit platforms), so the total size of both tables is ''(1+1/16)(3/2)(1/u)NE'' (substitute 1/8 on 64-bit platforms). | |||
For back-of-envelope purposes, then, ''f'' may be expected to be about ''3F/4'' (halfway between F/2 and F) and ''u'' about 3/4 on average; plugging these numbers in, we expect an <code>OpenTable</code> to have about (1/(3F/4))×100%-100% = 78% memory overhead, and a <code>CloseTable</code> about 113% (125% on 64-bit), on average. This would mean that the <code>CloseTable</code> would be 20% larger on average (27% larger on 64-bit). However, see the Results section for a picture of how the two implementations compare in practice. | |||
'''Speed.''' We consider the speed of lookups only. The <code>get</code>, <code>has</code>, <code>set</code>, and <code>delete</code> operations all start with a lookup. We will quantify the speed of lookups in terms of how many memory locations must be accessed. To simulate locality effects, we treat multiple reads from a single entry, or from the table object, as a single read. | |||
In an open addressing hash table with fill factor ''f'', an unsuccessful lookup performs about ''1/(1-f)'' probes.[https://en.wikipedia.org/wiki/Hash_table#Performance_analysis] In an <code>OpenTable</code>, at worst ''f=F=3/4'' and so an unsuccessful lookup performs 4 probes on average. In the usual case, ''f=3F/4=9/16'', so about 2.3 probes on average. | |||
<table> | |||
<tr> | |||
<th rowspan=2>Data structure</th> | |||
<th colspan=2>Expected cost of unsuccessful lookup (miss)</th> | |||
<th colspan=2>Expected cost of successful lookup (hit)</th> | |||
</tr> | |||
<tr> | |||
<th>Maximum load</th> | |||
<th>Typical load</th> | |||
<th>Maximum load</th> | |||
<th>Typical load</th> | |||
</tr> | |||
<tr> | |||
<td><code>OpenTable</code> | |||
<td>5 | |||
<td>3.3 | |||
<td>- | |||
<td>- | |||
</tr> | |||
<tr> | |||
<td><code>CloseTable</code> | |||
<td>- | |||
<td>- | |||
<td>- | |||
<td>- | |||
</tr> | |||
</table> | |||
= Method = | = Method = | ||
edits