Places:FsyncApproach: Difference between revisions
| Comrade693 (talk | contribs)  (→Partitioning Approach:  Updated cons) | Comrade693 (talk | contribs)   (→Long Open Transaction Approach:  Updated cons and swag) | ||
| (11 intermediate revisions by 3 users not shown) | |||
| Line 11: | Line 11: | ||
| Pros: | Pros: | ||
| * Temp tables are in-memory, so there are no filesystem writes when inserting into the temp table. | |||
| Cons: | Cons: | ||
| * Adds significant complexity to queries.  However, this can be mitigated greatly by using VIEWs that join the temporary table with the permanent one.  VIEWs cannot be written to though, so we'll either have to figure out which table to update (which can cause writes if it is the permanent table) or copy the data into the temp table.  For the latter solution, when we sync the temporary table we would have to use INSERT OR REPLACE to update the permanent table.  I'm not sure how this will effect UNION statements however (ordering may matter here - temp table first?  UNION drops duplicated data as far as I know). | * Adds significant complexity to queries.  However, this can be mitigated greatly by using VIEWs that join the temporary table with the permanent one.  VIEWs cannot be written to though, so we'll either have to figure out which table to update (which can cause writes if it is the permanent table) or copy the data into the temp table.  For the latter solution, when we sync the temporary table we would have to use INSERT OR REPLACE to update the permanent table.  I'm not sure how this will effect UNION statements however (ordering may matter here - temp table first?  UNION drops duplicated data as far as I know). | ||
| * Have to track primary key max index ourself for inserts, since the temp table won't be able to AUTOINCREMENT like the main table.  We can mitigate this by having sqlite start at a certain value.  See [http://www.mail-archive.com/sqlite-users@sqlite.org/msg35131.html this e-mail thread] and [http://sqlite.org/autoinc.html this page] for more details. | * Have to track primary key max index ourself for inserts, since the temp table won't be able to AUTOINCREMENT like the main table.  We can mitigate this by having sqlite start at a certain value.  See [http://www.mail-archive.com/sqlite-users@sqlite.org/msg35131.html this e-mail thread] and [http://sqlite.org/autoinc.html this page] for more details. | ||
| * How do we update existing data? [http://www.mail-archive.com/sqlite-users@sqlite.org/msg35184.html discussion list question] | |||
| Tasks: | Tasks: | ||
| Line 22: | Line 23: | ||
| * modify all queries that interact w/ the partitioned table(s) | * modify all queries that interact w/ the partitioned table(s) | ||
| * periodic flush of temp to permanent table (ensuring pk sync) | * periodic flush of temp to permanent table (ensuring pk sync) | ||
| swag: 3+ weeks | |||
| =Long Open Transaction Approach= | =Long Open Transaction Approach= | ||
| Line 38: | Line 41: | ||
| Pros: | Pros: | ||
| * Less overall fsycns | |||
| * Easy (almost painless) | |||
| Cons: | Cons: | ||
| * Can lose data if we crash with an open transaction | |||
| * Still involves writes to disk | |||
| * If anyone else (non-places) fsyncs, we still lose | |||
| * Can no longer make a group of operations atomic with transactions (no nested transaction support). | |||
| swag: 1-2 weeks | |||
| =Split Database Approach= | =Split Database Approach= | ||
| Line 45: | Line 56: | ||
| From Shaver: | From Shaver: | ||
| <blockquote>We could also just use two databases, one for history (with sync=off) and one for bookmarks (with sync=normal) and query against them both.  Then you would get an fsync/s_f_r only when you updated a bookmark, and not during normal browser operation, and we would not be risking loss of bookmark data during crash.</blockquote> | <blockquote>We could also just use two databases, one for history (with sync=off) and one for bookmarks (with sync=normal) and query against them both.  Then you would get an fsync/s_f_r only when you updated a bookmark, and not during normal browser operation, and we would not be risking loss of bookmark data during crash.</blockquote> | ||
| Pros: | |||
| * The common task of adding a history visit is cheaper | |||
| Cons: | |||
| * Likely false assumption that history is less important than bookmarks (smart location bar changed this) | |||
| * Tight coupling between the two databases | |||
| * New history visits require a write to both (likely) | |||
| swag: 1-2+ months | |||
| =Grow File in Large Hunks Approach= | |||
| Summary from Andrew Morton: | |||
| <blockquote> | |||
| If the file does need to grow during regular operation then I'd suggest | |||
| that it be grown in "large" hunks - say, extend it (with write()) by a | |||
| megabyte at a time.  Then fsync it to get the metadata written.  Then | |||
| proceed to use sync_file_range() for the now-non-extending small writes. | |||
| So the large-write-plus-fsync is "rare". | |||
| </blockquote> | |||
| Pros: | Pros: | ||
| * Can reasonably easily be implemented in vfs sqlite3_io_methods. | |||
| Cons: | Cons: | ||
| * Only a solution for Linux. | |||
| * Still waits on ([https://bugzilla.mozilla.org/show_bug.cgi?id=421482#c152 less significant disk access]) regularly, and so may need to be combined with async IO or another approach for optimal behavior. | |||
| swag: 2 weeks | |||
| =Journaling Virtual Filesystem Approach= | |||
| Summary from Karl: | |||
|   <blockquote> | |||
|   By implementing a virtual filesystem with data journaling, the vfs can sync | |||
|   and empty its journal when it chooses.  The filesystem would satisfy the | |||
|   SQLITE_IOCAP_SEQUENTIAL characteristics and so the database would remain | |||
|   consistent even when it has synchronous == OFF. | |||
|   </blockquote> | |||
|   <blockquote> | |||
|   Such a virtual filesystem implemented using sqlite3_io_methods could log | |||
|   changes immediately to its own replay journal file on the underlying | |||
|   filesystem, and maintain a lookup table of the pages that have not yet been | |||
|   written to their specific real files. | |||
|   </blockquote> | |||
|   <blockquote> | |||
|   Pages in the journal would need to be occasionally synced | |||
|   (possibly with sync_file_range), and then copied to their real files.  The | |||
|   pages in the real files would need to be synced before their corresponding | |||
|   journal pages become free to be reused.  In a single-thread implementation, | |||
|   this could be done when the lookup table is too large to maintain | |||
|   efficiently and then the journal can be emptied.  In a background-thread | |||
|   implementation the journal could be a ring buffer and all syncing would | |||
|   take place on the background thread. | |||
|   </blockquote> | |||
|   <blockquote> | |||
|   (This is related to a | |||
|   [http://www.mail-archive.com/sqlite-users@sqlite.org/msg35212.html replay journal discussion].) | |||
|   </blockquote> | |||
| Pros | |||
| * All databases could benefit without requiring individual code changes. | |||
| Cons | |||
| * More complicated than Grow File in Large Hunks Approach | |||
| * Data is written twice and so there is twice as much IO.  But this could be counter balanced by disabling sqlite's journal provided transaction completion can be detected by the vfs. | |||
| swag: 2-3 months | |||
Latest revision as of 15:02, 16 July 2008
Problem
Below are potential solutions that were discussed on an email thread between Mozilla, Linux and SQLite developers.
Partitioning Approach
Summary from Dr. Hipp:
Presumably you have some large tables in your main database file that you consult frequently and update occasionally. I'm suggesting splitting each table in two. The long-term persistent data is stored in the persistent database. Recent changes are stored in a separate TEMP tables. As I pointed out earlier, TEMP tables do not use fsync. So you can write to temp tables as much as you want without getting into fsync problems. But if you put your changes in TEMP tables, that means whenever you query the database, you have to do a more complex query that pulls in information from the main table and also overlays any changes or additions from TEMP tables. [sic] Then you have a separate thread that every minute or so transfers all of the accumulated information in the TEMP tables over to the main persistent database as a single transaction. That gives you a total of 2 or 3 fsyncs per minute, which even the most broken file system should be able to handle with ease.
Pros:
- Temp tables are in-memory, so there are no filesystem writes when inserting into the temp table.
Cons:
- Adds significant complexity to queries. However, this can be mitigated greatly by using VIEWs that join the temporary table with the permanent one. VIEWs cannot be written to though, so we'll either have to figure out which table to update (which can cause writes if it is the permanent table) or copy the data into the temp table. For the latter solution, when we sync the temporary table we would have to use INSERT OR REPLACE to update the permanent table. I'm not sure how this will effect UNION statements however (ordering may matter here - temp table first? UNION drops duplicated data as far as I know).
- Have to track primary key max index ourself for inserts, since the temp table won't be able to AUTOINCREMENT like the main table. We can mitigate this by having sqlite start at a certain value. See this e-mail thread and this page for more details.
- How do we update existing data? discussion list question
Tasks:
- determine which table(s) need to be partitioned
- initialize temp table(s)
- modify all queries that interact w/ the partitioned table(s)
- periodic flush of temp to permanent table (ensuring pk sync)
swag: 3+ weeks
Long Open Transaction Approach
Summary from Shaver:
the long-open-transaction with forced commit for a bookmark addition or other thing needing durability
From Shaver's friend:
... don't have sqlite do a sync transaction each time, instead use batching of transactions at the sqlite level. I suspect you aren't SO desparate
that every last page visit is saved in case of a crash, but rather just want to avoid corrupting the database. This can make a huge difference for sqlite performance, and can amortize the sync overhead over many operations. This will still allow sqlite to be ACID but only periodically doing flushes to disk (you would probably close and re-open transactions
for N bytes/pages/seconds).
Pros:
- Less overall fsycns
- Easy (almost painless)
Cons:
- Can lose data if we crash with an open transaction
- Still involves writes to disk
- If anyone else (non-places) fsyncs, we still lose
- Can no longer make a group of operations atomic with transactions (no nested transaction support).
swag: 1-2 weeks
Split Database Approach
From Shaver:
We could also just use two databases, one for history (with sync=off) and one for bookmarks (with sync=normal) and query against them both. Then you would get an fsync/s_f_r only when you updated a bookmark, and not during normal browser operation, and we would not be risking loss of bookmark data during crash.
Pros:
- The common task of adding a history visit is cheaper
Cons:
- Likely false assumption that history is less important than bookmarks (smart location bar changed this)
- Tight coupling between the two databases
- New history visits require a write to both (likely)
swag: 1-2+ months
Grow File in Large Hunks Approach
Summary from Andrew Morton:
If the file does need to grow during regular operation then I'd suggest that it be grown in "large" hunks - say, extend it (with write()) by a megabyte at a time. Then fsync it to get the metadata written. Then proceed to use sync_file_range() for the now-non-extending small writes. So the large-write-plus-fsync is "rare".
Pros:
- Can reasonably easily be implemented in vfs sqlite3_io_methods.
Cons:
- Only a solution for Linux.
- Still waits on (less significant disk access) regularly, and so may need to be combined with async IO or another approach for optimal behavior.
swag: 2 weeks
Journaling Virtual Filesystem Approach
Summary from Karl:
By implementing a virtual filesystem with data journaling, the vfs can sync and empty its journal when it chooses. The filesystem would satisfy the SQLITE_IOCAP_SEQUENTIAL characteristics and so the database would remain consistent even when it has synchronous == OFF.
Such a virtual filesystem implemented using sqlite3_io_methods could log changes immediately to its own replay journal file on the underlying filesystem, and maintain a lookup table of the pages that have not yet been written to their specific real files.
Pages in the journal would need to be occasionally synced (possibly with sync_file_range), and then copied to their real files. The pages in the real files would need to be synced before their corresponding journal pages become free to be reused. In a single-thread implementation, this could be done when the lookup table is too large to maintain efficiently and then the journal can be emptied. In a background-thread implementation the journal could be a ring buffer and all syncing would take place on the background thread.
(This is related to a replay journal discussion.)
Pros
- All databases could benefit without requiring individual code changes.
Cons
- More complicated than Grow File in Large Hunks Approach
- Data is written twice and so there is twice as much IO. But this could be counter balanced by disabling sqlite's journal provided transaction completion can be detected by the vfs.
swag: 2-3 months