User:Archaeopteryx/Concept:Personal web

Today the internet is an essential part of work and life. People have personal browsing habits, and some of the things they see and read they want to remember later or pass on to someone else. These pages describe very well what the user is interested in, so they should be used for customization. Furthermore, addresses are not available forever, e.g. because of:

  • subscription-only or temporarily available content
  • broken or missing redirects after a site redesign/restructuring
  • removed content or sites

Scrapbook and Scrapbook+ (the latter adds performance improvements and was created because the author of Scrapbook did not respond to the Scrapbook+ author) are first concepts for storing pages offline; Scrapbook also won a Firefox contest in the past. However, the code has not improved for a while, performance is poor for large files, and proper bookmarks integration is missing.

Goals

  • Integration into bookmarks
  • Weave integration
  • Backend support for fulltext address bar search (a minimal indexing sketch follows this list)
  • Backend for extensions that detect updated web pages by comparing source code
  • Making documents and files available offline while still being able to update them via Firefox
  • Save multiple captures of a web page
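
The fulltext search backend could be approached roughly as in the following sketch, which indexes the extracted text of each capture in SQLite so the address bar and the Library can query it. This is an illustrative Python sketch, assuming FTS5 is available; the table layout and names such as capture_id are made up for the example and are not part of Places or any existing Firefox schema.

  import sqlite3

  # Minimal sketch of a fulltext index over captured pages (illustrative only).
  db = sqlite3.connect("captures.sqlite")
  db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS capture_text "
             "USING fts5(capture_id UNINDEXED, title, body)")

  def index_capture(capture_id, title, body):
      """Store the extracted text of one capture so it becomes searchable."""
      db.execute("INSERT INTO capture_text (capture_id, title, body) VALUES (?, ?, ?)",
                 (capture_id, title, body))
      db.commit()

  def search(query, limit=10):
      """Return the best-matching captures for a fulltext query."""
      return db.execute(
          "SELECT capture_id, title FROM capture_text WHERE capture_text MATCH ? "
          "ORDER BY rank LIMIT ?", (query, limit)).fetchall()

The address bar would call something like search() as the user types and merge the results with normal history and bookmark matches.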


User interface

Sidebar

  • Different color and underlining for captured bookmarks
  • Opening with Alt + click? Or right-click (with default opening behavior for the different click types)
  • If more than one capture exists: show as a folder/bookmark hybrid (expandable folder)

Address bar

  • First capture: middle-click
  • Deleting: right-click menu
  • Indicate a captured page with an icon (bookmark star with a book or page in the background); Alt + right arrow switches to the latest captured version

Search

  • Address bar
  • Sidebar
  • Integrate into Library
  • Hooks for desktop search engines
  • Allow scripts to set metadata from files, e.g. author, date, title, description (a metadata-extraction sketch follows this list)
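
As an illustration of such a metadata hook, the sketch below pulls title, author, description and date out of a captured HTML file using only the Python standard library. The keys it returns are assumptions about what the storage backend might expect, not an existing interface.

  from html.parser import HTMLParser

  # Rough sketch of a metadata-extraction hook for captured HTML files.
  class MetaExtractor(HTMLParser):
      def __init__(self):
          super().__init__()
          self.meta = {}
          self._in_title = False

      def handle_starttag(self, tag, attrs):
          attrs = dict(attrs)
          if tag == "title":
              self._in_title = True
          elif tag == "meta" and attrs.get("name") in ("author", "description", "date"):
              self.meta[attrs["name"]] = attrs.get("content", "")

      def handle_endtag(self, tag):
          if tag == "title":
              self._in_title = False

      def handle_data(self, data):
          if self._in_title and data.strip():
              self.meta.setdefault("title", data.strip())

  def extract_metadata(path):
      """Return a dict like {"title": ..., "author": ...} for one captured file."""
      parser = MetaExtractor()
      with open(path, encoding="utf-8", errors="replace") as f:
          parser.feed(f.read())
      return parser.meta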

Updating

  • At least hooks for automated updates
  • Update all captures in a folder if the user desires (a minimal change-detection sketch follows this list)
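
One possible shape for such a hook, and for the source-comparison backend mentioned in the goals, is sketched below: the stored capture and the live page are hashed and compared. This is a simplification; a real implementation would need to normalise dynamic parts of the page (timestamps, rotating ads) before comparing, and the function name is made up for the example.

  import hashlib
  import urllib.request

  # Sketch of a change-detection hook: compare a stored capture with the
  # live page by hashing both versions of the source.
  def page_changed(stored_html, url):
      with urllib.request.urlopen(url, timeout=30) as resp:
          live_html = resp.read()
      return hashlib.sha256(stored_html).digest() != hashlib.sha256(live_html).digest()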

Processing

Input filter/manipulation

  • HTTPS pages should be excluded by default
  • HTML5 distinguishes content and non-content parts of pages; the latter should not be indexed
  • Often only part of a web page is interesting to the user, i.e. the remaining part is navigation, advertising or unrelated material. People usually want to get at the main content, so rule-based capturing of one or more parts of a web page would make sense. With HTML5, content can be classified, but users get the best results if they create the filters themselves, e.g. with XPath support and JavaScript manipulation. If possible, the original page structure should be stored to allow as much post-capture processing as possible.
  • Non-visible (= hidden) content (nodes) should not be stored (many sites hide content this way for their print versions).
  • All content that is not consumed by output filters should be stored in one archive (.jar) per page, because many tiny files cause a large overhead due to the file system's allocation unit (cluster) size, e.g. 4096 bytes on NTFS. A combined capture sketch follows this list.
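
To make the filtering and packaging ideas above concrete, here is a rough Python sketch, assuming lxml is available for XPath over HTML: the user's rule selects the main content, obviously hidden nodes are dropped, and the result is written into a single .jar (zip) archive for the capture. The rule syntax, file layout and function name are assumptions for illustration, not a specification.

  import zipfile
  from lxml import html  # assumption: lxml is available for XPath over HTML

  # Rule-based input filter sketch: keep only the user-selected content,
  # drop hidden nodes, and pack the result into one .jar archive per page.
  def capture(page_source, content_xpath, jar_path):
      doc = html.fromstring(page_source)

      # Keep only the part of the page the user's rule selects.
      selected = doc.xpath(content_xpath)
      if not selected:
          raise ValueError("content rule matched nothing")
      content = selected[0]

      # Drop nodes that are hidden anyway (common on print-friendly pages).
      for hidden in content.xpath('.//*[@hidden or contains(@style, "display:none")]'):
          hidden.drop_tree()

      with zipfile.ZipFile(jar_path, "w", zipfile.ZIP_DEFLATED) as jar:
          jar.writestr("index.html", html.tostring(content, encoding="unicode"))
          # A fuller version would also store images, stylesheets and the
          # original page structure to keep post-capture processing possible.

  # Example rule: capture(source, '//div[@id="article"]', "capture-0001.jar")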

Output filter/manipulation

Certain pages or parts of pages should be accessible from normal file browsers, e.g. media files the user wants to play in an external media player, comic strip images, PDF files, or HTML content that is related to a topic but has to be stored outside the document (e.g. for legal reasons). Allowing exported files to be processed by an external program (e.g. a media converter) would be a nice-to-have.
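
A rough sketch of such an output filter, assuming the one-archive-per-page layout described under the input filters: selected members of the capture's .jar archive are copied into a plain folder and can optionally be handed to an external command. The suffix list and the converter parameter are purely illustrative.

  import subprocess
  import zipfile
  from pathlib import Path

  # Output filter sketch: export selected files from a capture archive so
  # external programs (media players, PDF viewers) can see them.
  def export(jar_path, dest_dir, suffixes=(".mp3", ".pdf", ".png"), converter_cmd=None):
      dest = Path(dest_dir)
      dest.mkdir(parents=True, exist_ok=True)
      with zipfile.ZipFile(jar_path) as jar:
          for name in jar.namelist():
              if name.endswith(suffixes):
                  target = dest / Path(name).name
                  target.write_bytes(jar.read(name))
                  if converter_cmd:  # optional external converter command
                      subprocess.run([*converter_cmd, str(target)], check=True)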

Post-capture processing

Applying customizations after capture is a fundamental part of an extensible data scheme, because customization scripts will be written against the storage system. Recapturing the page and then applying the script will not always work, because content may be only temporarily available or IP-restricted.
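
The following sketch shows one possible shape for such post-capture processing, again assuming the one-archive-per-page layout: a customization script is applied to the stored copy instead of a fresh download, and the result is written as a new archive. The transform callback is an assumption about what a script hook could look like.

  import zipfile

  # Post-capture processing sketch: apply a customization script to the
  # stored capture rather than to a re-downloaded page.
  def apply_script(jar_path, out_path, transform):
      with zipfile.ZipFile(jar_path) as src, \
           zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as dst:
          for name in src.namelist():
              data = src.read(name)
              if name == "index.html":
                  data = transform(data.decode("utf-8")).encode("utf-8")
              dst.writestr(name, data)

  # Example: hide asides without touching the network.
  # apply_script("capture-0001.jar", "capture-0001-clean.jar",
  #              lambda page: page.replace("<aside", "<aside hidden"))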