Extracting meta-data from pages

From MozillaWiki

Goal

Pancake currently looks only at page URLs and titles. Investigate what we can find out about pages by looking at their content. Examples are:

  • Page structure: headings, article text, etc.
  • Meta tags: icons, authors, etc.
  • Embedded microformats like recipes, contacts, geo-information, etc.
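
As a rough illustration of the three kinds of information listed above, the following is a minimal sketch using only Python's standard-library html.parser. It is not Pancake code: the names PageMetadataParser and extract_metadata are hypothetical, and the microformat check only looks for three well-known root class names (vcard for hCard, hrecipe for hRecipe, geo for geo), which is an assumption about what would be worth detecting first.

```python
from html.parser import HTMLParser


class PageMetadataParser(HTMLParser):
    """Hypothetical helper: collects headings, meta tags, icons, microformat hints."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    # Assumed subset of microformat root class names, mapped to a rough category.
    MICROFORMAT_ROOTS = {"vcard": "contact", "hrecipe": "recipe", "geo": "geo"}

    def __init__(self):
        super().__init__()
        self.headings = []         # list of (tag, text) in document order
        self.meta = {}             # lowercased meta name/property -> content
        self.icons = []            # href values of <link rel="... icon ...">
        self.microformats = set()  # categories found via root class names
        self._current_heading = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Any element may carry a microformat root class.
        for cls in (attrs.get("class") or "").split():
            if cls in self.MICROFORMAT_ROOTS:
                self.microformats.add(self.MICROFORMAT_ROOTS[cls])
        if tag in self.HEADING_TAGS:
            self._current_heading = tag
            self._buffer = []
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]
        elif tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            if "icon" in rel and attrs.get("href"):
                self.icons.append(attrs["href"])

    def handle_data(self, data):
        # Only text inside an open heading is kept.
        if self._current_heading is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current_heading:
            # Collapse runs of whitespace in the heading text.
            text = " ".join("".join(self._buffer).split())
            self.headings.append((tag, text))
            self._current_heading = None


def extract_metadata(html):
    """Parse an HTML string and return the collected fields as a dict."""
    parser = PageMetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "headings": parser.headings,
        "meta": parser.meta,
        "icons": parser.icons,
        "microformats": sorted(parser.microformats),
    }
```

Calling extract_metadata(page_source) on a fetched page returns a plain dict, which keeps the sketch easy to inspect by hand. Article-text extraction is deliberately left out, since it needs heuristics well beyond tag matching.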

We should find out how easy it is to find and extract this information, and whether enough pages carry useful information for us to do something with it.
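
One way to approach the "enough pages" question is to run an extractor over a sample of pages and count how often each kind of information turns up. The sketch below assumes each page has already been reduced to a dict of extracted fields; the function name coverage and the field names in the usage example are hypothetical.

```python
def coverage(results):
    """Return, per field, the fraction of pages with a non-empty value.

    `results` is a list of dicts, one per page, mapping a field name
    (for example "headings" or "meta") to whatever was extracted.
    """
    total = len(results)
    if total == 0:
        return {}
    fields = sorted({key for page in results for key in page})
    return {
        field: sum(1 for page in results if page.get(field)) / total
        for field in fields
    }


# Usage with two made-up pages:
pages = [
    {"headings": [("h1", "Title")], "meta": {}, "icons": []},
    {"headings": [], "meta": {"author": "Jane Doe"}, "icons": []},
]
print(coverage(pages))
```

What counts as a big enough share is a product decision that the numbers alone do not settle.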

We should also work out how to use the extracted information for generic results, and possibly for very domain-specific results such as "people", "recipes", and "locations".