Support:GSOC Project Scope and Timeline

From MozillaWiki

Timeline

June 25: end of term for GSOC student

June 25: reconnect, follow-up meeting

June 26 - June 28: Install Sphinx and SUMO on the development server.

June 29 - July 4: Develop indexing engine

July 7 - July 10: Develop filtering and weighting engine

July 11 - July 15: Develop search component and search UI

July 16 - July 23: Develop fudge factor improvements

July 24 - July 31: Refinements

Aug 4 - Aug 20: Load testing, UI improvement, caching (these are considered outside the GSOC scope)

Components

Indexing Engine

Sphinx-based. Triggered as a batch job; accesses the Tiki database directly.

Filtering and Weighting Engine

Extended tables based on Sphinx, with a custom UI for admins to add and remove weights.

Weights stored with index for performance reasons.

Search component

Replaces Tiki lib/searchlib.php. Searches the index and returns results based on the given parameters.

Search UI

Replaces Tiki tiki-searchindex.php and tiki-searchindex.tpl to provide the front-end search UI.

Scope

Source Data

Data will come from the knowledge base (kb) and the forums. The system will be extensible to other Tiki features, but this project will only cover the kb and forums.

Filtering

Search results will be filterable by:

  • source (kb vs. forums)
  • forum thread state (e.g. only forum threads that are answered)
  • article type (help vs. troubleshooting)
  • category
  • author of the article
  • contributors to the forum thread
  • freshness of data (last modified date for wiki pages, last post date for forum threads)

Filtering information will be part of the index, to improve performance.
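
As a rough illustration of filtering on attributes stored alongside each index entry, here is a minimal Python sketch. It does not use Sphinx; the `IndexEntry` fields, the `filter_entries` function, and the sample data are all hypothetical names invented for this example, and only a few of the filters listed above are shown.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical index record carrying its filter attributes."""
    doc_id: int
    source: str                       # "kb" or "forum"
    category: str
    author: str
    last_modified: date               # last modified (kb) or last post (forum)
    answered: Optional[bool] = None   # only meaningful for forum threads


def filter_entries(entries, source=None, category=None, author=None,
                   answered=None, modified_since=None):
    """Return the entries that satisfy every filter that is not None."""
    matches = []
    for entry in entries:
        if source is not None and entry.source != source:
            continue
        if category is not None and entry.category != category:
            continue
        if author is not None and entry.author != author:
            continue
        if answered is not None and entry.answered != answered:
            continue
        if modified_since is not None and entry.last_modified < modified_since:
            continue
        matches.append(entry)
    return matches


# Made-up sample data for illustration only.
ENTRIES = [
    IndexEntry(1, "kb", "bookmarks", "alice", date(2020, 6, 1)),
    IndexEntry(2, "forum", "bookmarks", "bob", date(2020, 7, 2), answered=True),
    IndexEntry(3, "forum", "privacy", "carol", date(2020, 7, 10), answered=False),
]

answered_threads = filter_entries(ENTRIES, source="forum", answered=True)
```

Because every attribute lives on the entry itself, no lookup against the Tiki database is needed at query time, which is the performance argument made above.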

Localization

This is just another type of filtering.

Locale information will be in the index as well.

Searching for "translations of search terms" is beyond the scope of this project.

Returning search results that include translations based on a user-defined fallback is beyond the scope of this project.

Weighting

Weighting will be based on:

  • source type (article vs. forum)
  • field within each source type (each field can be weighted, e.g. title vs. description)
  • existence of the search term in freetags
  • poll results

Weighting info will be stored in the index for performance reasons.
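
One possible way to combine these weighting inputs into a single score is sketched below in Python. The formula, the weight values, and the field names (`title`, `description`, `body`, `freetags`, `poll_score`) are assumptions made for illustration; the project does not specify how the factors are combined, and Sphinx's own ranking is not modelled here.

```python
# Hypothetical weights; in the planned system an admin would edit these.
SOURCE_WEIGHTS = {"kb": 2.0, "forum": 1.0}
FIELD_WEIGHTS = {"title": 3.0, "description": 1.5, "body": 1.0}
FREETAG_BONUS = 2.0


def score(doc, term):
    """Score one document for one search term (illustrative formula only)."""
    term = term.lower()
    # Add the weight of every field that contains the term.
    field_score = sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if term in doc.get(field, "").lower()
    )
    # Bonus when the term is also one of the document's freetags.
    if term in (tag.lower() for tag in doc.get("freetags", [])):
        field_score += FREETAG_BONUS
    # Assumed: poll_score is a 0..1 "was this helpful" ratio kept in the index.
    poll = doc.get("poll_score", 0.0)
    return SOURCE_WEIGHTS[doc["source"]] * field_score * (1.0 + poll)


article = {
    "source": "kb",
    "title": "Import bookmarks",
    "description": "How to import",
    "body": "",
    "freetags": ["bookmarks"],
    "poll_score": 0.5,
}
article_score = score(article, "bookmarks")
```

Keeping the weights in plain lookup tables mirrors the idea of an admin-editable weighting table stored with the index.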

Indexing

This will be a batch job.

The last modified date of an article could be used as a means to speed up indexing (avoiding unnecessary reindexing).

Indexed content should not include Tiki syntax.
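
The two indexing points above can be sketched in Python as follows. This is only an illustration under stated assumptions: `docs_to_reindex` and `strip_tiki_syntax` are hypothetical helpers, the document dictionaries are made up, and the regular expressions cover just a small subset of Tiki markup (`((page|label))` links, `__bold__`, `''italic''`, and `!` headings), not the full syntax.

```python
import re
from datetime import datetime


def docs_to_reindex(docs, last_run):
    """Keep only documents modified since the previous batch run."""
    return [doc for doc in docs if doc["last_modified"] > last_run]


def strip_tiki_syntax(text):
    """Remove a small, incomplete subset of Tiki markup from text."""
    # ((Page|label)) -> label, ((Page)) -> Page
    text = re.sub(
        r"\(\(([^)|]+)(?:\|([^)]+))?\)\)",
        lambda m: m.group(2) or m.group(1),
        text,
    )
    text = re.sub(r"__(.+?)__", r"\1", text)      # __bold__
    text = re.sub(r"''(.+?)''", r"\1", text)      # ''italic''
    text = re.sub(r"^!+[ \t]*", "", text, flags=re.MULTILINE)  # !Heading
    return text


# Made-up sample data for illustration only.
DOCS = [
    {"id": 1, "last_modified": datetime(2020, 7, 1, 12, 0), "body": "!Old page"},
    {"id": 2, "last_modified": datetime(2020, 7, 9, 8, 30),
     "body": "!Import\n__Bold__ and ((Bookmarks|your bookmarks))"},
]

pending = docs_to_reindex(DOCS, last_run=datetime(2020, 7, 5))
clean_bodies = {doc["id"]: strip_tiki_syntax(doc["body"]) for doc in pending}
```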

Searching

Need to support boolean logic when searching for search terms: OR, AND, NOT.
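
To make the boolean requirement concrete, here is a minimal Python sketch that evaluates AND, OR, and NOT as set operations over a toy inverted index. It is not how Sphinx implements boolean matching; `build_index`, `search`, and the parameter names `must`, `should`, and `must_not` are hypothetical, and the documents are invented.

```python
def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, all_ids, must=(), should=(), must_not=()):
    """AND over `must`, OR over `should`, NOT over `must_not`."""
    results = set(all_ids)
    for term in must:                      # AND: intersect
        results &= index.get(term.lower(), set())
    if should:                             # OR: union, then intersect
        any_match = set()
        for term in should:
            any_match |= index.get(term.lower(), set())
        results &= any_match
    for term in must_not:                  # NOT: subtract
        results -= index.get(term.lower(), set())
    return results


# Made-up sample data for illustration only.
DOCS = {
    1: "import bookmarks from safari",
    2: "export bookmarks to file",
    3: "clear browsing history",
}
INDEX = build_index(DOCS)

bookmarks_not_export = search(INDEX, DOCS, must=["bookmarks"], must_not=["export"])
history_or_export = search(INDEX, DOCS, should=["history", "export"])
```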

Caching of search results

Needs to be done, but is not part of the GSOC project; to be scheduled separately.

Fudge factor

Handle spelling errors ("did you mean...").

Handle synonyms (searching for "favorites" also searches for "bookmarks").

Ignore locale-specific common words ("the", "a", "Firefox"). This will be limited to English for the scope of this project, but will be extensible.
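
The three fudge-factor behaviours can be sketched with the Python standard library as below. This is an illustration only: the word lists are tiny made-up samples, `normalize_query` and `did_you_mean` are hypothetical helpers, and `difflib` similarity matching stands in for whatever spelling-suggestion technique the project ends up using.

```python
import difflib

# Tiny made-up English word lists for illustration.
STOP_WORDS = {"the", "a", "firefox"}
SYNONYMS = {"favorites": ["bookmarks"]}
VOCABULARY = ["bookmarks", "history", "passwords", "toolbar"]


def normalize_query(query):
    """Drop common words and expand each remaining term with its synonyms."""
    terms = []
    for word in query.lower().split():
        if word in STOP_WORDS:
            continue
        terms.append(word)
        terms.extend(SYNONYMS.get(word, []))
    return terms


def did_you_mean(word):
    """Suggest the closest known word, or None if the word is already known
    or nothing is similar enough."""
    word = word.lower()
    if word in VOCABULARY:
        return None
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.75)
    return matches[0] if matches else None


expanded = normalize_query("the Firefox favorites")
suggestion = did_you_mean("bookmraks")
```

Keeping the stop words and synonyms in per-locale tables is one way the English-only lists could later be extended to other locales.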

Display of search results

Show the title of the page and the first paragraph (in practice, the description field). Showing the text surrounding the matched terms is not part of this project.

Display results as plain text without Tiki formatting (the description field will not have Tiki formatting).

Data shown about the article, such as poll results, will be based on info in the index only, to improve performance.
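
A result built purely from index fields might be rendered as in the following Python sketch. The entry layout, the `poll_helpful_pct` field, and the wording of the poll line are assumptions for illustration; the real search UI is a Tiki template, not Python.

```python
def format_result(entry):
    """Render one search result as plain text using index fields only."""
    lines = [entry["title"], entry["description"]]
    # Assumed: the index stores poll results as a whole-number percentage.
    poll = entry.get("poll_helpful_pct")
    if poll is not None:
        lines.append(f"Found helpful by {poll}% of voters")
    return "\n".join(lines)


# Made-up index entry for illustration only.
entry = {
    "title": "Import bookmarks",
    "description": "How to bring bookmarks over from another browser.",
    "poll_helpful_pct": 82,
}
rendered = format_result(entry)
```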

“More like this” is a separate feature and should be considered out of scope for this project.