CloudServices/Sync/FxSync/Syncorro

From MozillaWiki

Revision as of 06:36, 28 July 2011

Socorro + Sync = Syncorro \o/

People

  • Client engineering: Marina Samuel, Philipp von Weitershausen
  • Server engineering: XXX
  • Metrics: Daniel Einspanjer, Xavier Stevens
  • Product: Jennifer Arguello

Goals

  • Gather statistics on errors (to help with prioritization)
  • Be able to correlate errors with maintenance windows, user profiles, etc.
  • Simplify error reporting for users who file bugs or SUMO articles
  • Detect the "long tail" of problems that are never filed

Features

  • Each submitted report should be represented by a URL or at least an opaque token (e.g. UUID)
  • Ability to query by application-, Sync-, and error-specific metadata
  • Fulltext search over submitted log data
  • Ability to return instructions to the client upon report submission (e.g. throttling, recovery, support messages for the user)

Roadmap

  • Discuss goals and features with metrics (DONE)
  • Discuss UI mockups with UX
  • Add ability to upload Syncorro data to ElasticSearch (see bug 673318)
  • Build add-on for the Services Beta Channel

Client UX

Submitting an error report

  • When there's a Sync error, the usual error bar is shown (but without the dreaded "Unknown Error" message):
 -------------------------------------------------------------------------
 | We're sorry, Sync encountered a problem [Details ...]              (X)|
 -------------------------------------------------------------------------
  • Clicking the Details button dismisses the bar and opens a tab with a high-level explanation of the error:
 ------------------------------------------------------------------------
 | Details for Sync problem on Tuesday, May 1, 2011 5:59 pm             |
 | ========================================================             |
 |                                                                      |
 | There was a problem saving the "BBC News - World" bookmark to your   |
 | computer. Other data is not affected.                                |
 |                                                                      |
 | To help Mozilla improve Sync and prevent errors like this in the     |
 | future, please submit this report. Your personal data will not be    |
 | submitted.                                                           |
 |                                                                      |
 | [X] Automatically submit reports in the future.                      |
 |                                                                      |
 |                                                     [Submit report]  |
 |                                                                      |
 | > Full report                                                        |
 |                                                                      |
 ------------------------------------------------------------------------
  • Pressing the Submit report button will submit the report. Once the report is submitted, a link to the report on the server is displayed:
 ------------------------------------------------------------------------
 | Details for Sync problem on Tuesday, May 1, 2011 5:59 pm             |
 | ========================================================             |
 |                                                                      |
 | There was a problem saving the "BBC News - World" bookmark to your   |
 | computer. Other data is not affected.                                |
 |                                                                      |
 | Firefox submitted a report of the problem to Mozilla.                |
 |                                                                      |
 | [X] Automatically submit reports in the future.                      |
 |                                                                      |
 | > Full report                                                        |
 |                                                                      |
 ------------------------------------------------------------------------
  • If the Syncorro server finds a suitable support page, the details page links to it:
 ------------------------------------------------------------------------
 | Details for Sync problem on Tuesday, May 1, 2011 5:59 pm             |
 | ========================================================             |
 |                                                                      |
 | There was a problem saving the "BBC News - World" bookmark to your   |
 | computer. Other data is not affected.                                |
 |                                                                      |
 | Good news! Firefox submitted a report of the problem to Mozilla and  |
 | a possible solution was found. _View_support_page_                   |
 |                                                                      |
 | [X] Automatically submit reports in the future.                      |
 |                                                                      |
 | > Full report                                                        |
 |                                                                      |
 ------------------------------------------------------------------------
  • Clicking the arrow next to Full report shows all information that will be or has been submitted. Since there is typically a lot of it, it is itself divided into collapsible sections:
 ------------------------------------------------------------------------
 | Details for Sync problem on Tuesday, May 1, 2011 5:59 pm             |
 | ========================================================             |
 |                                                                      |
 | There was a problem saving the "BBC News - World" bookmark to your   |
 | computer. Other data is not affected.                                |
 |                                                                      |
 | Firefox submitted a report of the problem to Mozilla.                |
 |                                                                      |
 | [X] Automatically submit reports in the future.                      |
 |                                                                      |
 | \/ Full report                                                       |
 |                                                                      |
 |    Report ID: {UUID}  [Copy to clipboard]                            |
 |                                                                      |
 |    > Application details                                             |
 |    > Sync account info                                               |
 |    > Error fingerprint                                               |
 |    > Log                                                             |
 |                                                                      |
 ------------------------------------------------------------------------
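The details page above has three visible states: report not yet submitted, report submitted, and report submitted with a matching support page. The Python sketch below is a minimal, hypothetical illustration of how a client might pick the status text from the upload result. The support page would come from the `infoURL` hint in the server response described under Client Implementation. `SubmissionResult`, `status_message`, and `show_submit_button` are names invented for this example and are not part of Sync.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubmissionResult:
    """Hypothetical client-side record of one report's upload state."""
    submitted: bool
    info_url: Optional[str] = None  # support page returned by the server, if any


def status_message(result: SubmissionResult) -> str:
    """Pick the status paragraph shown on the details page (texts from the mockups)."""
    if not result.submitted:
        return ("To help Mozilla improve Sync and prevent errors like this in the "
                "future, please submit this report. Your personal data will not be "
                "submitted.")
    if result.info_url:
        return ("Good news! Firefox submitted a report of the problem to Mozilla "
                "and a possible solution was found. View support page: "
                + result.info_url)
    return "Firefox submitted a report of the problem to Mozilla."


def show_submit_button(result: SubmissionResult) -> bool:
    # In the mockups, [Submit report] only appears before the report is sent.
    return not result.submitted
```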

Looking up error reports

  • Make about:sync-log look like about:crashes, with each entry linking to its details page as described in the previous section.
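The following is a minimal sketch of that listing. It assumes each locally stored report keeps its ID, the time of the error in seconds since the epoch, and whether it was uploaded. `LocalReport`, `list_entries`, and the `about:sync-log?report=` link format are all hypothetical. The sketch only illustrates the about:crashes-style layout: newest first, one line per report, each line linking to that report's details page.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List

# Hypothetical URL scheme for a report's details page.
DETAILS_URL = "about:sync-log?report="


@dataclass
class LocalReport:
    report_id: str     # e.g. a UUID string
    timestamp: float   # time of the error, seconds since the epoch (assumed unit)
    submitted: bool    # whether the upload succeeded


def list_entries(reports: List[LocalReport]) -> List[str]:
    """Render one line per report, newest first, like about:crashes."""
    lines = []
    for r in sorted(reports, key=lambda r: r.timestamp, reverse=True):
        when = datetime.fromtimestamp(r.timestamp, tz=timezone.utc).strftime("%Y-%m-%d %H:%M")
        state = "submitted" if r.submitted else "not submitted"
        lines.append(f"{when}  {state:<13}  {DETAILS_URL}{r.report_id}")
    return lines
```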

Client Implementation

Note: This is only a draft that is being fleshed out.

  • Use the Metrics team's ElasticSearch system (also used for AMO stats and Socorro) at data.mozilla.org
  • On error, Sync POSTs a payload to data.mozilla.org:
 POST /XXX HTTP/1.1
 Content-Type: application/json
 
 {
   "id": "{UUID}",
   "app": {
     "product": "{UUID}",
     "version": "8.0a1",
     "buildID": "...",
     "locale": "en-US",
     "addons": ["{UUID}", "{UUID}", "{UUID}", ...]
   },
   "sync": {
     "version": "1.10",
     "account": "eisklclxuauemrjghidis",
     "cluster": "https://phx-sync091.services.mozilla.com/",
     "engines": ["bookmarks", "history", ...],
     "numClients": 2,
     "mobileClients": true
   },
   "error": {
     "localTimestamp": 13294938593,
     "engine": "bookmarks",
     "result": 489294595  // the error constant, if applicable
   },
   "log": "..."
 }
  • Under normal conditions, the server returns HTTP 200 OK with optional hints for the client concerning throttling and help for the user:
 HTTP/1.1 200 OK
 Content-Type: application/json
 
 {
   "reportURL": "http://data.mozilla.org/syncorro/{UUID}",
   "throttle": 10,  // only submit every 10th error
   "infoURL": "http://support.mozilla.com/..."  // optional support page
 }
  • The server can also return other status codes to indicate that the data was not accepted:
    • 500 Internal Server Error
    • XXX throttled, try again later
    • XXX invalid data
  • If the client fails to upload the report (e.g. because of network connectivity problems or similar), it retries periodically using a backoff strategy. After a certain number of failures, the upload is marked as permanently failed and no further retries are attempted.
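The client-side rules above can be sketched in Python as follows. They are reading the `throttle` and `infoURL` hints, counting errors so that only every Nth one is submitted, and retrying failed uploads with backoff. This is an illustration under stated assumptions, not Sync's implementation. The response field names come from the example above. The retry constants (`MAX_ATTEMPTS`, `BASE_DELAY`, `MAX_DELAY`) and the doubling schedule are hypothetical placeholders, since the page does not specify them.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical retry policy; the page leaves these values open.
MAX_ATTEMPTS = 5           # give up after this many failed uploads
BASE_DELAY = 60.0          # seconds before the first retry
MAX_DELAY = 6 * 3600.0     # upper bound on any single delay


@dataclass
class SubmitResponse:
    report_url: str
    throttle: int = 1               # submit only every Nth error (1 = every error)
    info_url: Optional[str] = None  # optional support page for the user


def parse_response(body: str) -> SubmitResponse:
    """Parse the JSON body of a 200 OK reply (field names as in the example)."""
    data = json.loads(body)
    return SubmitResponse(
        report_url=data["reportURL"],
        throttle=max(1, int(data.get("throttle", 1))),
        info_url=data.get("infoURL"),
    )


class Throttle:
    """Lets one error in every `every` through to submission."""

    def __init__(self, every: int = 1) -> None:
        self.every = every
        self.seen = 0

    def should_submit(self) -> bool:
        self.seen += 1
        # Errors 1, 1 + every, 1 + 2*every, ... are submitted.
        return (self.seen - 1) % self.every == 0


def retry_delay(failures: int) -> Optional[float]:
    """Seconds to wait after `failures` failed uploads (failures >= 1),
    or None once the upload counts as permanently failed."""
    if failures >= MAX_ATTEMPTS:
        return None
    return min(MAX_DELAY, BASE_DELAY * 2 ** (failures - 1))
```

With `throttle: 10`, the counter lets through the 1st, 11th, 21st error and so on. Adding random jitter to the delay would keep many clients from retrying in lockstep after a server outage.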

Dashboard implementation

  • Graph of number of reports over time (potentially being able to split by certain metadata, e.g. product version, Sync node, etc.)
  • Query by metadata
  • Fulltext search over logs
  • Define SUMO pages for percolator matches
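Here is a rough Python sketch of two dashboard operations. The first counts reports per day, optionally split by one metadata field such as `app.version` or `sync.cluster`. The second maps a report to a SUMO page. The matcher is a plain-Python stand-in for ElasticSearch's percolator and does not use its API: each rule is a set of field/value conditions paired with a support page URL. The `received` field (server receive time in seconds since the epoch) is an assumption, because the payload above does not define one. All function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Optional, Tuple


def lookup(report: dict, dotted: str):
    """Fetch a nested field such as 'app.version'; None if any part is missing."""
    value = report
    for part in dotted.split("."):
        value = value.get(part) if isinstance(value, dict) else None
    return value


def counts_over_time(reports: Iterable[dict], split_by: Optional[str] = None) -> Counter:
    """Count reports per UTC day, optionally split by one metadata field."""
    counts: Counter = Counter()
    for r in reports:
        # 'received' is a hypothetical server-side timestamp field.
        day = datetime.fromtimestamp(r["received"], tz=timezone.utc).date().isoformat()
        key = (day, lookup(r, split_by)) if split_by else (day,)
        counts[key] += 1
    return counts


def match_support_page(report: dict,
                       rules: List[Tuple[Dict[str, object], str]]) -> Optional[str]:
    """Return the URL of the first rule whose conditions all hold, else None."""
    for conditions, url in rules:
        if all(lookup(report, field) == value for field, value in conditions.items()):
            return url
    return None
```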

TODO details (talk to ddash, jbalogh)

Questions

  • Reports will probably have to be non-public for now, though it would be nice if users could view their own submitted reports. Can we do some sort of token-based auth there?
  • Will this service require ToS changes?
  • What do we do with custom server users?
  • What do we do when the user has Trace logging enabled?

Discussion

Tentatively identified as not in scope for v1

  • Ops paging/integration for events. A large spike in failures could be either a new client error or a server or operational issue, and that's info that we might want to leverage. Best to leave this until we know what we're doing.