Marketplace/FailureModes

From MozillaWiki
The Marketplace has been placed into maintenance mode. It is no longer under active development. You can read complete details here.

Overall Summary

Most of the Marketplace behaves like an ordinary website and should have expected uptimes commensurate with one.

Additionally, there are several services associated with the Marketplace. For those, short-duration failures (< 1 hour) are likely to go unnoticed and have no impact on the user.

However, there are a few points that have implications beyond a user being unable to access the website and will impact the user directly. These are:

  • App receipt generation
  • App receipt verification
  • Payment (deferred)

Below, we will look at the technical components of the Marketplace, then break it down into sections and see how failure of the various components will impact user experience. At this stage, the Marketplace is treated as a monolithic project; as things progress, the goal is to move to a more Services-Oriented Architecture, at which time the pieces can be moved into their own pages for failure analysis.

At present, Marketplace exists in one colo. Since the goal is to have it installed in multiple colos, places where replication lag may affect user experience are also noted below.


Component Details

Network

There's an implicit Network that can fail - if we can't get to the server, we can't get content. The result in all situations is basically the same as when the rest of the system fails, though it's not necessarily a failure on our end (and many users will not know whether it is or not, thanks to phone flakiness).

Zeus

Zeus is used as a load balancer and for caching a few very-high-traffic pages. Our ops team has a love/hate relationship with it and is looking at other solutions.

Memcache

Memcache does a lot of DB object caching throughout the system. If memcache goes down, it's likely that the DBs would be rapidly overwhelmed due to the sudden influx of queries.
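The load shift described above can be illustrated with a minimal sketch of a cache-aside read. Everything named here (FakeCache, FakeDatabase, get_app, CacheDown) is a hypothetical stand-in and not the Marketplace's actual code; the point is only that every read falls through to the database once the cache is unreachable.

```python
class CacheDown(Exception):
    """Raised by the hypothetical cache client when memcache is unreachable."""


class FakeCache:
    """In-memory stand-in for a memcache client (hypothetical interface)."""

    def __init__(self):
        self.store = {}
        self.available = True

    def get(self, key):
        if not self.available:
            raise CacheDown(key)
        return self.store.get(key)

    def set(self, key, value):
        if not self.available:
            raise CacheDown(key)
        self.store[key] = value


class FakeDatabase:
    """Counts queries so the change in load is visible."""

    def __init__(self):
        self.queries = 0

    def load_app(self, app_id):
        self.queries += 1
        return {"id": app_id, "name": "app-%d" % app_id}


def get_app(app_id, cache, db):
    """Cache-aside read: try the cache first, fall back to the database."""
    key = "app:%d" % app_id
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheDown:
        pass  # treat a cache outage as a miss
    row = db.load_app(app_id)
    try:
        cache.set(key, row)
    except CacheDown:
        pass  # cannot repopulate the cache while it is down
    return row


cache, db = FakeCache(), FakeDatabase()
for _ in range(100):
    get_app(1, cache, db)
queries_with_cache = db.queries  # only the first read reaches the database

cache.available = False
for _ in range(100):
    get_app(1, cache, db)
queries_without_cache = db.queries - queries_with_cache  # every read does
```

With the cache up, 100 reads cost one query; with it down, the same 100 reads cost 100 queries.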

Database

Heavily normalized MySQL. DBs are at the core of all the data, including translations for the various pages. They are currently not replicated to different colos.

Elastic Search

Elastic Search handles searches across the apps; the results are then filled in through the object cache.

Persona

Persona is the '3rd party' login solution for identifying yourself to the Marketplace. While it is a Mozilla product, it will be used in areas well beyond the Marketplace and should be treated separately.

If Persona is unavailable, the site should continue to work for browsing. However, anything that requires user identity - notably app receipt verification - will fail.
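As a rough sketch of that degradation, a request handler could refuse only the actions that need a verified identity and let everything else through. The action names and the handle function are hypothetical; only receipt verification is named by the text above, and "post_review" is included purely as an illustration.

```python
# Hypothetical action names. Only receipt verification is named in the text;
# "post_review" is an assumed example of another identity-bound action.
REQUIRES_IDENTITY = {"verify_receipt", "post_review"}


def handle(action, persona_available):
    """Return (status_code, message) for a request."""
    if action in REQUIRES_IDENTITY and not persona_available:
        return 503, "identity service unavailable, try again later"
    return 200, "ok"


browse_result = handle("browse_category", persona_available=False)
verify_result = handle("verify_receipt", persona_available=False)
```

Here browsing still succeeds during a Persona outage, while receipt verification fails with a retryable error.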


Section Summary and Failure Modes

Discovery Pages

Failure Points

  • Zeus cache fails
  • Object cache fails

User Visibility

High. This is a high-traffic page, as it's the entry point for browsers. A failure will show up as soon as a user visits the Marketplace in the browser, and the user will likely be unable to continue.

Notes

Content is basically static, so it's cached by Zeus.


Homepage/Category Pages

Failure Points

  • Object cache fails
  • Elastic Search unavailable

User Visibility

High. Problems here mean that the user will probably not be able to progress through the site. However, it will not break any app functionality.

Notes

These pages are pretty static and built from the object cache plus some Elastic Search.

App Pages

Failure Points

  • App DB fails
  • Review DB fails
  • Ratings DB fails
  • Object cache fails
  • Reviews/ratings replication falls behind

User Visibility

High for popular apps. Users won't be able to install an app if they can't get to it. Users will usually not care if ratings and reviews are temporarily unavailable. The potential exception to this is if a user posts a review and it doesn't show up immediately.

Notes

Receipt Verification

Failure Points

  • Purchase Database unavailable or corrupted
  • Purchase Database behind on replication
  • Signing keys unavailable
  • Signing service unavailable (for receipt updates)

User Visibility

Apps will fail to work. High user visibility and inconvenience.

Notes

Does not apply to free apps, as they do no receipt checks.


Receipt Signing

Failure Points

  • Expired key from HSM server

User Visibility

Users will be unable to purchase an app.

Notes

Signing a receipt should be done before charging, so that a failure in the signing flow cannot leave a user billed without a receipt.
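The ordering in this note can be sketched as follows. The purchase function, the SigningError exception and the signer and charge callables are all hypothetical names for illustration, not the Marketplace's real interfaces.

```python
class SigningError(Exception):
    """Raised by the hypothetical signer, e.g. when the key has expired."""


def purchase(user, app, sign_receipt, charge):
    """Sign first, charge second; return the receipt, or None on failure."""
    try:
        receipt = sign_receipt(user, app)
    except SigningError:
        return None  # nothing has been charged at this point
    charge(user, app)
    return receipt


charges = []


def record_charge(user, app):
    charges.append((user, app))


def expired_signer(user, app):
    raise SigningError("key expired")


def working_signer(user, app):
    return "receipt:%s:%s" % (user, app)


failed = purchase("alice", "app-1", expired_signer, record_charge)
charges_after_failure = list(charges)  # still empty: the user was not billed
succeeded = purchase("alice", "app-1", working_signer, record_charge)
```

This sketch only covers a signing failure. The reverse case, a signed receipt whose charge then fails, would need separate handling.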


Version Check

Failure Points

  • Version Database corrupted or unavailable
  • Version Database replication behind
  • Stale feed into database (not currently, but several models will have this)

User Visibility

Minimal. Failure should produce "no update" to the user, and impact is minimal. They'll pick it up on the next check, and the delay is not important.
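A minimal sketch of that behaviour, assuming versions are comparable tuples and that the lookup raises a dedicated error when the version database is unavailable (check_for_update, VersionStoreError and the lookup callables are hypothetical names):

```python
class VersionStoreError(Exception):
    """Raised by the hypothetical lookup when the version DB is unavailable."""


def check_for_update(app_id, installed, lookup_latest):
    """Return the newer version tuple, or None meaning "no update"."""
    try:
        latest = lookup_latest(app_id)
    except VersionStoreError:
        return None  # fail quietly; the client retries on its next check
    return latest if latest > installed else None


def healthy_lookup(app_id):
    return (1, 2, 0)


def broken_lookup(app_id):
    raise VersionStoreError("version database unavailable")


update_when_healthy = check_for_update("app-1", (1, 1, 0), healthy_lookup)
update_when_broken = check_for_update("app-1", (1, 1, 0), broken_lookup)
update_when_current = check_for_update("app-1", (1, 2, 0), healthy_lookup)
```

From the client's point of view, an outage is indistinguishable from being up to date, which is what keeps the user impact minimal.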

Notes

Operational costs here make this a good area for serious examination. There are probably relatively easy wins here, and eventually we might look to improve the FF API itself.


Payments

Not applicable in the current version. The process is entirely handled by BlueVia.


Blocklist

Failure Points

  • Generation of static file fails

User Visibility

No user visibility, as a failure will just cause clients to not update the blocklist. However, because of the nature of items on the blocklist, delays or erroneous content can have security consequences.
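One way to keep a failed generation run from publishing erroneous content is to write to a temporary file and swap it into place only on success. The sketch below assumes a JSON file format and a publish_blocklist helper, both of which are illustrative assumptions and not the actual blocklist format or tooling.

```python
import json
import os
import tempfile


def publish_blocklist(entries, path):
    """Write the blocklist to a temp file, then atomically swap it in.

    If generation fails, the previously published file is left untouched.
    The JSON layout is an assumption made for this sketch.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump({"blocked": sorted(entries)}, tmp)
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # discard the partial output
        raise


workdir = tempfile.mkdtemp()
blocklist_path = os.path.join(workdir, "blocklist.json")
publish_blocklist(["bad-app"], blocklist_path)

try:
    # object() cannot be serialized, which simulates a generation failure.
    publish_blocklist([object()], blocklist_path)
    generation_failed = False
except TypeError:
    generation_failed = True

with open(blocklist_path) as f:
    served = json.load(f)  # still the last good blocklist
```

This addresses erroneous content only. A generator that keeps failing still leaves clients with a stale list, so the failure itself should be monitored.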

Notes

App Install Process

Failure Points

  • Payment processing failure
  • Receipt Signing failure
  • App download inaccessible
  • Fail to write purchase into DB

User Visibility

Attempted purchases of apps will fail. Whether that's more than a minor inconvenience for the user depends on whether we're past the payment process. Failing to write a purchase after it has been made is very bad and needs a lot of logging.

Notes

In general, we'll want a ton of logging throughout this process and a good interface to it so that we can track down reported issues.
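As a sketch of what such logging might look like, each step of the install flow can log its outcome under a shared transaction id, so a reported issue can be traced end to end. The step names, the install_app function and the log line format are hypothetical; only Python's standard logging and uuid modules are used.

```python
import logging
import uuid


class ListHandler(logging.Handler):
    """Collects formatted log lines in memory so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


def install_app(user, app, steps, logger):
    """Run each (name, callable) step in order, logging every outcome.

    Returns (transaction_id, name_of_failed_step_or_None).
    """
    txn = uuid.uuid4().hex
    for name, step in steps:
        try:
            step(user, app)
        except Exception as exc:
            logger.error("txn=%s user=%s app=%s step=%s status=failed error=%s",
                         txn, user, app, name, exc)
            return txn, name
        logger.info("txn=%s user=%s app=%s step=%s status=ok",
                    txn, user, app, name)
    return txn, None


def ok(user, app):
    pass


def db_write_fails(user, app):
    raise IOError("purchase table unavailable")


handler = ListHandler()
logger = logging.getLogger("marketplace.install.sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

steps = [("sign_receipt", ok), ("payment", ok),
         ("write_purchase", db_write_fails)]
txn, failed_step = install_app("alice", "app-1", steps, logger)
```

Searching the collected lines for the transaction id shows that signing and payment succeeded before the purchase write failed, which is the case flagged above as needing the most attention.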

If we have a record of a user paying for an app, can they redownload it at will?


App Search

Failure Points

  • Elastic Search unavailable
  • Feed for Elastic Search behind
  • Memcache layer unavailable

User Visibility

Search results will be unavailable for a period of time.

Notes

Developer Workflow

Failure Points

  • Login failure (see above)
  • Application DB failure
  • Metrics compilation failure

User Visibility

Very low. Downtime here is unlikely to affect site use; it represents a mild inconvenience for the app developer. Accuracy of usage statistics is important, however, as it will correlate back to the total number of receipts. If that doesn't match expected values, people will notice!

Notes