Marketplace/ComponentSLA
Availability Tiers
There are three tiers of service within the infrastructure:
Tier 1 - Contracted Availability
These services/workflows have been defined as mission-critical, and have explicitly set levels of availability or maximum durations for processing. As the specific requirements for each piece will be unique, they are discussed individually below. They are expected to be included in an MOU.
These pieces of the site must have:
- Installations in multiple datacenters
- Automated failover on error detection
- Replicated data
Tier 2 - High Availability
Most of the systems fall into this category. They are monitored closely and are expected to be up. However, there is no contract specifying a value for this, and it is also acceptable for the system to be down for planned maintenance.
Internally, there may be differences in expectations for components in the infrastructure, but these will be reflected in the development resources assigned and the exact monitoring details for each component.
Tier 3 - Best Effort
These systems are low-priority, usually documentation, forums or read-only sections. While we will do our best to keep them up, there won't be extensive effort through redundancy or geodistribution to make them High Availability. Even with these qualifications, we still expect them to be up over 98% of the time.
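To make these percentages concrete, the following sketch converts an availability target into a downtime budget. The 30-day measurement window and the function name are assumptions for illustration only; this page does not define how availability is measured.

```python
def allowed_downtime_hours(availability_pct: float, period_days: float = 30.0) -> float:
    """Return the hours of downtime permitted over the period at the given availability.

    The 30-day default window is an assumption, not something this page specifies.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_days * 24.0 * (1.0 - availability_pct / 100.0)


# 98% is the Tier 3 expectation; 99.5% is the figure used for Tier 1 components.
for target in (98.0, 99.5):
    hours = allowed_downtime_hours(target)
    print(f"{target}% over 30 days allows about {hours:.1f} hours of downtime")
```

Under that assumed window, 98% leaves roughly 14.4 hours of downtime per month, while 99.5% leaves roughly 3.6 hours.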
Specific Tier 1 System Components
These components have been identified as critical pieces of the infrastructure. Note that this does not mean that they will be up 100% of the time, but that there is a determined minimum, and that not having them available will either cause economic harm or need to be compensated for by the client pieces of the system.
Application Download
Once a user has paid for an application, they expect to be able to download it immediately. This is especially important for regions where connectivity may not be constant.
Flow
The user has paid for an application and is then directed to download it from a static location.
Failure Scenarios
The static location is not available
Alternate Paths
Guarantees
Download availability: 99.5% (may be higher with some solutions)
Dev Plan
Hosting all Application bundles on Akamai is a straightforward solution to this and already has a lot of the infrastructure in place.
Open Questions
Application Removal
Sometimes, an application needs to be removed from the Marketplace. This may be due to a dangerous coding error, a security issue, or legal concerns. Both the owner of the application and administrators of this system need access to this capability.
Once the application has been removed, it must disappear from category pages, search results, and its own app page within a defined time frame.
Flow
The user logs on and selects the delete-now button. The item is removed from the database (cached elsewhere offline?) and affiliated searches. Caches involving it are flushed.
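The flow above can be sketched as an ordered list of removal steps whose failures are collected rather than aborting the run. Everything here is hypothetical: the function name, the step names, and the in-memory dict and set standing in for the real database, search index, and cache.

```python
def remove_application(app_id, steps):
    """Run every removal step for app_id and return a list of (step, error) failures.

    `steps` is an ordered list of (name, callable) pairs. A failing step does
    not stop the later ones, so a database error still lets the search index
    and caches be cleaned up.
    """
    failures = []
    for name, step in steps:
        try:
            step(app_id)
        except Exception as exc:  # broad on purpose: any failure needs escalation
            failures.append((name, repr(exc)))
    return failures


# In-memory stand-ins for the real stores (assumptions for illustration).
database = {"app-1": {"name": "Example App"}}
search_index = {"app-1", "app-2"}
page_cache = {"app-1": "<html>cached app page</html>"}

steps = [
    ("database", lambda app_id: database.pop(app_id)),   # raises KeyError if absent
    ("search", search_index.discard),                      # no error if absent
    ("cache", lambda app_id: page_cache.pop(app_id, None)),
]

failures = remove_application("app-1", steps)
if failures:
    print("Escalate to an admin:", failures)
else:
    print("app-1 removed from database, search, and cache")
```

Continuing past a failed step is a design choice in this sketch: it hides the application from users as far as possible while the failure is escalated through the admin channel.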
Failure Scenarios
- User cannot log in
- Database Error prevents deletion
- Caches do not flush
Alternate Paths
In the event that a user cannot remove an application, they should be presented with a hotline that lets them communicate with an admin to do the deletion. The admin channel should be separate from the user channel, and may have more direct access to the system, as well as the ability to manually flush items from the cache if needed.
Guarantees
Need a guarantee for big-red-button uptime: 99.5%
Need a guarantee for removal speed: 15 minutes
Dev Plan
Open Questions
- Will there be an admin on call 24/7?
- How does the user prove to the admin that they have the right to remove the app?
Identity
Do we handle this, or is this SLA the responsibility of the identity team?
The uptime here needs to be very high, because all of the other contracted pieces depend on it.
Receipt Signing
As part of the purchase process, receipts need to be signed using a two-tiered key system (where one key is the actual signing key and rotates, and one key is the internal master used to certify the rotating keys).
In theory, we will also need to sign packaged apps in this fashion, though exact details are still TBD.
Flow
A receipt is handed off to the service, which decomposes the tokens in the payload and uses them to generate a signed receipt. API details are at https://wiki.mozilla.org/Apps/WebApplicationReceipt/SigningService
Failure Scenarios
- Service inaccessible
- Corrupt/invalid receipt
- Signing Key is outdated
- Public Key unavailable
Alternate Paths
Guarantees
Without the service, receipts cannot be issued post-payment. Because this is a highly visible negative user experience, we need to make sure the service is up and responsive. 99.5% recommended. Going higher requires an escalation path that involves 24/7 teams.
Dev Plan
There is no database or state associated with this service, which substantially reduces the number of things that can go wrong. Receipt parsing failures cannot be handled by the signing service.
Signing should be available in multiple colos, with a fallback if the first one does not resolve. Outdated keys should be watched for - new keys are generated with plenty of ramp time, so if the replacement key is not in place, alerts should be going off.
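The key-age check described above might look like the following minimal sketch. The 7-day ramp window, the function name, and the alert wording are assumptions; the actual ramp time for new keys is not stated on this page.

```python
from datetime import datetime, timedelta, timezone

RAMP_TIME = timedelta(days=7)  # assumed ramp window, not a documented value


def key_rotation_alerts(current_key_expires, replacement_ready, now, ramp_time=RAMP_TIME):
    """Return a list of alert strings for the rotating signing key."""
    remaining = current_key_expires - now
    alerts = []
    if remaining <= timedelta(0):
        # The active signing key is already outdated.
        alerts.append("signing key has expired")
    elif remaining <= ramp_time and not replacement_ready:
        # Inside the ramp window with no replacement staged.
        alerts.append("replacement key not in place inside the ramp window")
    return alerts


now = datetime(2012, 6, 1, tzinfo=timezone.utc)
print(key_rotation_alerts(now + timedelta(days=3), replacement_ready=False, now=now))
```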
Public keys are static files, so we need to ensure they are in place, or redirecting properly. Can make use of a CDN, as these do not change.
Open Questions
Support for Packaged Apps?
Receipt Verification
We need the ability to verify that a user has purchased an application and continues to have the right to use it (a right that can be lost through, for example, a refund). This ensures that the user and the product are tied together. Receipts must also be reissued if the key used to sign them is leaked and has to be rotated. However, many calls will not require a full new receipt, just verification.
Free applications do not need to perform verification.
Flow
The user issues a request to the verifier, which checks the purchase DB to make sure the user has purchased the app. If they have, but their receipt is outdated, a new receipt is generated, signed and returned to the user.
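A minimal sketch of this flow follows. It assumes that "outdated" means the receipt was signed with a key other than the current one, and that a refund removes the purchase record; the function, status strings, and stand-in signer are all hypothetical and do not reflect the real verifier API.

```python
def verify_receipt(user_id, app_id, receipt_key_id, purchases, current_key_id, sign):
    """Check a receipt against the purchase records and reissue it if outdated.

    `purchases` is a set of (user_id, app_id) pairs standing in for the
    purchase DB; `sign` stands in for the signing service.
    """
    if (user_id, app_id) not in purchases:
        # Never purchased, or refunded (assumed to remove the record).
        return {"status": "no valid receipt"}
    if receipt_key_id != current_key_id:
        # Valid purchase, but the receipt was signed with a rotated-out key.
        return {"status": "reissued", "receipt": sign(user_id, app_id)}
    return {"status": "ok"}


purchases = {("alice", "app-1")}


def fake_sign(user_id, app_id):
    # Placeholder for the signing service; not a real signature.
    return f"receipt:{user_id}:{app_id}:key-2"


print(verify_receipt("alice", "app-1", "key-1", purchases, "key-2", fake_sign))
```

Only the reissue branch touches the signing service, which matches the note that many calls need verification alone.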
Failure Scenarios
- User cannot log in
- Database Error prevents retrieval of the receipt
- Database is behind and doesn't have receipt
- Signing service failure
Alternate Paths
Having the service in multiple colos, with client-side fallback as well as load balancing, could mitigate network problems.
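Client-side fallback across colos could be sketched as trying each endpoint in order and moving on when a connection fails. The colo names and the callables standing in for network requests are assumptions for illustration.

```python
def call_with_fallback(endpoints, request):
    """Try each (name, callable) endpoint in order; return (name, response).

    Only connection-level failures trigger fallback. If every endpoint
    fails, a ConnectionError summarising all attempts is raised.
    """
    errors = []
    for name, call in endpoints:
        try:
            return name, call(request)
        except ConnectionError as exc:
            errors.append(f"{name}: {exc}")
    raise ConnectionError("all colos failed: " + "; ".join(errors))


def colo_a(request):
    # Simulates a colo that does not resolve.
    raise ConnectionError("colo A did not resolve")


def colo_b(request):
    # Simulates a healthy colo.
    return {"verified": True, "request": request}


served_by, response = call_with_fallback(
    [("colo-a", colo_a), ("colo-b", colo_b)], "receipt-123"
)
print(served_by, response)
```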
Much of the user-visible problem can be mitigated with a little client intelligence. If the user has a receipt that needs reverifying and the server responds with an error other than "no valid receipt", then the client should consider it still good for some period of time.
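The client rule described above can be sketched as follows. The 24-hour grace period and the response strings other than "no valid receipt" are assumptions; the page only says "some period of time".

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(hours=24)  # assumed value; the real period is undecided


def receipt_still_usable(server_response, last_verified, now, grace=GRACE_PERIOD):
    """Decide on the client whether a receipt should still be honoured."""
    if server_response == "ok":
        return True
    if server_response == "no valid receipt":
        # The server positively rejected the receipt: never extend grace.
        return False
    # Any other outcome (timeout, server error, ...) is treated as an outage,
    # so the last successful verification stays good for the grace period.
    return now - last_verified <= grace


now = datetime(2012, 6, 1, 12, 0, tzinfo=timezone.utc)
print(receipt_still_usable("timeout", now - timedelta(hours=2), now))
```

A longer grace period would also soften the low-connectivity case raised in the open questions, at the cost of honouring refunded receipts for longer.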
Guarantees
Need a guarantee for receipt verification: 99.5%
Has a dependency on the signing service, which is also 99.5%. That may cause issues, though most calls will not be using the signing service.
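The effect of this dependency can be estimated with a short calculation. The sketch assumes the two services fail independently, which may not hold if they share a colo or network, and the 10% share of calls that need signing is a made-up figure for illustration.

```python
def combined_availability(verifier, signer, signing_fraction):
    """Expected verification availability when only some calls need the signer.

    A call that needs signing succeeds only if both services are up; a call
    that does not needs the verifier alone. Independence is assumed.
    """
    if not 0.0 <= signing_fraction <= 1.0:
        raise ValueError("signing_fraction must be between 0 and 1")
    return verifier * (1.0 - signing_fraction * (1.0 - signer))


worst_case = combined_availability(0.995, 0.995, 1.0)   # every call re-signs
typical = combined_availability(0.995, 0.995, 0.1)      # hypothetical 10% re-sign
print(f"worst case: {worst_case:.4%}, with 10% signing: {typical:.4%}")
```

If every call needed signing, the combined figure would drop to about 99.0%, below the 99.5% target; the fewer calls that reissue receipts, the closer the result stays to the verifier's own availability.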
Dev Plan
As Receipt verification is currently an integrated part of Marketplace, it will initially be a challenge to pull it out into its own SLA. As we move towards a more service-oriented architecture, it will become easier.
Open Questions
- Will there be an admin on call 24/7?
- Is this still the model for receipts?
- How do we handle low-cell-service areas where a user may be off internet for longer periods?