Marketplace/HAtesting

The Marketplace has been placed into maintenance mode. It is no longer under active development.

Introduction

As part of HA testing, we need to identify single points of failure within the system and figure out what happens when those pieces break, with the long-term goal of making sure that these problems do not bring the rest of the system down. Each failure should be simulated while the system is under load unless otherwise noted. Simulations should reflect likely areas of degradation: unavailability, slowness, and desynchronization.

As we move to a more SOA-style marketplace, many of these will become easier to test, as the points of contact will be more controlled and easier to identify. Some, such as the MySQL master, will always be central to the system, though.

Below is a list of tests that should be run to test HA readiness, including their results. A failed test is not necessarily a problem (after all, if the master load balancer goes down, you're not going to have a good time), but it helps us identify areas to prioritize in becoming HA.

Identified components/problems that could occur

Webserver

Webserver dies

Simulation: While running under load, shut down the webserver on one of the frontend machines

Result:

Mysql

Mysql master dies

Simulation: While running under load, perform a mysql shutdown on the master DB

Result:

Mysql switches masters

(see https://bugzilla.mozilla.org/show_bug.cgi?id=804255)

Simulation: While running under load, trigger a failover to a new master DB

Result:

Mysql slave dies

Simulation: While running under load, do a mysql shutdown on one of the non-master dbs

Result:

Mysql load balancer dies

Note that this is the equivalent of just turning off all of mysql. We're not expecting to survive this, just to see how gracefully the frontend handles it.

Simulation: While running under load, tell the load balancer to stop serving traffic.

Result:

Slow Mysql replication

Simulation: While running under load, turn off replication to one of the mysql slaves.
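
One way to observe the effect of this simulation is to poll the slave's replication lag while the test runs. The sketch below is only an illustration: `replication_lag` is a hypothetical helper, not part of the Marketplace codebase, and it assumes a Python DB-API cursor connected to the slave. It reads the `Seconds_Behind_Master` field from `SHOW SLAVE STATUS`. MySQL reports that field as NULL when replication is not running, so with replication turned off the helper is expected to return None rather than a growing number.

```python
def replication_lag(cursor):
    """Return the slave's lag in seconds, or None if replication is stopped.

    `cursor` is assumed to be any DB-API cursor connected to a MySQL slave.
    Raises ValueError if the server returns no slave status row at all,
    which is what a server that is not configured as a slave does.
    """
    cursor.execute("SHOW SLAVE STATUS")
    row = cursor.fetchone()
    if row is None:
        raise ValueError("server is not configured as a replication slave")
    # Map column names to values so we do not depend on column positions.
    columns = [col[0] for col in cursor.description]
    status = dict(zip(columns, row))
    # NULL (None) here means the slave threads are not running.
    return status["Seconds_Behind_Master"]
```

Sampling this value every few seconds during the test, alongside the frontend error rate, would show whether stale reads from the lagging slave are visible on the site.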

Result:

Slow Mysql processing

Simulation: While running under load, delay all query responses from MySQL by 10/20/30s. (We can obviously delay the connection; can we write a proxy that sleeps on each MySQL query?)
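
A minimal sketch of such a proxy, assuming it is acceptable to delay at the TCP level rather than per query: it listens on a local port, forwards client bytes to the real server untouched, and sleeps before relaying each chunk of the server's response. The names `pipe` and `start_delay_proxy` and the ports in the usage comment are illustrative assumptions. Because it knows nothing about the MySQL protocol, a large result set that arrives in several chunks is delayed once per chunk, and the initial server handshake is delayed as well.

```python
import asyncio


async def pipe(reader, writer, delay=0.0):
    """Copy bytes from reader to writer, sleeping `delay` seconds per chunk."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:  # peer closed its side
                break
            if delay:
                await asyncio.sleep(delay)
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass  # either side dropped the connection mid-transfer
    finally:
        writer.close()


async def start_delay_proxy(listen_port, upstream_host, upstream_port, delay):
    """Start a TCP proxy on 127.0.0.1 that delays upstream responses."""

    async def handle(client_reader, client_writer):
        try:
            up_reader, up_writer = await asyncio.open_connection(
                upstream_host, upstream_port)
        except OSError:
            client_writer.close()  # upstream is down: drop the client
            return
        await asyncio.gather(
            pipe(client_reader, up_writer),         # requests pass straight through
            pipe(up_reader, client_writer, delay),  # responses are held back
        )

    return await asyncio.start_server(handle, "127.0.0.1", listen_port)


# Hypothetical usage: expose a 10s-delayed view of a local MySQL on port 13306.
#
#   async def main():
#       server = await start_delay_proxy(13306, "127.0.0.1", 3306, 10.0)
#       async with server:
#           await server.serve_forever()
#
#   asyncio.run(main())
```

Since the proxy is protocol-agnostic, the same sketch could in principle sit in front of Redis or memcache for the slow-response simulations, with the application configured to connect to the proxy port instead of the real service.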

Result:

Elastic Search

Elastic Search dies

Simulation: While running under load, bring down all Elastic Search nodes.

Result:

Elastic Search is slow

Simulation: While running under load, add 30 seconds to Elastic Search response times.

Result:

Elastic Search node dies

Simulation: While running under load, bring down one of the Elastic Search nodes. Are there visible changes to the site?

Result:

Elastic Search load balancer dies

Processing Queues and Automated Tasks

Celery dies

Simulation:

Result:

Rabbitmq dies

Simulation:

Result:

Cron jobs stop running or die

Simulation: While the site is running normally, turn off all marketplace cron jobs for 48 hours. Examine the site for notable deviations from expected values.

(Note: need to identify all the crons. May need to call some out individually)

Result:

Redis

Redis node dies

Simulation: While running under load, bring down one of the redis nodes.

Result:

Redis responds slowly

Simulation: Add 1s delay to all redis calls

Result:

Memcache

Memcache node dies

Simulation: While running under load, bring down a single memcache node.

Result:

Memcache dies

Simulation: While running under load, turn memcache off on all memcache nodes

Result:

Memcache responds slowly

Simulation: While running under load, add a 1s delay to all results coming from memcache

Result:

Signing Services

Receipt signing service unavailable

Simulation: While running purchasing and receipt verification, shut off receipt signing.

Result:

Receipt signing service slow

Simulation: While running purchasing and receipt verification, add a 20s/30s delay to the query. (Question: how long will the client hold the connection open? We should test delays on both sides of that limit.)
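
To test both sides of the client's limit, it helps to record whether each request completed or timed out and how long it took. The following is a rough sketch using only the Python standard library; `probe` is a hypothetical helper, and the URL passed to it would be whatever receipt endpoint is under test (none is assumed here). Running it once with a timeout shorter than the injected delay and once with a longer one covers both cases.

```python
import socket
import time
import urllib.error
import urllib.request


def probe(url, timeout):
    """Fetch `url` once and return (outcome, elapsed_seconds).

    outcome is "ok", "http_<code>", "timeout", or "error".
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            outcome = "ok" if resp.status == 200 else "http_%d" % resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        outcome = "http_%d" % exc.code
    except (socket.timeout, urllib.error.URLError) as exc:
        # A timeout may surface directly or wrapped in URLError.reason.
        reason = getattr(exc, "reason", exc)
        outcome = "timeout" if isinstance(reason, socket.timeout) else "error"
    return outcome, time.monotonic() - start
```

This only measures a test client's behaviour; how long a real client holds the connection open still has to be confirmed separately.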

Result:

JAR signing service unavailable

Simulation: While testing the approval process, turn off the JAR signing service

Result:

Payments

Single payment gateway server dies (Webpay)

Simulation: While users attempt to make a purchase, kill a gateway server.

Result:

Payment gateway load balancer dies (Webpay)

Simulation: While users attempt to make a purchase, kill the load balancer. This effectively removes the payment service.

Result:

Payment processing server dies (Solitude)

Simulation: While attempting to make purchases, bring down one of the Solitude servers

Result:

Payment processing load balancer dies (Solitude)

Simulation: While attempting to make purchases, bring down all of Solitude.

Result:

Payment service (Bango or PayPal) dies

Simulation: While making purchases, sever the connection between the payment servers and PayPal (blackhole the address in the configuration?)

Result:

Monitoring

Webtrends/analytics goes down

Simulation:

Result:

Statsd/graphite/sentry goes down

Simulation:

Result:

Syslog, CEF or Metlog goes down

Simulation:

Result:

Miscellaneous

Backend storage for images and applications dies

Simulation: While running under load, turn off the NFS server hosting our images and applications

Result:

Recaptcha unavailable

Simulation: Blackhole the IP for Recaptcha and attempt to register an account.

Result:

Browserid unavailable

Simulation: Turn off browserid (or blackhole it), then attempt to use the site. The test should include some registration.

Result:

Email server unavailable

Simulation:

Result:

Outgoing.mozilla.org not responsive

Simulation:

Result:

REALLY BIG STUFF

DNS resolution dies

Simulation: (I have no idea how to test this one)

Result:

CDN dies

Simulation: Is this realistically testable? It's out of our hands entirely.

Result:

Webserver front end load balancer dies

Simulation: While running under load, perform a shutdown on the web load balancer

Result: