- 1 Introduction
- 2 Identified components/problems that could occur
- 2.1 Webserver
- 2.2 Mysql
- 2.3 Elastic Search
- 2.4 Processing Queues and Automated Tasks
- 2.5 Redis
- 2.6 Memcache
- 2.7 Signing Services
- 2.8 Payments
- 2.9 Monitoring
- 2.10 Miscellaneous
- 2.11 REALLY BIG STUFF
As part of HA testing, we need to identify single points of failure within the system and figure out what happens when those pieces break, with the long-term goal of making sure that these failures do not bring the rest of the system down. Each failure should be simulated while the system is under load unless otherwise noted. Simulations should reflect the likely kinds of degradation: unavailability, slowness, and desynchronization.
As we move to a more SOA-style marketplace, many of these will become easier to test, as the points of contact will be more controlled and easier to identify. Some, such as the mysql master, will always be central to the system, though.
Below is a list of tests that should be run to test HA readiness, including their results. A failed test is not necessarily a problem (after all, if the master load balancer goes down, you're not going to have a good time), but it helps us identify which areas to prioritize in becoming HA.
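To record results consistently, it may help to poll a representative URL while each failure is simulated and log the outcome and latency of every request. The following is a minimal sketch in Python, assuming plain HTTP GETs are a fair stand-in for user traffic. The function name `sample_endpoint`, its parameters, and any URL passed to it are hypothetical and not part of any existing tooling.

```python
import http.client
import time
import urllib.error
import urllib.request


def sample_endpoint(url, count, interval=1.0, timeout=5.0):
    """Request `url` `count` times; return a list of (outcome, seconds).

    `outcome` is the HTTP status code when the server answered, or the
    name of the exception class when the request failed outright.
    """
    results = []
    for _ in range(count):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                resp.read()
                outcome = resp.status
        except urllib.error.HTTPError as exc:
            # The server answered, but with a 4xx/5xx status.
            outcome = exc.code
        except (OSError, http.client.HTTPException) as exc:
            # Connection refused, timeout, reset, truncated response, etc.
            outcome = type(exc).__name__
        results.append((outcome, time.monotonic() - start))
        time.sleep(interval)
    return results
```

Running this before, during, and after a simulation gives three comparable samples, so a change in error rate or latency can be attributed to the simulated failure rather than to normal variation.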
Identified components/problems that could occur
Simulation: While running under load, shut down the webserver on one of the frontend machines
Mysql master dies
Simulation: While running under load, perform a mysql shutdown on the master DB
Mysql switches masters
(see https://bugzilla.mozilla.org/show_bug.cgi?id=804255)
Simulation: While running under load, trigger a failover to a new master DB
Mysql slave dies
Simulation: While running under load, do a mysql shutdown on one of the non-master dbs
Mysql load balancer dies
Note that this is the equivalent of just turning off all of mysql. We're not expecting to survive this, just to see how gracefully the frontend handles it.
Simulation: While running under load, tell the load balancer to stop serving traffic.
Slow Mysql replication
Simulation: While running under load, turn off replication to one of the mysql slaves.
Slow Mysql processing
Simulation: While running under load, delay all query responses from mysql by 10/20/30s. (We can obviously delay the connection; can we write a proxy that does a sleep in mysql?)
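One way to approach the "delay the connection" side of this is a small TCP proxy that sits between the application and the database and sleeps before relaying each chunk of the server's response. The sketch below is written in Python with `asyncio`. It is an assumption about how such a proxy could look, not an existing tool, and the names `start_delay_proxy` and `_pipe` are made up for illustration. Because it works at the TCP level, it delays every chunk coming back from the server (including the initial handshake bytes), not individual queries, so large result sets may be delayed more than once.

```python
import asyncio


async def _pipe(reader, writer, delay):
    """Copy bytes from reader to writer, sleeping `delay` seconds per chunk."""
    try:
        while True:
            chunk = await reader.read(65536)
            if not chunk:
                break  # the peer closed its side
            await asyncio.sleep(delay)
            writer.write(chunk)
            await writer.drain()
    except ConnectionError:
        pass  # the peer went away mid-transfer
    finally:
        writer.close()


async def start_delay_proxy(listen_port, upstream_host, upstream_port, delay):
    """Listen on 127.0.0.1:listen_port and relay to the upstream service.

    Traffic from the client is forwarded immediately; traffic coming
    back from the upstream is held for `delay` seconds per chunk.
    """
    async def handle(client_reader, client_writer):
        try:
            up_reader, up_writer = await asyncio.open_connection(
                upstream_host, upstream_port)
        except OSError:
            client_writer.close()  # upstream unreachable: drop the client
            return
        await asyncio.gather(
            _pipe(client_reader, up_writer, 0),
            _pipe(up_reader, client_writer, delay),
        )

    return await asyncio.start_server(handle, "127.0.0.1", listen_port)
```

To use it, start the proxy inside a coroutine, await `serve_forever()` on the returned server, and point the application's database host and port at the proxy instead of the real server. Since the proxy does not inspect the protocol, the same sketch should in principle also cover the slow Elastic Search, redis, and memcache simulations, though that has not been verified here.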
Elastic Search dies
Simulation: While running under load, bring down all Elastic Search nodes.
Elastic Search is slow
Simulation: While running under load, add 30 seconds to Elastic Search responses.
Elastic Search node dies
Simulation: While running under load, bring down one of the Elastic Search nodes. Are there visible changes to the site?
Elastic Search load balancer dies
Processing Queues and Automated Tasks
Cron jobs stop running or die
Simulation: While the site is running normally, turn off all marketplace cron jobs for 48 hours. Examine the site for notable deviations from expected values.
(Note: we need to identify all the cron jobs. Some may need to be called out individually.)
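As a starting point for identifying the cron jobs, the crontab text on each machine can be reduced to a list of schedules and commands. The following is a minimal sketch in Python, assuming standard crontab syntax (five schedule fields or an `@` shortcut, optionally followed by a user column as in system crontabs). The function name `list_cron_jobs` and the `has_user_field` parameter are hypothetical.

```python
def list_cron_jobs(crontab_text, has_user_field=False):
    """Return (schedule, command) pairs for each job line in crontab text.

    Set has_user_field=True for system-style crontabs, which carry a
    user name between the schedule and the command.
    """
    jobs = []
    extra = 1 if has_user_field else 0
    for raw in crontab_text.splitlines():
        line = raw.strip()
        # Job lines start with a digit, '*', or '@'. This skips blank
        # lines, comments, and VAR=value assignments such as MAILTO.
        if not line or line[0] not in "0123456789*@":
            continue
        n_sched = 1 if line.startswith("@") else 5
        parts = line.split(None, n_sched + extra)
        if len(parts) <= n_sched + extra:
            continue  # malformed line with no command
        jobs.append((" ".join(parts[:n_sched]), parts[-1]))
    return jobs
```

Feeding it the output of `crontab -l`, or the contents of a system crontab file with `has_user_field=True`, yields a list that can be checked off job by job when deciding which ones to call out individually.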
Redis node dies
Simulation: While running under load, bring down one of the redis nodes.
Redis responds slowly
Simulation: While running under load, add a 1s delay to all redis calls.
Memcache node dies
Simulation: While running under load, bring down a single memcache node.
Simulation: While running under load, turn memcache off on all memcache nodes
Memcache responds slowly
Simulation: While running under load, add a 1s delay to all results coming from memcache
Simulation: While running purchasing and receipt verification, shut off receipt signing.
Receipt signing service slow
Simulation: While running purchasing and receipt verification, add a 20s/30s delay to the query. (Question: how long will the client hold the connection open? We should test both sides of that)
Simulation: While testing the approval process, turn off the JAR signing service
Single payment gateway server dies (Webpay)
Simulation: While users attempt to make a purchase, kill a gateway server.
Payment gateway load balancer dies (Webpay)
Simulation: While users attempt to make a purchase, kill the load balancer. This effectively removes the payment service.
Payment processing server dies (Solitude)
Simulation: While attempting to make purchases, bring down one of the Solitude servers
Payment processing load balancer dies (Solitude)
Simulation: While attempting to make purchases, bring down all of Solitude.
Payment service (Bango or Paypal) dies
Simulation: While making purchases, sever the connection between the payment servers and Paypal (blackhole the address in the configuration?)
Webtrends/analytics goes down
Statsd/graphite/sentry goes down
Syslog, CEF or Metlog goes down
Backend storage for images and applications dies
Simulation: While running under load, turn off the NFS server hosting our images and applications
Simulation: Blackhole the IP for recaptcha and attempt to register an account.
Simulation: Turn off browserid (or blackhole it), then attempt to use the site. Should include some registration.
Outgoing.mozilla.org not responsive
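For the unresponsive third-party cases in this section, one option that avoids touching network routing is to point the service's address in the configuration at a local "blackhole" listener that accepts connections and never answers. The sketch below is a rough Python illustration under that assumption. `start_blackhole` and `probe` are hypothetical helper names, and a real blackholed IP would typically stall at connect time rather than after the connection is accepted, so this only approximates the behaviour.

```python
import socket
import threading


def start_blackhole(host="127.0.0.1", port=0):
    """Accept connections on host:port and never send a byte back.

    Returns (listening_socket, port). Port 0 lets the OS pick a free port.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(16)
    held = []  # keep accepted sockets referenced so they stay open

    def accept_loop():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return server, server.getsockname()[1]


def probe(host, port, timeout):
    """Return 'answered', 'closed', 'timeout', or 'refused' for one request."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"GET / HTTP/1.0\r\n\r\n")
            return "answered" if s.recv(1) else "closed"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
```

A `probe` call against the blackhole port should give up after the timeout, which is the behaviour the site itself needs to show: the interesting result of the simulation is whether our own client code has a timeout at all, and what the user sees when it fires.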
REALLY BIG STUFF
DNS resolution dies
Simulation: (I have no idea how to test this one)
Simulation: Is this realistically testable? It's out of our hands entirely.
Webserver front end load balancer dies
Simulation: While running under load, perform a shutdown on the web load balancer