Marketplace/HAtesting
Introduction
As part of HA testing, we need to figure out what happens when various pieces of the system break, with a long-term goal of making sure that these failures do not bring the rest of the system down. Each failure should be simulated while the system is under load. Simulations should reflect likely areas of degradation: unavailability, slowness, and desynchronization.
As we move to a more SOA-style marketplace, many of these will become easier to test, as the points of contact will be more controlled and easier to identify. Some, such as the mysql master, will always be central to the system, though.
Below is a list of tests that should be run to test HA readiness, including their results. A failed test is not necessarily a problem (after all, if the master load balancer goes down, you're not going to have a good time), but it helps us identify areas to prioritize in becoming HA.
Identified components/problems that could occur
Webserver
webserver dies
Simulation: While running under load, shut down the webserver on one of the frontend machines
Result:
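To fill in a Result for a simulation like this one, it helps to record what a client actually sees while the component is down. The following is a minimal sketch of such a probe, using only the Python standard library. The function name `probe`, its parameters, and the idea of counting outcomes per status code are illustrative assumptions, not part of any existing Marketplace tooling.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def probe(url, attempts=20, interval=0.5, timeout=5.0):
    """Poll `url` repeatedly; return (Counter of outcomes, list of latencies in seconds).

    Hypothetical helper for observing a frontend during a failure simulation.
    """
    outcomes = Counter()
    latencies = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                outcomes[resp.status] += 1
        except urllib.error.HTTPError as exc:
            # The server answered, but with a 4xx/5xx status.
            outcomes[exc.code] += 1
        except (urllib.error.URLError, OSError):
            # Connection refused, DNS failure, timeout, and similar.
            outcomes["unreachable"] += 1
        latencies.append(time.monotonic() - start)
        time.sleep(interval)
    return outcomes, latencies
```

Running `probe` against the site's public URL before, during, and after the shutdown gives a rough count of successful, failed, and unreachable requests. That is usually enough to tell whether the load balancer routed around the dead webserver.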
Mysql
Mysql master dies
Simulation: While running under load, perform a mysql shutdown on the master DB
Result:
Mysql switches masters
(see https://bugzilla.mozilla.org/show_bug.cgi?id=804255)
Simulation: While running under load, trigger a failover to a new master DB
Result:
Mysql slave dies
Simulation: While running under load, do a mysql shutdown on one of the non-master dbs
Result:
Mysql load balancer dies
Note that this is the equivalent of just turning off all of mysql. We're not expecting to survive this, just to see how gracefully the frontend handles it.
Simulation: While running under load, tell the load balancer to stop serving traffic.
Result:
Slow Mysql replication
Simulation: While running under load, turn off replication to one of the mysql slaves.
Result:
Slow Mysql processing
Simulation: While running under load, delay all queries from mysql by 10/20/30s. (We can obviously delay the connection; can we write a proxy that does a sleep in mysql?)
Result:
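One possible way to approximate the "delay the connection" option is a small TCP proxy that sits between the application and the database and holds back each chunk of server response for a fixed time. The sketch below uses Python's `asyncio` streams. It works purely at the TCP level and knows nothing about the MySQL protocol, so it is not the "sleep in mysql" variant. It also delays the initial server greeting, and a large result set split across several chunks is delayed once per chunk. The names `start_delay_proxy` and `_pipe` are hypothetical.

```python
import asyncio


async def _pipe(reader, writer, delay):
    """Copy bytes from reader to writer, sleeping `delay` seconds before each chunk."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break  # peer closed its side
            await asyncio.sleep(delay)
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass  # either side dropped the connection mid-transfer
    finally:
        writer.close()


async def start_delay_proxy(listen_port, upstream_host, upstream_port, delay,
                            listen_host="127.0.0.1"):
    """Start a TCP proxy that delays upstream responses by `delay` seconds.

    Hypothetical helper: returns the asyncio server object so the caller
    can serve forever or close it.
    """
    async def handle(client_reader, client_writer):
        try:
            up_reader, up_writer = await asyncio.open_connection(
                upstream_host, upstream_port)
        except OSError:
            client_writer.close()  # upstream unreachable
            return
        await asyncio.gather(
            _pipe(client_reader, up_writer, 0),      # client -> upstream, undelayed
            _pipe(up_reader, client_writer, delay),  # upstream -> client, delayed
        )

    return await asyncio.start_server(handle, listen_host, listen_port)
```

To use it, the application's database settings would point at the proxy's listen port instead of the real server, for example a proxy on port 3307 forwarding to the real database on port 3306 with `delay=10`, then 20, then 30. The port numbers here are only examples.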
Elastic Search
Elastic Search node dies
Simulation: While running under load, bring down one of the Elastic Search nodes. Are there visible changes to the site?
Result:
Elastic Search load balancer dies
Processing Queues and Automated Tasks
Celery dies
Rabbitmq dies
Cron jobs stop running or die
Redis
Redis node dies
Redis dies
Memcache
memcache dies
memcache load balancer dies
Signing Services
Receipt signing service slow
Payments
payment processing server dies (solitude)
payment processing load balancer dies (solitude)
payment gateway server dies (webpay)
payment gateway load balancer dies (webpay)
payment service (e.g. bango or paypal) dies
Monitoring
webtrends/analytics goes down
statsd/graphite/sentry goes down
syslog, cef or metlog goes down
Miscellaneous
backend storage for images and applications dies
outgoing.mozilla.org
REALLY BIG STUFF
dns resolution dies
cdn dies
webserver front end load balancer dies
Simulation: While running under load, perform a shutdown on the web load balancer
Result: