Marketplace/HAtesting

Revision as of 20:01, 26 October 2012 by Telliott

Introduction

As part of HA testing, we need to figure out what happens when various pieces of the system break, with a long-term goal of making sure that these problems do not bring the rest of the system down. Each of these failures should be simulated while the system is under load. Simulations should reflect likely areas of degradation - unavailability, slowness, desynchronization.

As we move to a more SOA-style marketplace, many of these will become easier to test, as the points of contact will be more controlled and easier to identify. Some, such as the mysql master, will always be central to the system, though.

Below is a list of tests that should be run to test HA readiness, including their results. A failed test is not necessarily a problem - after all, if the master load balancer goes down, you're not going to have a good time - but it helps us identify areas to prioritize in becoming HA.
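
One way to fill in the "Result:" fields consistently is to run a small probe against the site while each simulation is in progress and tally what it sees. The sketch below is only an illustration: `classify`, `run_probe`, `fetch_homepage`, the 2-second slowness threshold and the localhost URL are all assumptions, not part of any existing Marketplace tooling. It covers unavailability and slowness only; desynchronization would need a content-level check.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def classify(status, elapsed, slow_after=2.0):
    """Map one probe outcome to 'unavailable', 'slow' or 'ok'.

    status is an HTTP status code, or None if the request failed outright.
    slow_after is an assumed threshold in seconds.
    """
    if status is None or status >= 500:
        return "unavailable"
    if elapsed > slow_after:
        return "slow"
    return "ok"


def run_probe(fetch, count, interval=0.0, slow_after=2.0):
    """Call fetch() `count` times and tally the classified outcomes.

    fetch is any callable that returns an HTTP status code, or raises
    OSError when the request cannot be completed.
    """
    tally = Counter()
    for _ in range(count):
        start = time.monotonic()
        try:
            status = fetch()
        except OSError:
            status = None
        tally[classify(status, time.monotonic() - start, slow_after)] += 1
        if interval:
            time.sleep(interval)
    return tally


def fetch_homepage(url="http://localhost:8000/"):
    """Example fetch callable; the URL is a placeholder."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code worth recording.
        return err.code
```

Running `run_probe(fetch_homepage, count=60, interval=1.0)` for the duration of a simulation gives a rough ok/slow/unavailable count that can be pasted into the matching result field.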

Identified components/problems that could occur

Webserver

webserver dies

Simulation: While running under load, shut down the webserver on one of the frontend machines

Result:

Mysql

Mysql master dies

Simulation: While running under load, perform a mysql shutdown on the master DB

Result:

Mysql switches masters

(see https://bugzilla.mozilla.org/show_bug.cgi?id=804255)

Simulation: While running under load, trigger a failover to a new master DB

Result:

Mysql slave dies

Simulation: While running under load, do a mysql shutdown on one of the non-master dbs

Result:

Mysql load balancer dies

Note that this is the equivalent of just turning off all of mysql. We're not expecting to survive this, just to see how gracefully the frontend handles it.

Simulation: While running under load, tell the load balancer to stop serving traffic.

Result:

Slow Mysql replication

Simulation: While running under load, turn off replication to one of the mysql slaves.

Result:

Slow Mysql processing

Simulation: While running under load, delay all queries from mysql by 10/20/30s. (We can obviously delay the connection; can we write a proxy that does a sleep in mysql?)
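
The connection-level delay could be approximated with a small TCP proxy placed between the application and mysql. The following is a minimal sketch under stated assumptions, not a tested tool: `pipe` and `start_delay_proxy` are hypothetical names, and the proxy knows nothing about the mysql protocol. It simply holds back every chunk of bytes coming from the upstream server, so the initial server greeting is delayed too, and a large result set that arrives in several chunks is delayed more than once. It does not perform a sleep inside mysql itself.

```python
import asyncio


async def pipe(reader, writer, delay=0.0):
    """Copy bytes from reader to writer, sleeping `delay` seconds per chunk."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            if delay:
                await asyncio.sleep(delay)
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass
    finally:
        writer.close()


async def start_delay_proxy(listen_port, upstream_host, upstream_port, delay):
    """Listen on 127.0.0.1:listen_port and forward to the upstream server.

    Traffic towards the upstream passes straight through; traffic coming
    back from it is held for `delay` seconds per chunk.
    """
    async def handle(client_reader, client_writer):
        try:
            up_reader, up_writer = await asyncio.open_connection(
                upstream_host, upstream_port
            )
        except OSError:
            client_writer.close()
            return
        await asyncio.gather(
            pipe(client_reader, up_writer),         # requests: no delay
            pipe(up_reader, client_writer, delay),  # replies: delayed
        )

    return await asyncio.start_server(handle, "127.0.0.1", listen_port)
```

To use it, the application's database settings would point at the proxy port instead of the real server, with `delay` set to 10, 20 or 30 for successive runs. The host and port values are whatever the test environment uses; none are assumed here.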

Result:

Elastic Search

Elastic Search node dies

Simulation: While running under load, bring down one of the Elastic Search nodes. Are there visible changes to the site?

Result:

Elastic Search load balancer dies

Processing Queues and Automated Tasks

Celery dies

Rabbitmq dies

Cron jobs stop running or die

Redis

Redis node dies

Redis dies

Memcache

memcache dies

memcache load balancer dies

Signing Services

Receipt signing service unavailable

Receipt signing service slow

JAR signing service unavailable

Payments

payment processing server dies (solitude)

payment processing load balancer dies (solitude)

payment gateway server dies (webpay)

payment gateway load balancer dies (webpay)

payment service (e.g. Bango or PayPal) dies

Monitoring

webtrends/analytics goes down

statsd/graphite/sentry goes down

syslog, cef or metlog goes down

Miscellaneous

backend storage for images and applications dies

recaptcha unavailable

browserid unavailable

email server unavailable

outgoing.mozilla.org

REALLY BIG STUFF

dns resolution dies

cdn dies

webserver front end load balancer dies

Simulation: While running under load, perform a shutdown on the web load balancer

Result: