Marketplace/HAResults

From MozillaWiki

Results

TL;DR

  • RabbitMQ is a single point of failure: the service breaks when the process is down on one of the nodes.
  • MySQL is a single point of failure when the master(s) is down.
  • The Marketplace app starts to get lock errors in the DB as soon as we apply a modest content-addition load (>100 RPS) (bug 823054), so it does not scale unless we remove this part of our load scenarios.
  • On some webheads, Marketplace complains that the "GeoIP server" is not installed (bug 823697).

[Image: marketplace.png]


The following table summarizes the availability of Marketplace depending on the state of each back end. We assign an HA grade to each back end based on the observed Marketplace behavior. Where applicable, follow-up bugs are linked in the table.


Backend | HA Grade | Related Bugs
Elasticsearch | B |
Membase | B | #819876
Redis | B |
MySQL | E |
RabbitMQ & Celery | C | #823510
SMTPD | B |
Statsd & Graphite | B |
Logstash & Metlog | B |


HA Grades:

  • A: No interruption of service at all
  • B: Partial interruption of service when the whole cluster is taken down
  • C: Partial interruption of service when one part of the cluster is down
  • D: Full interruption of service when the whole cluster is taken down
  • E: Full interruption of service when one part of the cluster is taken down

Elasticsearch

results

Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing
Slave Down | OK | OK | OK | OK | OK | OK
Master Down | OK | OK | OK | OK | OK | OK
Everything Down | KO [2] | OK | OK | KO [1] | OK | OK
Everything Hung/Slowed | KO [2] | OK | OK | KO [1] | OK | OK

notes

  • [1] New apps are not indexed: the celeryd task fails.
  • [2] The website hangs for 30 seconds.

recommendations

  • On indexing errors (cron or celeryd), we should keep the job somewhere so it can be replayed if possible. See apps/addons/tasks.py:index_addons
  • Shorter timeouts in the view and in the cron/task before they fail: 5 seconds seems better for the UI, and maybe 10 seconds for the cron/task.
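The replay idea above can be sketched as follows. This is a hypothetical stand-in, not the actual apps/addons/tasks.py code: the queue, function names, and signatures are made up, and a real implementation would persist failed jobs somewhere durable (a DB table, say) rather than in memory.

```python
import queue

# Stand-in for a persistent store of failed indexing jobs.
replay_queue = queue.Queue()

def index_addons(ids, index_fn):
    """Try to index each add-on; park failed ids for later replay
    instead of losing them when Elasticsearch is down or hanging."""
    failed = []
    for addon_id in ids:
        try:
            index_fn(addon_id)
        except Exception:
            failed.append(addon_id)
            replay_queue.put(addon_id)  # keep the job so a cron can replay it
    return failed

def replay_failed(index_fn):
    """Cron entry point: drain the replay queue and retry indexing."""
    while not replay_queue.empty():
        addon_id = replay_queue.get()
        try:
            index_fn(addon_id)
        except Exception:
            replay_queue.put(addon_id)  # still failing: keep it queued, stop
            break
```

With this pattern an Elasticsearch outage degrades to delayed indexing rather than silently dropped tasks.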

Membase

results

Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing
Slave Down | OK | OK | OK | OK | OK | OK
Master Down | OK | OK | OK | OK | OK | OK
Everything Down | OK | OK | KO | KO | OK | OK
Everything Hung/Slowed | OK | OK | KO | KO | OK | OK

notes

  • I have seen huge chunks of data (templates) being cached - larger than 1 MB, IIRC. We should avoid this.
  • XXX (to check): should we protect every call to memcache and make sure the app state survives a failure?
  • Why is Membase mandatory for app submissions, etc.?

recommendations

  • Is Membase the best place to cache templates? What about a disk cache?
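The "protect every call to memcache" question above could look something like this sketch. The wrapper names and the 1 MB cap are assumptions for illustration (the cap echoes the oversized-template note), not existing Marketplace code:

```python
MAX_CACHED_BYTES = 1024 * 1024  # refuse to cache huge blobs (> 1 MB templates)

def safe_cache_get(cache, key, default=None):
    """A cache outage degrades to a cache miss, never a 500."""
    try:
        value = cache.get(key)
        return default if value is None else value
    except Exception:
        return default

def safe_cache_set(cache, key, value):
    """Best-effort set: drop oversized payloads, swallow backend errors."""
    if isinstance(value, (bytes, str)) and len(value) > MAX_CACHED_BYTES:
        return False  # avoid stuffing megabyte-sized templates into Membase
    try:
        cache.set(key, value)
        return True
    except Exception:
        return False
```

If every call site went through wrappers like these, taking the whole Membase cluster down should cost performance, not app submissions.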

Redis

results

Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing
Slave Down | OK | OK | OK | OK | OK | OK
Master Down | OK | OK | OK | OK | OK | OK
Everything Down | OK | OK | OK | OK | OK | OK
Everything Hung/Slowed | OK | OK | OK | OK | OK | OK

notes

  • Redis is going to be deprecated, so this is less relevant.
  • Django Cache Machine absorbs all errors in safe_redis().
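The error-absorbing pattern behind safe_redis() can be illustrated with a small decorator. This is a sketch in the spirit of Cache Machine's behavior, not its actual implementation:

```python
import functools

def safe_redis(default=None):
    """Decorator: if the wrapped Redis call raises (connection refused,
    timeout, ...), return `default` instead of propagating the error."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                return default  # Redis outage becomes a harmless no-op
        return wrapper
    return deco
```

This is why every row in the Redis table above stays OK: callers never see the backend failure.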

MySQL

results

Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing
Master Down | KO [0] | KO [0] | KO [0] | KO [0] | KO [0] | OK
Slave Down | OK | OK | OK | OK | OK | OK
Everything Down | KO [2] | KO [2] | KO [2] | KO [2] | KO [2] | OK
Everything Hung/Slowed | OK | OK | KO [1] | KO [1][3] | KO [1] | OK

notes

  • Single point of failure when the master is down.
  • [0] Raw "Internal Server Error" on the web app.
  • [1] No timeouts in the webapp when MySQL hangs.
  • [2] nginx 504 and 502 on the front page.
  • [3] nginx gateway timeout on /developers/submissions.

recommendations

  • Is there a way to avoid raw 504/502 errors? A templatized error page served by Zeus or nginx?
  • We need a timeout in the Marketplace app so we can display a cleaner error before nginx itself times out. Maybe a shorter timeout on reads.
  • Can't the app work in a degraded mode when the master is down?
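One way to get the shorter app-level timeouts recommended above is at the DB driver level. This is a hypothetical Django settings fragment; the option names are MySQLdb/mysqlclient connection options and the exact values would need tuning, but the idea is to fail inside the app before nginx's gateway timeout fires:

```python
# Hypothetical Django settings sketch (database name and values are made up).
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "marketplace",
        "OPTIONS": {
            "connect_timeout": 5,  # fail fast instead of hanging the webhead
            "read_timeout": 10,    # abort slow reads before nginx returns 504
            "write_timeout": 10,
        },
    }
}
```

With timeouts like these the app gets a driver exception it can turn into a clean error page, instead of the raw nginx 502/504 observed in the tests.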

RabbitMQ & Celery

results

Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing
One RabbitMQ node down | OK | OK | KO | KO | KO | KO
One celeryd process down | OK | OK | OK | OK | OK | OK
Everything Down | TBD | TBD | TBD | TBD | TBD | TBD
Everything Hung/Slowed | TBD | TBD | TBD | TBD | TBD | TBD

notes

  • Shutting down one RabbitMQ node breaks Celery - for instance, IOError: Socket closed on upload_manifest. The webhead does not properly fall back to another node.
  • kombu raises errors and the task is lost (XXX verify persistence/replay).
  • When the node comes back online we still face issues.
  • Shutting down one celeryd has no impact: the other celeryd instance picks up the work.

recommendations

  • We should fail over to another RabbitMQ node when the node associated with the webhead is down; kombu seems to be able to do this.
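The failover strategy can be sketched without kombu: try each broker node in turn until one accepts the connection. kombu can do this natively when given several broker URLs; this stdlib version just illustrates the idea, and the hostnames are made up:

```python
# Hypothetical broker list: instead of pinning each webhead to one node,
# keep the whole cluster in config and walk it on connection failure.
BROKER_NODES = ["rabbit1.example.com", "rabbit2.example.com"]

def connect_with_failover(nodes, connect_fn):
    """Return the first connection connect_fn succeeds on; raise only
    when every broker node is down."""
    last_error = None
    for node in nodes:
        try:
            return connect_fn(node)
        except OSError as exc:  # e.g. "Socket closed" / connection refused
            last_error = exc
    raise ConnectionError("all broker nodes down") from last_error
```

With this in place, taking one RabbitMQ node down should cost a reconnect, not the lost tasks seen in the test.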

SMTPD

results

Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing
All down | OK | OK | OK | KO | OK | OK
Hangs | OK | OK | OK | KO | OK | OK
One SMTP Down | Cannot Test | Cannot Test | Cannot Test | Cannot Test | Cannot Test | Cannot Test

notes

  • We could not shut down individual SMTPD nodes because they are used in production, so we only ran a local Vaurien test.
  • When a reviewer accepts an application, a mail is sent out. If smtpd is down at that point, we get a TransactionManagementError from django.db.backends (raised in leave_transaction_management).
  • When SMTPD hangs, we get a raw nginx 504 when accepting apps.
  • Other emails are sent by crons.

recommendations

  • While it's unlikely that both SMTPD servers will be down at once, we could catch the error and just warn the reviewer that the mail was not sent.
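The catch-and-warn idea can be sketched as below. This is hypothetical: send_fn stands in for whatever actually sends the mail in Marketplace (Django's mail machinery), and the function name is made up:

```python
import smtplib

def notify_reviewer_safely(send_fn, message):
    """Return (sent, warning); never let an SMTP failure escape and
    break the review transaction mid-flight."""
    try:
        send_fn(message)
        return True, None
    except (smtplib.SMTPException, OSError):
        # Both SMTPD servers down or hanging: keep the approval alive
        # and surface a warning in the reviewer UI instead.
        return False, "The notification email could not be sent."
```

An SMTP outage then degrades to a missing email plus a visible warning, instead of the TransactionManagementError and raw nginx 504 observed in the tests.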

Statsd & Graphite

results

Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing
statsd down | OK | OK | OK | OK | OK | OK
Graphite Down | TBD | TBD | TBD | TBD | TBD | TBD

notes

  • If statsd is down, the UDP packets are just silently dropped and we don't get stats.
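The silent drop is inherent to the transport: a UDP send succeeds whether or not anything is listening, so the app never notices a dead statsd. The sketch below shows a minimal statsd-style counter over UDP; the metric name, host, and port are illustrative:

```python
import socket

def send_counter(metric, value, host="127.0.0.1", port=8125):
    """Emit a statsd-style counter datagram. Fire-and-forget: this
    returns the number of bytes handed to the kernel, even when no
    statsd process is listening on the other end."""
    payload = "{}:{}|c".format(metric, value).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))
```

This is exactly why statsd going down is invisible to the app: the sendto() call reports success and the datagrams simply vanish.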


LogStash & Metlog

results

Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing
logstash down | OK | OK | OK | OK | OK | OK
Graphite Down | TBD | TBD | TBD | TBD | TBD | TBD

notes

  • If logstash is down, the UDP packets are just silently dropped and we don't get logs.
  • If logstash is up but one of its backend servers (syslog, Sentry, etc.) is down, the UDP packets are just silently dropped and we don't get those logs.
  • Messages are sent to both logstash servers.

recommendations

  • We should check the impact of the duplication of messages.
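Since every message goes to both logstash servers, one way to assess (or neutralize) the duplication downstream would be a dedup filter keyed on a hash of each message. This is purely a sketch of the idea, not existing Marketplace or logstash code:

```python
import hashlib

# Remembers digests of messages already delivered. A real filter would
# bound this set (e.g. an LRU or a time window) to cap memory use.
seen = set()

def dedupe(messages):
    """Yield each distinct message once, dropping the duplicate copy
    delivered via the second logstash server."""
    for msg in messages:
        digest = hashlib.sha1(msg.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield msg
```

Counting how many messages such a filter drops would also quantify the duplication the recommendation asks about.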