Marketplace/HAResults


Revision as of 16:21, 19 December 2012

Preamble

There are 4 major issues right now in the Stage setup when we send load:

Results

The following table summarizes the availability of Marketplace depending on the state of each back end. We provide an HA grade for each back end based on the observed Marketplace behavior. When applicable, follow-up bugs are linked in the table for each back end.


Backend         HA Grade   Notes   Related Bugs
ElasticSearch   B          Notes
Membase         B          Notes   #819876
Redis           B          Notes
MySQL           E          Notes
RabbitMQ        C          Notes
Celery          TBD        Notes


HA Grades:

  • A: No interruption of service at all
  • B: Partial interruption of service when the whole cluster is taken down
  • C: Partial interruption of service when one part of the cluster is down
  • D: Full interruption of service when the whole cluster is taken down
  • E: Full interruption of service when one part of the cluster is taken down

ElasticSearch

results

Failure Searching Browsing Adding content Review content Indexing Self-Healing
Slave Down OK OK OK OK OK OK
Master Down OK OK OK OK OK OK
Everything Down KO [2] OK OK KO [1] OK OK
Everything Hanged/Slowed KO [2] OK OK KO [1] OK OK

notes

  • [1] the new apps are not indexed - the celeryd task fails
  • [2] the website hangs for 30 s.

recommendations

  • on indexing errors (cron or celeryd), we should keep the job somewhere so it can be replayed later if possible; see apps/addons/tasks.py:index_addons
  • shorter timeouts in the view and in the cron/task before it fails: 5 seconds seems better for the UI and maybe 10 seconds for the cron/task
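The replay idea above can be sketched as follows. This is a hypothetical illustration, not the actual apps/addons/tasks.py code: the `index` callable and the `failed_ids` list are made-up stand-ins for the real ElasticSearch call and whatever persistent store would hold the pending jobs.

```python
# Hypothetical sketch of "keep the job somewhere to replay it": failed ids
# are stashed so a later cron/celeryd run can retry them instead of the
# work being lost. `index` stands in for the real ElasticSearch call.
failed_ids = []

def index_addons(ids, index):
    """Index each addon id; stash failures for a later replay."""
    for addon_id in ids:
        try:
            index(addon_id)
        except Exception:
            failed_ids.append(addon_id)  # keep the job instead of losing it

def replay_failed(index):
    """Cron-style replay of previously failed ids."""
    pending, failed_ids[:] = list(failed_ids), []
    index_addons(pending, index)
```

In production the pending list would have to live somewhere durable (the database, or a queue), not in process memory.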

Membase

results

Failure Searching Browsing Adding content Review content Indexing Self-Healing
Slave Down OK OK OK OK OK OK
Master Down OK OK OK OK OK OK
Everything Down OK OK KO KO OK OK
Everything Hanged/Slowed OK OK KO KO OK OK

notes

  • I have seen huge chunks of data being cached (templates) - like > 1 MB IIRC. We should avoid this.
  • XXX (to check): should we protect every call to memcache and make sure the app state survives it?
  • why is membase mandatory for app submissions etc.?

recommendations

  • Is Membase the best place to cache templates? What about a disk cache?
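The "protect every call to memcache" question above can be sketched as a defensive wrapper. This is a hypothetical illustration with made-up names, not Marketplace code:

```python
# Hypothetical sketch: absorb cache-backend errors so a Membase outage
# degrades into cache misses instead of breaking the whole request.
def safe_cache_get(cache, key, default=None):
    """Read from the cache; treat any backend error as a cache miss."""
    try:
        return cache.get(key, default)
    except Exception:
        return default  # cache cluster down or hanging: carry on without it
```

The same guard would apply to set/delete; the open question is whether the writes done during app submission and review can safely be skipped too, or whether they carry state the app actually depends on.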

Redis

results

Failure Searching Browsing Adding content Review content Indexing Self-Healing
Slave Down OK OK OK OK OK OK
Master Down OK OK OK OK OK OK
Everything Down OK OK OK OK OK OK
Everything Hanged/Slowed OK OK OK OK OK OK

notes

  • Redis is going to be deprecated, so this is less relevant now
  • Django Cache Machine absorbs all errors in safe_redis()

MySQL

results

Failure Searching Browsing Adding content Review content Indexing Self-Healing
Master Down KO [0] KO [0] KO [0] KO [0] KO [0] OK
Slave Down OK OK OK OK OK OK
Everything Down KO [2] KO [2] KO [2] KO [2] KO [2] OK
Everything Hanged/Slowed OK OK KO [1] KO [1][3] KO [1] OK

notes

  • the master is a Single Point Of Failure: the whole site breaks when it is down
  • [0] raw "Internal Server Error" on the web app
  • [1] no timeouts in the webapp when mysql hangs
  • [2] nginx 504 and 502 on the front page
  • [3] nginx gateway timeout on /developers/submissions

recommendations

  • Is there a way to avoid the raw 504/502 errors? A templatized error screen on Zeus or Nginx?
  • we need a timeout in the marketplace app so we can display a cleaner error before nginx itself times out. Maybe a shorter timeout on reads.
  • can't the app work in a degraded mode when the master is down?
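One possible shape for the shorter app-side timeout is the database OPTIONS in the Django settings. This is a hedged sketch: the option names are MySQLdb's connect_timeout/read_timeout (read_timeout requires a reasonably recent driver), the database name is a placeholder, and the values are illustrative guesses, not measured ones.

```python
# Hypothetical Django settings fragment: fail fast in the app instead of
# letting nginx hit its own 502/504 timeouts first. Values are illustrative.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'marketplace',        # placeholder database name
        'OPTIONS': {
            'connect_timeout': 5,     # give up quickly if MySQL is unreachable
            'read_timeout': 10,       # bound hanging reads (recent MySQLdb only)
        },
    },
}
```

With a bound like this the app gets control back in time to render a clean error page rather than a raw gateway timeout.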

RabbitMQ

results

Failure Searching Browsing Adding content Review content Indexing Self-Healing
One Node Down OK OK KO KO KO KO
Everything Down TBD TBD TBD TBD TBD TBD
Everything Hanged/Slowed TBD TBD TBD TBD TBD TBD

notes

  • Shutting down one node breaks celery - for instance IOError: Socket closed on upload_manifest; the webhead does not properly fall back on another node
  • kombu raises errors and the task is lost (XXX verify persistence/replay)
  • We managed to get a locked last-inserted id on MySQL; the database was in a broken state afterwards
  • When the node comes back online we are still facing issues

recommendations

  • we should fail over to another RabbitMQ node when the node associated with the webhead is down
  • we need to investigate the MySQL lock
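The failover idea can be sketched generically. This is a hypothetical illustration of trying each broker in turn, not kombu's actual API; the function and parameter names are made up.

```python
# Hypothetical sketch: instead of pinning a webhead to a single RabbitMQ
# node, try each broker in turn and only fail if all of them are down.
def publish_with_failover(brokers, publish, message):
    """Try each broker until one accepts the message; re-raise if all fail."""
    last_error = None
    for broker in brokers:
        try:
            return publish(broker, message)
        except IOError as exc:        # e.g. "Socket closed" from kombu
            last_error = exc          # remember the error, try the next node
    raise last_error
```

Newer kombu/Celery releases can reportedly take a list of broker URLs and fail over between them, which might make a hand-rolled loop like this unnecessary; that is worth checking against the versions we deploy.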

Celery

results

Failure Searching Browsing Adding content Review content Indexing Self-Healing
Slave Down TBD TBD TBD TBD TBD TBD
Master Down TBD TBD TBD TBD TBD TBD
Everything Down TBD TBD TBD TBD TBD TBD
Everything Hanged/Slowed TBD TBD TBD TBD TBD TBD

notes

XXX

results

Failure Searching Browsing Adding content Review content Indexing Self-Healing
Slave Down TBD TBD TBD TBD TBD TBD
Master Down TBD TBD TBD TBD TBD TBD
Everything Down TBD TBD TBD TBD TBD TBD
Everything Hanged/Slowed TBD TBD TBD TBD TBD TBD

notes