Marketplace/HAResults


Revision as of 16:21, 19 December 2012

Preamble

There are 4 major issues right now in the Stage setup when we send load:

Results

The following table summarizes the availability of Marketplace depending on the state of each back end. We provide an HA grade for each back end based on the observed Marketplace behavior. When applicable, follow-up bugs are linked in the table for each back end.


Backend         HA Grade   Notes   Related Bugs
ElasticSearch   B          Notes
Membase         B          Notes   #819876
Redis           B          Notes
MySQL           E          Notes
RabbitMQ        C          Notes
Celery          TBD        Notes


HA Grades:

  • A: No interruption of service at all
  • B: Partial interruption of service when the whole cluster is taken down
  • C: Partial interruption of service when one part of the cluster is down
  • D: Full interruption of service when the whole cluster is taken down
  • E: Full interruption of service when one part of the cluster is taken down

ElasticSearch

results

Failure Searching Browsing Adding content Review content Indexing Self-Healing
Slave Down OK OK OK OK OK OK
Master Down OK OK OK OK OK OK
Everything Down KO [2] OK OK KO [1] OK OK
Everything Hanged/Slowed KO [2] OK OK KO [1] OK OK

notes

  • [1] the new apps are not indexed - the celeryd task fails
  • [2] the website hangs for 30 s.

recommendations

  • on indexing errors (cron or celeryd), we should keep the job somewhere so it can be replayed later if possible; see apps/addons/tasks.py:index_addons
  • shorter timeouts in the view and in the cron/task before it fails: 5 seconds seems better for the UI and maybe 10 seconds for the cron/task
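The replay idea above can be sketched as follows. This is a hypothetical illustration, not the actual apps/addons/tasks.py code: the `index` callable and the `failed_ids` list are made-up stand-ins for the real ElasticSearch call and whatever persistent store would hold the pending jobs.

```python
# Hypothetical sketch of "keep the job somewhere to replay it": failed ids
# are stashed so a later cron/celeryd run can retry them instead of the
# work being lost. `index` stands in for the real ElasticSearch call.
failed_ids = []

def index_addons(ids, index):
    """Index each addon id; stash failures for a later replay."""
    for addon_id in ids:
        try:
            index(addon_id)
        except Exception:
            failed_ids.append(addon_id)  # keep the job instead of losing it

def replay_failed(index):
    """Cron-style replay of previously failed ids."""
    pending, failed_ids[:] = list(failed_ids), []
    index_addons(pending, index)
```

In production the pending list would have to live somewhere durable (the database, or a queue), not in process memory.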

Membase

results

Failure Searching Browsing Adding content Review content Indexing Self-Healing
Slave Down OK OK OK OK OK OK
Master Down OK OK OK OK OK OK
Everything Down OK OK KO KO OK OK
Everything Hanged/Slowed OK OK KO KO OK OK

notes

  • I have seen huge chunks of data being cached (templates) - like > 1 MB IIRC. We should avoid this.
  • XXX (to check): should we protect every call to memcache and make sure the app state survives it?
  • why is membase mandatory for app submissions etc.?

recommendations

  • Is Membase the best place to cache templates? What about a disk cache?
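The "protect every call to memcache" question above can be sketched as a defensive wrapper. This is a hypothetical illustration with made-up names, not Marketplace code:

```python
# Hypothetical sketch: absorb cache-backend errors so a Membase outage
# degrades into cache misses instead of breaking the whole request.
def safe_cache_get(cache, key, default=None):
    """Read from the cache; treat any backend error as a cache miss."""
    try:
        return cache.get(key, default)
    except Exception:
        return default  # cache cluster down or hanging: carry on without it
```

The same guard would apply to set/delete; the open question is whether the writes done during app submission and review can safely be skipped too, or whether they carry state the app actually depends on.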

Redis

results

Failure Searching Browsing Adding content Review content Indexing Self-Healing
Slave Down OK OK OK OK OK OK
Master Down OK OK OK OK OK OK
Everything Down OK OK OK OK OK OK
Everything Hanged/Slowed OK OK OK OK OK OK

notes

  • Redis is going to be deprecated, so this is less relevant now
  • Django Cache Machine absorbs all errors in safe_redis()

MySQL

results

Failure Searching Browsing Adding content Review content Indexing Self-Healing
Master Down KO [0] KO [0] KO [0] KO [0] KO [0] OK
Slave Down OK OK OK OK OK OK
Everything Down KO [2] KO [2] KO [2] KO [2] KO [2] OK
Everything Hanged/Slowed OK OK KO [1] KO [1][3] KO [1] OK

notes

  • the master is a Single Point Of Failure: the whole site breaks when it is down
  • [0] raw "Internal Server Error" on the web app
  • [1] no timeouts in the webapp when mysql hangs
  • [2] nginx 504 and 502 on the front page
  • [3] nginx gateway timeout on /developers/submissions

recommendations

  • Is there a way to avoid the raw 504/502 errors? A templatized error screen on Zeus or Nginx?
  • we need a timeout in the marketplace app so we can display a cleaner error before nginx itself times out. Maybe a shorter timeout on reads.
  • can't the app work in a degraded mode when the master is down?
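One possible shape for the shorter app-side timeout is the database OPTIONS in the Django settings. This is a hedged sketch: the option names are MySQLdb's connect_timeout/read_timeout (read_timeout requires a reasonably recent driver), the database name is a placeholder, and the values are illustrative guesses, not measured ones.

```python
# Hypothetical Django settings fragment: fail fast in the app instead of
# letting nginx hit its own 502/504 timeouts first. Values are illustrative.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'marketplace',        # placeholder database name
        'OPTIONS': {
            'connect_timeout': 5,     # give up quickly if MySQL is unreachable
            'read_timeout': 10,       # bound hanging reads (recent MySQLdb only)
        },
    },
}
```

With a bound like this the app gets control back in time to render a clean error page rather than a raw gateway timeout.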

RabbitMQ

results

Failure Searching Browsing Adding content Review content Indexing Self-Healing
One Node Down OK OK KO KO KO KO
Everything Down TBD TBD TBD TBD TBD TBD
Everything Hanged/Slowed TBD TBD TBD TBD TBD TBD

notes

  • Shutting down one node breaks celery - for instance IOError: Socket closed on upload_manifest; the webhead does not properly fall back on another node
  • kombu raises errors and the task is lost (XXX verify persistence/replay)
  • We managed to get a locked last-inserted id on MySQL; the database was in a broken state afterwards
  • When the node comes back online we are still facing issues

recommendations

  • we should fail over to another RabbitMQ node when the node associated with the webhead is down
  • we need to investigate the MySQL lock
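The failover idea can be sketched generically. This is a hypothetical illustration of trying each broker in turn, not kombu's actual API; the function and parameter names are made up.

```python
# Hypothetical sketch: instead of pinning a webhead to a single RabbitMQ
# node, try each broker in turn and only fail if all of them are down.
def publish_with_failover(brokers, publish, message):
    """Try each broker until one accepts the message; re-raise if all fail."""
    last_error = None
    for broker in brokers:
        try:
            return publish(broker, message)
        except IOError as exc:        # e.g. "Socket closed" from kombu
            last_error = exc          # remember the error, try the next node
    raise last_error
```

Newer kombu/Celery releases can reportedly take a list of broker URLs and fail over between them, which might make a hand-rolled loop like this unnecessary; that is worth checking against the versions we deploy.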

Celery

results

Failure Searching Browsing Adding content Review content Indexing Self-Healing
Slave Down TBD TBD TBD TBD TBD TBD
Master Down TBD TBD TBD TBD TBD TBD
Everything Down TBD TBD TBD TBD TBD TBD
Everything Hanged/Slowed TBD TBD TBD TBD TBD TBD

notes

XXX

results

Failure Searching Browsing Adding content Review content Indexing Self-Healing
Slave Down TBD TBD TBD TBD TBD TBD
Master Down TBD TBD TBD TBD TBD TBD
Everything Down TBD TBD TBD TBD TBD TBD
Everything Hanged/Slowed TBD TBD TBD TBD TBD TBD

notes