Marketplace/HAResults: Difference between revisions
Tarek.ziade (talk | contribs) (→Redis) |
Revision as of 16:21, 19 December 2012
Preamble
There are 4 major issues right now in the Stage setup when we are sending load:
- The database sometimes gets lock timeouts: https://bugzilla.mozilla.org/show_bug.cgi?id=823054
- Elastic Search sometimes times out when the load exceeds 500 RPS, for up to 2% of the requests. This results in 504s on Nginx. See the ElasticSearch section of this document.
- Numerous "DoesNotExist: Addon matching query does not exist." errors - even on low load - https://bugzilla.mozilla.org/show_bug.cgi?id=821375
- On some webheads, Marketplace complains that the "GeoIP server" is not installed.
Results
The following table summarizes the availability of Marketplace depending on the state of each back end. We give each back end an HA grade based on how Marketplace behaves. When applicable, follow-up bugs are linked in the table for each back end.
| Backend | HA Grade | Notes | Related Bugs |
| Elastic Search | B | Notes | |
| Membase | B | Notes | #819876 |
| Redis | B | Notes | |
| MySQL | E | Notes | |
| RabbitMQ | C | Notes | |
| Celery | TBD | Notes | |
HA Grades:
- A: No interruption of service at all
- B: Partial interruption of service when the whole cluster is taken down
- C: Partial interruption of service when one part of the cluster is down
- D: Full interruption of service when the whole cluster is taken down
- E: Full interruption of service when one part of the cluster is taken down
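The grading scale above can be read as a small decision rule. The sketch below is only an illustration: the function name `ha_grade`, the impact labels ("none", "partial", "full"), and the rule that the worst applicable grade wins are assumptions, not something defined on this page.

```python
def ha_grade(one_part_down, whole_cluster_down):
    """Return an HA grade (A-E) from two observations.

    Each argument is "none", "partial" or "full": the interruption of
    service seen when one part of the cluster is down, and when the
    whole cluster is down. Assumption: the worst matching grade wins.
    """
    if one_part_down == "full":
        return "E"  # full interruption when one part is taken down
    if whole_cluster_down == "full":
        return "D"  # full interruption when the whole cluster is down
    if one_part_down == "partial":
        return "C"  # partial interruption when one part is down
    if whole_cluster_down == "partial":
        return "B"  # partial interruption when the whole cluster is down
    return "A"      # no interruption at all
```

For instance, a back end whose master failure takes the whole site down would be graded with `ha_grade("full", "full")`.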
ElasticSearch
results
| Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
| Slave Down | OK | OK | OK | OK | OK | OK |
| Master Down | OK | OK | OK | OK | OK | OK |
| Everything Down | KO [2] | OK | OK | KO [1] | OK | OK |
| Everything Hanged/Slowed | KO [2] | OK | OK | KO [1] | OK | OK |
notes
- [1] the new apps are not indexed - the celeryd task fails
- [2] the website hangs for 30 s.
recommendations
- On indexing errors (cron or celeryd), we should try to keep the job somewhere so it can be replayed if possible. See apps/addons/tasks.py:index_addons
- Use shorter timeouts in the view and in the cron/task so they fail earlier. 5 seconds seems better for the UI, and maybe 10 seconds for the cron/task?
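The two points above (keep failed indexing jobs for replay, and pass an explicit timeout) could be sketched as follows. This is a minimal illustration and not the code in apps/addons/tasks.py: `index_with_replay`, `replay_failed`, the JSON-lines replay file, and the `index_func(ids, timeout=...)` signature are all hypothetical.

```python
import json
import os

TASK_TIMEOUT = 10  # seconds, the value suggested above for the cron/task


def index_with_replay(index_func, addon_ids, replay_path, timeout=TASK_TIMEOUT):
    """Call index_func; if it fails, append the job to a replay file."""
    try:
        # Assumption: index_func forwards `timeout` to the search client.
        index_func(addon_ids, timeout=timeout)
        return True
    except Exception as exc:
        record = {"ids": list(addon_ids), "error": repr(exc)}
        with open(replay_path, "a") as fh:
            fh.write(json.dumps(record) + "\n")
        return False


def replay_failed(index_func, replay_path, timeout=TASK_TIMEOUT):
    """Retry every recorded job; jobs that fail again are recorded again."""
    if not os.path.exists(replay_path):
        return 0
    with open(replay_path) as fh:
        jobs = [json.loads(line) for line in fh if line.strip()]
    os.remove(replay_path)
    replayed = 0
    for job in jobs:
        if index_with_replay(index_func, job["ids"], replay_path, timeout):
            replayed += 1
    return replayed
```

A cron job could call `replay_failed` once the search cluster is back, so that new apps submitted during the outage still end up indexed.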
Membase
results
| Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
| Slave Down | OK | OK | OK | OK | OK | OK |
| Master Down | OK | OK | OK | OK | OK | OK |
| Everything Down | OK | OK | KO | KO | OK | OK |
| Everything Hanged/Slowed | OK | OK | KO | KO | OK | OK |
notes
- I have seen huge chunks of data being cached (templates) - more than 1 MB, IIRC. We should avoid this.
- XXX (to check) should we protect every call to memcache and make sure the app state survives it?
- Why is Membase mandatory for app submissions, etc.?
recommendations
- Is Membase the best place to cache templates? What about a disk cache?
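One way to explore the "protect every call to memcache" question from the notes is a thin wrapper that treats any cache failure as a miss and refuses oversized values. This is a sketch under assumptions: `SafeCache`, its 1 MB limit, and the `get`/`set` backend interface are hypothetical and do not describe the Marketplace code.

```python
import logging

log = logging.getLogger("safecache")
MAX_VALUE_BYTES = 1024 * 1024  # assumption: skip values larger than 1 MB


class SafeCache:
    """Wrap a cache backend so that failures never reach the caller."""

    def __init__(self, backend, max_bytes=MAX_VALUE_BYTES):
        self.backend = backend
        self.max_bytes = max_bytes

    def get(self, key, default=None):
        try:
            value = self.backend.get(key)
        except Exception:
            log.warning("cache get failed for %r", key, exc_info=True)
            return default  # behave like a cache miss
        return default if value is None else value

    def set(self, key, value):
        # Only str and bytes values are size-checked in this sketch.
        if isinstance(value, str):
            size = len(value.encode("utf-8"))
        elif isinstance(value, bytes):
            size = len(value)
        else:
            size = 0
        if size > self.max_bytes:
            log.warning("not caching %r: %d bytes", key, size)
            return False
        try:
            self.backend.set(key, value)
            return True
        except Exception:
            log.warning("cache set failed for %r", key, exc_info=True)
            return False
```

With such a wrapper, a page that renders from the cache would fall back to rendering from scratch when the cluster is down, at the cost of extra load on the other back ends.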
Redis
results
| Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
| Slave Down | OK | OK | OK | OK | OK | OK |
| Master Down | OK | OK | OK | OK | OK | OK |
| Everything Down | OK | OK | OK | OK | OK | OK |
| Everything Hanged/Slowed | OK | OK | OK | OK | OK | OK |
notes
- Redis is going to be deprecated, so this is less relevant now, I guess
- Django Cache Machine absorbs all errors in safe_redis()
MySQL
results
| Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
| Master Down | KO [0] | KO [0] | KO [0] | KO [0] | KO [0] | OK |
| Slave Down | OK | OK | OK | OK | OK | OK |
| Everything Down | KO [2] | KO [2] | KO [2] | KO [2] | KO [2] | OK |
| Everything Hanged/Slowed | OK | OK | KO [1] | KO [1][3] | KO [1] | OK |
notes
- Single Point Of Failure when the master is down
- [0] raw "Internal Server Error" on the web app
- [1] no timeouts in the webapp when mysql hangs
- [2] nginx 504 and 502 on the front page
- [3] nginx gateway timeout on /developers/submissions
recommendations
- Is there a way to avoid raw 504/502 errors? A templatized screen on Zeus or Nginx?
- We need a timeout in the Marketplace app so we can display a cleaner error before Nginx itself times out. Maybe a shorter timeout on reads.
- Can't the app work in degraded mode when the master is down?
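The "degraded mode" question above could look roughly like this: reads fall back to a slave when the master is unreachable, and writes are turned into a clean 503 instead of a raw error. Everything here is hypothetical (`DegradedRouter`, `ReadOnlyMode`, `handle_write`, and connections modelled as plain callables); how Marketplace actually routes queries is not described on this page.

```python
class ReadOnlyMode(Exception):
    """Raised when a write is attempted while the master is down."""


class DegradedRouter:
    def __init__(self, master, slaves):
        # master and slaves are callables taking a query string (assumption).
        self.master = master
        self.slaves = list(slaves)

    def read(self, query):
        """Try the master first, then each slave in order."""
        last_error = None
        for db in [self.master] + self.slaves:
            try:
                return db(query)
            except ConnectionError as exc:
                last_error = exc
        raise ConnectionError("no database reachable") from last_error

    def write(self, query):
        """Writes need the master; fail cleanly when it is down."""
        try:
            return self.master(query)
        except ConnectionError as exc:
            raise ReadOnlyMode("master is down") from exc


def handle_write(router, query):
    """Return (status, body), with a readable 503 in read-only mode."""
    try:
        return 200, router.write(query)
    except ReadOnlyMode:
        return 503, "Marketplace is temporarily read-only. Please retry later."
```

Browsing and searching would keep working from the slaves in this model, while submissions and reviews would show an explicit message rather than a raw "Internal Server Error".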
RabbitMQ
results
| Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
| One Node Down | OK | OK | KO | KO | KO | KO |
| Everything Down | TBD | TBD | TBD | TBD | TBD | TBD |
| Everything Hanged/Slowed | TBD | TBD | TBD | TBD | TBD | TBD |
notes
- Shutting down one node breaks celery - for instance, IOError: Socket closed on upload_manifest. The webhead does not properly fall back to another node
- kombu raises errors, the task is lost (XXX verify persistence/replay)
- We managed to get a locked last inserted id on MySQL; the database was in a broken state afterwards
- When the node gets back online we're still facing issues
recommendations
- we should fail over to another RabbitMQ node if the node associated with the webhead is down
- we need to investigate the lock
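The failover recommendation above can be sketched in a few lines. This is a generic illustration, not kombu or celery configuration: `connect_with_failover`, the `connect` callable, and the broker URLs are hypothetical names introduced for the example.

```python
def connect_with_failover(broker_urls, connect, preferred=0):
    """Try the webhead's preferred broker first, then the other nodes.

    `connect` is any callable that takes a broker URL and returns a
    connection, raising OSError (for example "Socket closed") on failure.
    Returns the (url, connection) pair of the first node that answers.
    """
    order = broker_urls[preferred:] + broker_urls[:preferred]
    errors = []
    for url in order:
        try:
            return url, connect(url)
        except OSError as exc:
            errors.append((url, repr(exc)))
    raise ConnectionError("no broker reachable: %r" % errors)
```

Failover only covers new connections; whether a task that was in flight on the dead node is lost still depends on the persistence question raised in the notes.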
Celery
results
| Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
| Slave Down | TBD | TBD | TBD | TBD | TBD | TBD |
| Master Down | TBD | TBD | TBD | TBD | TBD | TBD |
| Everything Down | TBD | TBD | TBD | TBD | TBD | TBD |
| Everything Hanged/Slowed | TBD | TBD | TBD | TBD | TBD | TBD |
notes
XXX
results
| Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
| Slave Down | TBD | TBD | TBD | TBD | TBD | TBD |
| Master Down | TBD | TBD | TBD | TBD | TBD | TBD |
| Everything Down | TBD | TBD | TBD | TBD | TBD | TBD |
| Everything Hanged/Slowed | TBD | TBD | TBD | TBD | TBD | TBD |
notes