Socorro/Pre-PHX Smoketest Schedule: Difference between revisions
Revision as of 07:47, 16 January 2011
bug 619817
- Status
  - blocked on bug 625853, "HBase insert throughput is too slow (PHX)"
- What we are going to test and how in terms of load
  - what:
    - at what point do collectors fall over?
      - start with 10 test nodes at 24k crashes each, versus all socorro collectors
      - ramp up to 40 test nodes
      - if it can handle full traffic, take collectors out until it fails
      - back down nodes/crashes until we find a stable place
      - check ganglia to see where our bottlenecks are
      - test both direct-to-disk and direct-to-hbase crash storage systems
    - crashes are collected without error
    - all submitted crashes are collected and processed
      - check apache logs for collector (syslog not reliable; see the log-check sketch after this section)
      - check processor and collector logs for errors
      - confirm that all crashes are stored in hbase
  - how:
    - grinder (bug 619815) + 20 VMs (bug 619814)
    - Lars added stats and iteration to submitter.py for initial smoke-test, bug 622311
    - 40 seamicro nodes standing by to test, using socorro-loadtest.sh (see the submission-driver sketch after this section)
    - pool of 240k crashes, taken over 10 days from MPT prod (Jan 1st through 10th)
  - when:
    - waiting on deps in tracking bug 619811
    - tentative start date: Wednesday, Jan 12, 2011
    - minimum 2-3 days of testing; as much as we can get
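To make the "how" above concrete, here is a minimal sketch of the kind of per-node submission loop a wrapper like socorro-loadtest.sh might drive. It is illustrative only, not the real submitter.py: the paired <id>.json metadata and <id>.dump minidump file layout, the /submit endpoint, the example collector URL, and the script name are assumptions; only the upload_file_minidump form field follows the usual Breakpad convention.

<pre>
#!/usr/bin/env python3
"""Illustrative per-node crash submission loop (not the real submitter.py).

Assumes the crash pool is a directory of paired files: <id>.json holding the
crash metadata key/value pairs and <id>.dump holding the raw minidump.
"""
import json
import os
import sys
import urllib.request
import uuid


def multipart_body(fields, dump_bytes):
    """Encode the metadata fields plus one minidump as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            ('--%s\r\nContent-Disposition: form-data; name="%s"\r\n\r\n%s\r\n'
             % (boundary, name, value)).encode("utf-8"))
    parts.append(
        ('--%s\r\nContent-Disposition: form-data; name="upload_file_minidump"; '
         'filename="upload_file_minidump"\r\n'
         'Content-Type: application/octet-stream\r\n\r\n' % boundary).encode("utf-8"))
    parts.append(dump_bytes)
    parts.append(("\r\n--%s--\r\n" % boundary).encode("utf-8"))
    return b"".join(parts), "multipart/form-data; boundary=%s" % boundary


def submit_pool(collector_url, pool_dir):
    """POST every crash in pool_dir to the collector and tally the results."""
    ok = failed = 0
    for name in sorted(os.listdir(pool_dir)):
        if not name.endswith(".json"):
            continue
        base = os.path.join(pool_dir, name[:-len(".json")])
        with open(base + ".json", encoding="utf-8") as f:
            fields = json.load(f)
        with open(base + ".dump", "rb") as f:
            dump = f.read()
        body, content_type = multipart_body(fields, dump)
        request = urllib.request.Request(
            collector_url, data=body, headers={"Content-Type": content_type})
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                if response.getcode() == 200:
                    ok += 1
                else:
                    failed += 1
        except Exception as exc:
            failed += 1
            print("submission failed for %s: %s" % (base, exc), file=sys.stderr)
    print("submitted: %d ok, %d failed" % (ok, failed))


if __name__ == "__main__":
    # e.g. python3 load_driver.py http://collector.phx.example/submit ./crash_pool
    submit_pool(sys.argv[1], sys.argv[2])
</pre>

Running several of these in parallel per node, across 10 and then 40 nodes, approximates the ramp described in the "what" items.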
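Because syslog drops packets (see known problems), "crashes are collected without error" is checked against the collectors' apache access logs rather than syslog. A minimal log-check sketch follows; it assumes common/combined log format and that submissions appear as POSTs to /submit, with the log path passed on the command line as a placeholder.

<pre>
#!/usr/bin/env python3
"""Tally collector responses to crash submissions from an apache access log.

Assumes common/combined log format and that submissions are POSTs to /submit;
adjust the pattern and path for the real collector configuration.
"""
import collections
import re
import sys

# Matches e.g.: ... "POST /submit HTTP/1.1" 200 64
REQUEST_RE = re.compile(r'"POST /submit[^"]*" (\d{3}) ')


def tally(log_path):
    counts = collections.Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = REQUEST_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    status_counts = tally(sys.argv[1])   # e.g. /var/log/httpd/access_log
    for status, count in sorted(status_counts.items()):
        print("HTTP %s: %d" % (status, count))
    print("total submissions seen: %d" % sum(status_counts.values()))
</pre>

The totals can then be compared with what the load nodes reported submitting and with a row count taken from hbase, to confirm that every submitted crash was collected, processed, and stored.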
- what component failure tests we will run
  - disable entire components for 20 min to test system recovery (see the outage sketch after this section)
    - hbase
    - postgresql
    - monitor
    - all processors
  - disable individual nodes to test the ability of the other nodes to cope, and at what point they get overloaded
    - one, two, and three collectors
    - one to five processors
  - postgresql failover test
    - failover master01->master02
    - will require manual failover of all components
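For the 20-minute outage tests, a timestamped wrapper makes it easier to read recovery time out of the other components' logs afterwards. This is a minimal sketch only: the stop/start commands are placeholders to be replaced with whatever actually controls each component (hbase, postgresql, monitor, processors) on the host in question.

<pre>
#!/usr/bin/env python3
"""Take one component down for a fixed window, then bring it back, with
timestamps so recovery can be correlated against the other components' logs.

The stop/start commands are placeholders, not the real service names.
"""
import datetime
import subprocess
import sys
import time

OUTAGE_SECONDS = 20 * 60  # the 20-minute window from the test plan


def log(message):
    print("%s %s" % (datetime.datetime.now().isoformat(), message), flush=True)


def outage(stop_cmd, start_cmd):
    log("stopping component: %s" % " ".join(stop_cmd))
    subprocess.check_call(stop_cmd)
    log("component down; sleeping %d seconds" % OUTAGE_SECONDS)
    time.sleep(OUTAGE_SECONDS)
    log("restarting component: %s" % " ".join(start_cmd))
    subprocess.check_call(start_cmd)
    log("component restarted; watch the other components for recovery")


if __name__ == "__main__":
    # e.g. python3 outage.py "sudo service postgresql stop" "sudo service postgresql start"
    outage(sys.argv[1].split(), sys.argv[2].split())
</pre>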
- known problems
  - crash submitter fails to insert JSON with unicode in it (see the example after this list)
    - "UnicodeEncodeError: 'ascii' codec can't encode character u'\u0142' in position 378: ordinal not in range(128)"
    - ~1.8% of crashes in test data, reproducible
  - cannot get 100% reliable collector logs, syslog drops packets (bug 623410)
    - need to keep this in mind when running direct-to-hbase insert mode
  - rare, intermittent pycurl SSL errors
    - "ERROR (77, 'Problem with the SSL CA cert (path? access rights?)')"
    - ~0.5% of crashes, not reproducible
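The UnicodeEncodeError above is the usual failure mode of handing a unicode string containing a non-ASCII character (here U+0142, "ł") to something that implicitly encodes with the ascii codec. A minimal reproduction and two common workarounds, written in Python 3 syntax for clarity (the submitter itself is Python 2 era); the "Comments" field name and the sample value are illustrative, not taken from the test data.

<pre>
"""Reproduce and work around the submitter's UnicodeEncodeError.

Encoding a string containing U+0142 with the ascii codec raises exactly the
kind of error seen in the submitter logs.
"""
import json

metadata = {"Comments": "contains \u0142 (LATIN SMALL LETTER L WITH STROKE)"}

# Reproduction: the ascii codec cannot represent U+0142.
try:
    metadata["Comments"].encode("ascii")
except UnicodeEncodeError as exc:
    print("reproduced:", exc)

# Workaround 1: encode explicitly as UTF-8 before handing bytes to HBase/HTTP.
utf8_bytes = metadata["Comments"].encode("utf-8")
print("utf-8 bytes:", utf8_bytes)

# Workaround 2: keep the serialized JSON ASCII-only; with json's default
# ensure_ascii=True, non-ASCII characters come out as \uXXXX escapes.
ascii_safe_json = json.dumps(metadata)
print("ascii-safe JSON:", ascii_safe_json)
</pre>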