Socorro/Pre-PHX Smoketest Schedule
{{bug|619817}}
= test plan =
* What we are going to test and how in terms of load
** what:
*** at what point do collectors fall over?
**** start with 10 test nodes at 24k crashes each, versus all socorro collectors
***** ramp up to 40 test nodes
***** if it can handle full traffic, take collectors out until it fails
***** back down nodes/crashes until we find a stable place
***** check ganglia to see where our bottlenecks are
**** test both direct-to-disk and direct-to-hbase crash storage systems
*** crashes are collected without error
*** all submitted crashes are collected and processed
**** check apache and syslog logs for the collector; FIXED <strike>(syslog not reliable)</strike>
**** check processor and collector logs for errors
**** confirm that all crashes are stored in hbase
** how:
*** <strike>grinder ({{bug|619815}}) + 20 VMs ({{bug|619814}})</strike>
*** Lars added stats and iteration to submitter.py for initial smoke-test {{bug|622311}}
*** 40 seamicro nodes standing by to test, using [https://bug619814.bugzilla.mozilla.org/attachment.cgi?id=503222 socorro-loadtest.sh] (see the submission-loop sketch after this list)
*** pool of 240k crashes, taken over 10 days from MPT prod (Jan 1st through 10th)
** when:
*** waiting on deps in tracking {{bug|619811}}
*** tentative start date - Wednesday Jan 12 2011
**** minimum 2-3 days of testing; as much as we can get
* what component failure tests we will run
** disable entire components for 20 min to test system recovery
*** hbase
*** postgresql
*** monitor
*** all processors
** disable individual nodes to test the ability of the other nodes to cope and at what point they get overloaded
*** one, two, and three collectors
*** one to five processors
** postgresql failover test
*** failover master01->master02
*** will require manual failover of all components
** hbase specific failure tests
*** kill:
**** random region server
**** important region server
**** random thrift server
***** whoever is connected should get a blip
**** hbase master
**** name node
**** force drbd failover (coordinate with IT)
***** specifically, this means killing the hardware for the primary set of admin nodes
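For reference, a minimal sketch of what each load node's submission loop does when socorro-loadtest.sh drives submitter.py. The collector URL, crash-pool path, and use of the requests library are illustrative assumptions; the real submitter iterates over the pool with pycurl and keeps its own stats.
<pre>
#!/usr/bin/env python
"""Illustrative sketch only -- not submitter.py itself.

Assumptions: crashes sit on disk as <uuid>.json / <uuid>.dump pairs, the
collector accepts multipart POSTs at /submit, and the 'requests' library
is available (the real submitter drives pycurl).
"""
import glob
import json
import os

import requests

COLLECTOR_URL = "https://collector.example.com/submit"  # hypothetical URL
CRASH_POOL = "/data/crash-pool"                          # hypothetical path
ITERATIONS = 3                                           # replay the pool N times


def submit_pool():
    ok = errors = 0
    for _ in range(ITERATIONS):
        for meta_path in sorted(glob.glob(os.path.join(CRASH_POOL, "*.json"))):
            dump_path = meta_path[:-len(".json")] + ".dump"
            with open(meta_path) as meta_file:
                metadata = json.load(meta_file)
            # stringify values for the form post (metadata may contain ints/bools)
            fields = dict((key, u"%s" % value) for key, value in metadata.items())
            with open(dump_path, "rb") as dump_file:
                try:
                    response = requests.post(
                        COLLECTOR_URL,
                        data=fields,
                        files={"upload_file_minidump": dump_file},
                        timeout=30,
                    )
                    response.raise_for_status()
                    ok += 1
                except requests.RequestException as exc:
                    errors += 1
                    print("submit failed for %s: %s" % (meta_path, exc))
    print("submitted ok=%d errors=%d" % (ok, errors))


if __name__ == "__main__":
    submit_pool()
</pre>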
= known problems =
== unicode-in-metadata problem ==
* crash submitter fails to insert JSON with unicode in it
** "UnicodeEncodeError: 'ascii' codec can't encode character u'\u0142' in position 378: ordinal not in range(128)"
** ~1.8% of crashes in test data, reproducible
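A minimal reproduction of that failure mode and the usual fix (encode to UTF-8 explicitly before the value is handed to pycurl). How submitter.py actually builds its form fields is an assumption here, not confirmed.
<pre>
# -*- coding: utf-8 -*-
# Python 2 sketch: assumes the submitter str()-coerces unicode metadata
# values while building the POST body, which triggers an implicit ascii
# encode for non-ascii characters such as u'\u0142'.
value = u"Pawe\u0142"  # hypothetical metadata value from the test pool

try:
    str(value)  # implicit ascii encode -> UnicodeEncodeError
except UnicodeEncodeError as exc:
    print("reproduced: %s" % exc)

encoded = value.encode("utf-8")  # fix: encode explicitly before submission
print(repr(encoded))
</pre>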
== unreliable syslog ==
* FIXED <strike>cannot get 100% reliable collector logs, syslog drops packets {{bug|623410}}
** need to keep this in mind when running direct-to-hbase insert mode</strike>
== intermittent pycurl SSL errors ==
* "ERROR (77, 'Problem with the SSL CA cert (path? access rights?)')"
** prevents crash submission
** rare: ~0.5% of crash submissions, not reproducible
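Since the error is rare and not reproducible, one workaround is to pin the CA bundle explicitly and retry the submission when curl reports error 77. The bundle path and retry policy below are assumptions, not what socorro-loadtest.sh or submitter.py actually do.
<pre>
# Python 2 era sketch (pycurl + StringIO), not taken from submitter.py.
import pycurl
from StringIO import StringIO

CA_BUNDLE = "/etc/pki/tls/certs/ca-bundle.crt"  # assumed RHEL/CentOS default path


def post_with_retry(url, post_fields, retries=3):
    """POST via pycurl, retrying when curl reports error 77 (CA cert problem)."""
    for attempt in range(1, retries + 1):
        body = StringIO()
        curl = pycurl.Curl()
        curl.setopt(pycurl.URL, url)
        curl.setopt(pycurl.HTTPPOST, post_fields)
        curl.setopt(pycurl.CAINFO, CA_BUNDLE)        # pin the CA bundle explicitly
        curl.setopt(pycurl.WRITEFUNCTION, body.write)
        try:
            curl.perform()
            return body.getvalue()
        except pycurl.error as exc:
            if exc.args[0] == 77 and attempt < retries:  # CURLE_SSL_CACERT_BADFILE
                continue                                  # treat as transient, retry
            raise
        finally:
            curl.close()
</pre>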
== pycurl random crashes ==
<pre>
*** longjmp causes uninitialized stack frame ***: python terminated
======= Backtrace: =========
/lib/libc.so.6(__fortify_fail+0x4d)[0x274fed]
/lib/libc.so.6[0x274f5a]
/lib/libc.so.6(__longjmp_chk+0x49)[0x274ec9]
/usr/lib/libcurl.so.4[0x5874b99]
[0x641400]
[0x641424]
</pre>
* this stops the run on one node
* hacked around in socorro-loadtest.sh for the "forever" infinite-loop case
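The workaround is essentially a restart-on-death loop around the submitter process, so a hard abort inside libcurl does not end a node's run. A Python equivalent of that idea (the submitter.py invocation shown is a placeholder):
<pre>
# Restart-on-death supervisor sketch; the real workaround lives in
# socorro-loadtest.sh, and the submitter.py command line here is a placeholder.
import subprocess
import time

SUBMITTER_CMD = ["python", "submitter.py"]  # hypothetical invocation


def run_forever():
    """Re-launch the submitter whenever the interpreter is killed outright,
    e.g. by the longjmp/__fortify_fail abort inside libcurl."""
    while True:
        returncode = subprocess.call(SUBMITTER_CMD)
        print("submitter exited with rc=%d, restarting" % returncode)
        time.sleep(1)  # brief pause so a persistent failure cannot spin


if __name__ == "__main__":
    run_forever()
</pre>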
= notes =
* had to restart processors and then monitor to get them all connected
** make sure to check http://crash-stats.mozilla.com/status