Socorro/Pre-PHX Smoketest Schedule

{{bug|619817}}


= test plan =
* What we are going to test and how in terms of load
** what:
*** at what point do collectors fall over?
**** start with 10 test nodes at 24k crashes each, versus all socorro collectors
***** ramp up to 40 test nodes
***** if it can handle full traffic, take collectors out until it fails
***** back down nodes/crashes until we find a stable place
***** check ganglia to see where our bottlenecks are
**** test both direct-to-disk and direct-to-hbase crash storage systems
*** crashes are collected without error
*** all submitted crashes are collected and processed
**** check apache and syslog logs for each collector; FIXED <strike>(syslog not reliable)</strike>
**** check processor and collector logs for errors
**** confirm that all crashes are stored in hbase
** how:  
*** <strike>grinder ({{bug|619815}}) + 20 VMs ({{bug|619814}})</strike>
*** Lars added stats and iteration to submitter.py for initial smoke-test {{bug|622311}}
*** 40 seamicro nodes standing by to test, using [https://bug619814.bugzilla.mozilla.org/attachment.cgi?id=503222 socorro-loadtest.sh] (a driver sketch follows this test plan)
*** pool of 240k crashes, taken over 10 days from MPT prod (Jan 1st through 10th)
** when:
*** waiting on deps in tracking {{bug|619811}}
*** tentative start date: Wednesday, Jan 12, 2011
**** minimum 2-3 days of testing; as much as we can get
* what component failure tests we will run
** disable entire components for 20min to test system recovery
*** hbase
*** postgresql
*** monitor
*** all processors
** disable individual nodes to test the ability of the other nodes to cope and at what point they get overloaded
*** one, two, and three collectors
*** one to five processors
** postgresql failover test
*** failover master01->master02
*** will require manual failover of all components
** hbase specific failure tests
*** kill:
**** random region server
**** important region server
**** random thrift server
***** whoever is connected should get a blip
**** hbase master
**** name node
**** force drbd failover (coordinate with IT)
***** specifically this means killing the hardware for the primary set of admin nodes
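What each seamicro node runs is, in outline, a loop over the crash pool. The Python sketch below is a hedged stand-in for socorro-loadtest.sh, not the script itself; the submitter.py flags, the collector URL, and the crash-pool path are illustrative assumptions.

<pre>
# Hedged sketch of a per-node load driver (NOT the real socorro-loadtest.sh).
# Assumed for illustration only: submitter.py accepts -u COLLECTOR_URL and
# -j CRASH_POOL_DIR, and exits non-zero on a fatal error.
import subprocess
import sys
import time

COLLECTOR_URL = "https://collector.example.com/submit"   # hypothetical URL
CRASH_POOL_DIR = "/data/crash-pool"                       # hypothetical path
ITERATIONS = 10                                           # passes over the 24k-crash pool

def run_once(iteration):
    """Run one pass over the crash pool and report how long it took."""
    start = time.time()
    rc = subprocess.call(
        [sys.executable, "submitter.py", "-u", COLLECTOR_URL, "-j", CRASH_POOL_DIR])
    print("iteration %d: rc=%d in %.1fs" % (iteration, rc, time.time() - start))
    return rc == 0

failures = 0
for i in range(ITERATIONS):
    if not run_once(i):
        failures += 1   # keep going; pycurl can die mid-run (see known problems)
print("done: %d/%d iterations failed" % (failures, ITERATIONS))
</pre>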
 
= known problems =
== unicode-in-metadata problem ==
* crash submitter fails to insert JSON with unicode in it
** <code>UnicodeEncodeError: 'ascii' codec can't encode character u'\u0142' in position 378: ordinal not in range(128)</code>
** ~1.8% of crashes in test data, reproducible
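A minimal sketch of the failure, plus one possible workaround (encoding metadata values to UTF-8 before the POST body is built); the workaround is an assumption, not what submitter.py actually does.

<pre>
# Minimal sketch of the unicode-in-metadata failure; not submitter.py code.
# A metadata value containing u'\u0142' blows up when it is coerced with the
# default 'ascii' codec while the POST body is assembled.
comment = u"Zg\u0142oszenie"              # hypothetical crash metadata value

try:
    field = comment.encode("ascii")       # what the implicit coercion amounts to
except UnicodeEncodeError as exc:
    print("submission fails: %s" % exc)   # matches the error seen in testing

# Possible workaround (an assumption, not the fix that landed): encode all
# metadata values explicitly as UTF-8 before handing the body to pycurl.
field = comment.encode("utf-8")
print("utf-8 field: %r" % field)
</pre>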
== unreliable syslog ==
* FIXED <strike>cannot get 100% reliable collector logs, syslog drops packets {{bug|623410}}</strike>
** <strike>need to keep this in mind when running direct-to-hbase insert mode</strike>
* rare, intermittent pycurl SSL errors (see below)
 
== intermittent pycurl SSL errors ==
* "ERROR (77, 'Problem with the SSL CA cert (path? access rights?)')"
** prevents crash submission
** ~0.5% of crash submissions, not reproducible
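Error 77 is libcurl's CURLE_SSL_CACERT_BADFILE. Since it is rare and not reproducible, one plausible mitigation for a long load run is a small retry wrapper, sketched below; the retry policy and the setup callback are assumptions, not something this test is known to have used.

<pre>
# Hedged sketch of a retry wrapper for the intermittent pycurl error 77;
# not code from this test.
import time
import pycurl

def perform_with_retry(setup, attempts=3, delay=1.0):
    """Retry curl.perform() when it fails with error 77 (SSL CA cert problem)."""
    for _ in range(attempts):
        curl = pycurl.Curl()
        setup(curl)                  # caller sets URL, POST fields, etc.
        try:
            curl.perform()
            return True
        except pycurl.error as exc:
            code = exc.args[0]
            if code != 77:           # only retry the intermittent CA-cert error
                raise
            time.sleep(delay)
        finally:
            curl.close()
    return False
</pre>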
 
== pycurl random crashes ==
<pre>
*** longjmp causes uninitialized stack frame ***: python terminated
======= Backtrace: =========
/lib/libc.so.6(__fortify_fail+0x4d)[0x274fed]
/lib/libc.so.6[0x274f5a]
/lib/libc.so.6(__longjmp_chk+0x49)[0x274ec9]
/usr/lib/libcurl.so.4[0x5874b99]
[0x641400]
[0x641424]
</pre>
* this crash stops the run on the affected node
* hacked around in socorro-loadtest.sh for the "forever" infinite-loop case (see the sketch below)
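The "forever" workaround amounts to restarting the submitter whenever glibc aborts the Python process. A hedged sketch, in Python rather than the actual shell script, with an assumed command line:

<pre>
# Hedged sketch of the "forever" workaround: restart the submitter whenever
# the pycurl/glibc abort kills the Python process. The real workaround lives
# in socorro-loadtest.sh (a shell script); the command line here is assumed.
import subprocess
import sys
import time

CMD = [sys.executable, "submitter.py", "-u", "https://collector.example.com/submit"]

while True:
    rc = subprocess.call(CMD)
    if rc != 0:
        # glibc's fortify abort shows up as SIGABRT (negative return code);
        # log it and keep the load running.
        print("submitter died with rc=%d, restarting" % rc)
        time.sleep(1)
</pre>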
= notes =
* had to restart the processors and then the monitor to get them all connected
** make sure to check http://crash-stats.mozilla.com/status
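A minimal sketch for keeping an eye on the status page during a run; it only checks that the page answers, it does not parse processor or monitor state.

<pre>
# Minimal sketch: poll the crash-stats status page during the run. It only
# verifies the page answers with HTTP 200.
import time
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

STATUS_URL = "http://crash-stats.mozilla.com/status"

while True:
    try:
        code = urlopen(STATUS_URL, timeout=10).getcode()
        print("status page: HTTP %d" % code)
    except Exception as exc:
        print("status page unreachable: %s" % exc)
    time.sleep(300)                       # check every five minutes
</pre>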
