Socorro/Pre-PHX Smoketest Schedule

{{bug|619817}}


= test plan =
* What we are going to test and how in terms of load
** what:
*** at what point do collectors fall over?
**** start with 10 test nodes at 24k crashes each, versus all socorro collectors
***** ramp up to 40 test nodes
***** if it can handle full traffic, take collectors out until it fails
***** back down nodes/crashes until we find a stable place
***** check ganglia to see where our bottlenecks are
**** test both direct-to-disk and direct-to-hbase crash storage systems
*** crashes are collected without error
*** all submitted crashes are collected and processed
**** check apache and syslog logs for each collector; FIXED <strike>(syslog not reliable)</strike>
**** check processor and collector logs for errors
**** confirm that all crashes are stored in hbase
** how:  
*** <strike>grinder ({{bug|619815}}) + 20 VMs ({{bug|619814}})</strike>
*** Lars added stats and iteration to submitter.py for initial smoke-test {{bug|622311}}
*** 40 seamicro nodes standing by to test, using [https://bug619814.bugzilla.mozilla.org/attachment.cgi?id=503222 socorro-loadtest.sh] (a driver sketch follows this test plan)
*** pool of 240k crashes, taken over 10 days from MPT prod (Jan 1st through 10th)
** when:
*** waiting on deps in tracking {{bug|619811}}
*** tentative start date: Wednesday, Jan 12, 2011
**** minimum 2-3 days of testing; as much as we can get
* what component failure tests we will run
** disable entire components for 20min to test system recovery
*** hbase
*** postgresql
*** monitor
*** all processors
** disable individual nodes to test the ability of the other nodes to cope and at what point they get overloaded
*** one, two, and three collectors
*** one to five processors
** postgresql failover test
*** failover master01->master02
*** will require manual failover of all components
** hbase specific failure tests
*** kill:
**** random region server
**** important region server
**** random thrift server
***** whoever is connected should get a blip
**** hbase master
**** name node
**** force drbd failover (coordinate with IT)
***** specifically this means killing the hardware for the primary set of admin nodes
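What each seamicro node runs is, in outline, a loop over the crash pool. The Python sketch below is a hedged stand-in for socorro-loadtest.sh, not the script itself; the submitter.py flags, the collector URL, and the crash-pool path are illustrative assumptions.

<pre>
# Hedged sketch of a per-node load driver (NOT the real socorro-loadtest.sh).
# Assumed for illustration only: submitter.py accepts -u COLLECTOR_URL and
# -j CRASH_POOL_DIR, and exits non-zero on a fatal error.
import subprocess
import sys
import time

COLLECTOR_URL = "https://collector.example.com/submit"   # hypothetical URL
CRASH_POOL_DIR = "/data/crash-pool"                       # hypothetical path
ITERATIONS = 10                                           # passes over the 24k-crash pool

def run_once(iteration):
    """Run one pass over the crash pool and report how long it took."""
    start = time.time()
    rc = subprocess.call(
        [sys.executable, "submitter.py", "-u", COLLECTOR_URL, "-j", CRASH_POOL_DIR])
    print("iteration %d: rc=%d in %.1fs" % (iteration, rc, time.time() - start))
    return rc == 0

failures = 0
for i in range(ITERATIONS):
    if not run_once(i):
        failures += 1   # keep going; pycurl can die mid-run (see known problems)
print("done: %d/%d iterations failed" % (failures, ITERATIONS))
</pre>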
 
= known problems =
== unicode-in-metadata problem ==
* crash submitter fails to insert JSON with unicode in it
** <code>UnicodeEncodeError: 'ascii' codec can't encode character u'\u0142' in position 378: ordinal not in range(128)</code>
** ~1.8% of crashes in test data, reproducible
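A minimal sketch of the failure, plus one possible workaround (encoding metadata values to UTF-8 before the POST body is built); the workaround is an assumption, not what submitter.py actually does.

<pre>
# Minimal sketch of the unicode-in-metadata failure; not submitter.py code.
# A metadata value containing u'\u0142' blows up when it is coerced with the
# default 'ascii' codec while the POST body is assembled.
comment = u"Zg\u0142oszenie"              # hypothetical crash metadata value

try:
    field = comment.encode("ascii")       # what the implicit coercion amounts to
except UnicodeEncodeError as exc:
    print("submission fails: %s" % exc)   # matches the error seen in testing

# Possible workaround (an assumption, not the fix that landed): encode all
# metadata values explicitly as UTF-8 before handing the body to pycurl.
field = comment.encode("utf-8")
print("utf-8 field: %r" % field)
</pre>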
== unreliable syslog ==
* FIXED <strike>cannot get 100% reliable collector logs, syslog drops packets {{bug|623410}}</strike>
** <strike>need to keep this in mind when running direct-to-hbase insert mode</strike>
* rare, intermittent pycurl SSL errors (see below)
 
== intermittent pycurl SSL errors ==
* "ERROR (77, 'Problem with the SSL CA cert (path? access rights?)')"
** prevents crash submission
** ~0.5% of crash submissions, not reproducible
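Error 77 is libcurl's CURLE_SSL_CACERT_BADFILE. Since it is rare and not reproducible, one plausible mitigation for a long load run is a small retry wrapper, sketched below; the retry policy and the setup callback are assumptions, not something this test is known to have used.

<pre>
# Hedged sketch of a retry wrapper for the intermittent pycurl error 77;
# not code from this test.
import time
import pycurl

def perform_with_retry(setup, attempts=3, delay=1.0):
    """Retry curl.perform() when it fails with error 77 (SSL CA cert problem)."""
    for _ in range(attempts):
        curl = pycurl.Curl()
        setup(curl)                  # caller sets URL, POST fields, etc.
        try:
            curl.perform()
            return True
        except pycurl.error as exc:
            code = exc.args[0]
            if code != 77:           # only retry the intermittent CA-cert error
                raise
            time.sleep(delay)
        finally:
            curl.close()
    return False
</pre>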
 
== pycurl random crashes ==
<pre>
*** longjmp causes uninitialized stack frame ***: python terminated
======= Backtrace: =========
/lib/libc.so.6(__fortify_fail+0x4d)[0x274fed]
/lib/libc.so.6[0x274f5a]
/lib/libc.so.6(__longjmp_chk+0x49)[0x274ec9]
/usr/lib/libcurl.so.4[0x5874b99]
[0x641400]
[0x641424]
</pre>
* this crash stops the run on the affected node
* hacked around in socorro-loadtest.sh for the "forever" infinite-loop case (see the sketch below)
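The "forever" workaround amounts to restarting the submitter whenever glibc aborts the Python process. A hedged sketch, in Python rather than the actual shell script, with an assumed command line:

<pre>
# Hedged sketch of the "forever" workaround: restart the submitter whenever
# the pycurl/glibc abort kills the Python process. The real workaround lives
# in socorro-loadtest.sh (a shell script); the command line here is assumed.
import subprocess
import sys
import time

CMD = [sys.executable, "submitter.py", "-u", "https://collector.example.com/submit"]

while True:
    rc = subprocess.call(CMD)
    if rc != 0:
        # glibc's fortify abort shows up as SIGABRT (negative return code);
        # log it and keep the load running.
        print("submitter died with rc=%d, restarting" % rc)
        time.sleep(1)
</pre>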
= notes =
* had to restart the processors and then the monitor to get them all connected
** make sure to check http://crash-stats.mozilla.com/status
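A minimal sketch for keeping an eye on the status page during a run; it only checks that the page answers, it does not parse processor or monitor state.

<pre>
# Minimal sketch: poll the crash-stats status page during the run. It only
# verifies the page answers with HTTP 200.
import time
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

STATUS_URL = "http://crash-stats.mozilla.com/status"

while True:
    try:
        code = urlopen(STATUS_URL, timeout=10).getcode()
        print("status page: HTTP %d" % code)
    except Exception as exc:
        print("status page unreachable: %s" % exc)
    time.sleep(300)                       # check every five minutes
</pre>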
