Socorro/Pre-PHX Smoketest Schedule

From MozillaWiki
Revision as of 06:04, 18 January 2011

bug 619817

  • What we are going to test and how in terms of load
    • what:
      • at what point do collectors fall over?
        • start with 10 test nodes at 24k crashes each, versus all socorro collectors
          • ramp up to 40 test nodes
          • if it can handle full traffic, take collectors out until it fails
          • back down nodes/crashes until we find a stable place
          • check ganglia to see where our bottlenecks are
        • test both direct-to-disk and direct-to-hbase crash storage systems
      • crashes are collected without error
      • all submitted crashes are collected and processed
        • check apache logs for collector (syslog not reliable)
        • check processor and collector logs for errors
        • confirm that all crashes are stored in hbase
    • how:
      • grinder (bug 619815) + 20 VMs (bug 619814)
      • Lars added stats and iteration to submitter.py for the initial smoke test (bug 622311)
      • 40 seamicro nodes standing by to test, using socorro-loadtest.sh
      • pool of 240k crashes, taken over 10 days from MPT prod (Jan 1st through 10th)
    • when:
      • waiting on deps in tracking bug 619811
      • tentative start date - Wednesday Jan 12 2011
        • minimum 2-3 days testing; as much as we can get
  • what component failure tests we will run
    • disable entire components for 20min to test system recovery
      • hbase
      • postgresql
      • monitor
      • all processors
    • disable individual nodes to test the ability of the other nodes to cope and at what point they get overloaded
      • one, two, and three collectors
      • one to five processors
    • postgresql failover test
      • failover master01->master02
      • will require manual failover of all components
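
The "all submitted crashes are collected and processed" check in the plan above amounts to a set comparison between the crash IDs the test nodes sent and the IDs found at each later stage. A minimal sketch, assuming the IDs have already been extracted into plain lists (for example from the submitter output, the Apache logs on the collectors, and a scan of hbase). The function name, the dictionary keys, and the sample IDs are hypothetical and are not part of Socorro:

```python
def reconcile(submitted, collected, stored):
    """Compare crash IDs seen at each stage of the pipeline.

    All three arguments are iterables of crash ID strings. How they are
    gathered (log parsing, hbase scan) is outside this sketch.
    """
    submitted, collected, stored = set(submitted), set(collected), set(stored)
    return {
        # sent by a test node but never seen by a collector
        "dropped_before_collector": sorted(submitted - collected),
        # seen by a collector but missing from storage
        "collected_not_stored": sorted((submitted & collected) - stored),
        # present downstream but never submitted in this run
        "unexpected": sorted((collected | stored) - submitted),
    }


# Illustrative IDs only
report = reconcile(
    submitted=["a1", "a2", "a3", "a4"],
    collected=["a1", "a2", "a3"],
    stored=["a1", "a3"],
)
for stage, ids in report.items():
    print(stage, ids)
```

A run passes this check only when all three lists in the report are empty. Because syslog is noted as unreliable, the `collected` input would more plausibly come from the Apache logs.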

known problems

  • crash submitter fails to insert JSON with unicode in it
    • "UnicodeEncodeError: 'ascii' codec can't encode character u'\u0142' in position 378: ordinal not in range(128)"
    • ~1.8% of crashes in test data, reproducible
  • cannot get 100% reliable collector logs, syslog drops packets (bug 623410)
    • need to keep this in mind when running direct-to-hbase insert mode
  • rare, intermittent pycurl SSL errors
    • "ERROR (77, 'Problem with the SSL CA cert (path? access rights?)')"
    • ~0.5% of crashes, not reproducible
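
The UnicodeEncodeError listed above is the kind of error raised when text containing a non-ASCII character such as U+0142 (ł) is pushed through the ascii codec. The submitter's actual code path is not shown on this page, so the following is only a sketch of that failure class and two common ways around it. It is written for Python 3, while the quoted message format (u'\u0142') suggests the submitter ran on Python 2. The crash metadata fields are invented for illustration:

```python
import json

# Hypothetical crash metadata containing U+0142 and U+0105
crash_meta = {"ProductName": "Firefox", "Comments": "b\u0142\u0105d"}

# Reproduce the failure class: encoding non-ASCII text with the ascii codec
try:
    crash_meta["Comments"].encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Option 1: json.dumps escapes non-ASCII characters by default
# (ensure_ascii=True), so the serialized payload is pure ASCII.
ascii_safe = json.dumps(crash_meta)

# Option 2: keep the characters as-is and encode explicitly as UTF-8.
utf8_bytes = json.dumps(crash_meta, ensure_ascii=False).encode("utf-8")

print(ascii_failed)
print(ascii_safe)
print(json.loads(utf8_bytes.decode("utf-8")) == crash_meta)
```

Either option round-trips the original text. Which one fits depends on what the receiving side expects, which this page does not specify.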