Socorro/Pre-PHX Smoketest Schedule: Difference between revisions
Revision as of 07:47, 16 January 2011
bug 619817
- Status
  - blocked on bug 625853, "HBase insert throughput is too slow (PHX)"
- What we are going to test and how in terms of load
  - what:
    - at what point do collectors fall over?
      - start with 10 test nodes at 24k crashes each, versus all socorro collectors
      - ramp up to 40 test nodes
      - if it can handle full traffic, take collectors out until it fails
      - back down nodes/crashes until we find a stable place
      - check ganglia to see where our bottlenecks are
      - test both direct-to-disk and direct-to-hbase crash storage systems
    - crashes are collected without error
    - all submitted crashes are collected and processed
      - check apache logs for collector (syslog not reliable; see the log-check sketch after this section)
      - check processor and collector logs for errors
      - confirm that all crashes are stored in hbase
  - how:
    - grinder (bug 619815) + 20 VMs (bug 619814)
    - Lars added stats and iteration to submitter.py for initial smoke-test, bug 622311
    - 40 seamicro nodes standing by to test, using socorro-loadtest.sh (see the submission-driver sketch after this section)
    - pool of 240k crashes, taken over 10 days from MPT prod (Jan 1st through 10th)
  - when:
    - waiting on deps in tracking bug 619811
    - tentative start date: Wednesday, Jan 12, 2011
    - minimum 2-3 days of testing; as much as we can get
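To make the "how" above concrete, here is a minimal sketch of the kind of per-node submission loop a wrapper like socorro-loadtest.sh might drive. It is illustrative only, not the real submitter.py: the paired <id>.json metadata and <id>.dump minidump file layout, the /submit endpoint, the example collector URL, and the script name are assumptions; only the upload_file_minidump form field follows the usual Breakpad convention.

<pre>
#!/usr/bin/env python3
"""Illustrative per-node crash submission loop (not the real submitter.py).

Assumes the crash pool is a directory of paired files: <id>.json holding the
crash metadata key/value pairs and <id>.dump holding the raw minidump.
"""
import json
import os
import sys
import urllib.request
import uuid


def multipart_body(fields, dump_bytes):
    """Encode the metadata fields plus one minidump as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            ('--%s\r\nContent-Disposition: form-data; name="%s"\r\n\r\n%s\r\n'
             % (boundary, name, value)).encode("utf-8"))
    parts.append(
        ('--%s\r\nContent-Disposition: form-data; name="upload_file_minidump"; '
         'filename="upload_file_minidump"\r\n'
         'Content-Type: application/octet-stream\r\n\r\n' % boundary).encode("utf-8"))
    parts.append(dump_bytes)
    parts.append(("\r\n--%s--\r\n" % boundary).encode("utf-8"))
    return b"".join(parts), "multipart/form-data; boundary=%s" % boundary


def submit_pool(collector_url, pool_dir):
    """POST every crash in pool_dir to the collector and tally the results."""
    ok = failed = 0
    for name in sorted(os.listdir(pool_dir)):
        if not name.endswith(".json"):
            continue
        base = os.path.join(pool_dir, name[:-len(".json")])
        with open(base + ".json", encoding="utf-8") as f:
            fields = json.load(f)
        with open(base + ".dump", "rb") as f:
            dump = f.read()
        body, content_type = multipart_body(fields, dump)
        request = urllib.request.Request(
            collector_url, data=body, headers={"Content-Type": content_type})
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                if response.getcode() == 200:
                    ok += 1
                else:
                    failed += 1
        except Exception as exc:
            failed += 1
            print("submission failed for %s: %s" % (base, exc), file=sys.stderr)
    print("submitted: %d ok, %d failed" % (ok, failed))


if __name__ == "__main__":
    # e.g. python3 load_driver.py http://collector.phx.example/submit ./crash_pool
    submit_pool(sys.argv[1], sys.argv[2])
</pre>

Running several of these in parallel per node, across 10 and then 40 nodes, approximates the ramp described in the "what" items.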
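Because syslog drops packets (see known problems), "crashes are collected without error" is checked against the collectors' apache access logs rather than syslog. A minimal log-check sketch follows; it assumes common/combined log format and that submissions appear as POSTs to /submit, with the log path passed on the command line as a placeholder.

<pre>
#!/usr/bin/env python3
"""Tally collector responses to crash submissions from an apache access log.

Assumes common/combined log format and that submissions are POSTs to /submit;
adjust the pattern and path for the real collector configuration.
"""
import collections
import re
import sys

# Matches e.g.: ... "POST /submit HTTP/1.1" 200 64
REQUEST_RE = re.compile(r'"POST /submit[^"]*" (\d{3}) ')


def tally(log_path):
    counts = collections.Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = REQUEST_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    status_counts = tally(sys.argv[1])   # e.g. /var/log/httpd/access_log
    for status, count in sorted(status_counts.items()):
        print("HTTP %s: %d" % (status, count))
    print("total submissions seen: %d" % sum(status_counts.values()))
</pre>

The totals can then be compared with what the load nodes reported submitting and with a row count taken from hbase, to confirm that every submitted crash was collected, processed, and stored.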
- what component failure tests we will run
  - disable entire components for 20 min to test system recovery (see the outage sketch after this section)
    - hbase
    - postgresql
    - monitor
    - all processors
  - disable individual nodes to test the ability of the other nodes to cope, and at what point they get overloaded
    - one, two, and three collectors
    - one to five processors
  - postgresql failover test
    - failover master01->master02
    - will require manual failover of all components
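For the 20-minute outage tests, a timestamped wrapper makes it easier to read recovery time out of the other components' logs afterwards. This is a minimal sketch only: the stop/start commands are placeholders to be replaced with whatever actually controls each component (hbase, postgresql, monitor, processors) on the host in question.

<pre>
#!/usr/bin/env python3
"""Take one component down for a fixed window, then bring it back, with
timestamps so recovery can be correlated against the other components' logs.

The stop/start commands are placeholders, not the real service names.
"""
import datetime
import subprocess
import sys
import time

OUTAGE_SECONDS = 20 * 60  # the 20-minute window from the test plan


def log(message):
    print("%s %s" % (datetime.datetime.now().isoformat(), message), flush=True)


def outage(stop_cmd, start_cmd):
    log("stopping component: %s" % " ".join(stop_cmd))
    subprocess.check_call(stop_cmd)
    log("component down; sleeping %d seconds" % OUTAGE_SECONDS)
    time.sleep(OUTAGE_SECONDS)
    log("restarting component: %s" % " ".join(start_cmd))
    subprocess.check_call(start_cmd)
    log("component restarted; watch the other components for recovery")


if __name__ == "__main__":
    # e.g. python3 outage.py "sudo service postgresql stop" "sudo service postgresql start"
    outage(sys.argv[1].split(), sys.argv[2].split())
</pre>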
- known problems
  - crash submitter fails to insert JSON with unicode in it (see the example after this list)
    - "UnicodeEncodeError: 'ascii' codec can't encode character u'\u0142' in position 378: ordinal not in range(128)"
    - ~1.8% of crashes in test data, reproducible
  - cannot get 100% reliable collector logs, syslog drops packets (bug 623410)
    - need to keep this in mind when running direct-to-hbase insert mode
  - rare, intermittent pycurl SSL errors
    - "ERROR (77, 'Problem with the SSL CA cert (path? access rights?)')"
    - ~0.5% of crashes, not reproducible
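The UnicodeEncodeError above is the usual failure mode of handing a unicode string containing a non-ASCII character (here U+0142, "ł") to something that implicitly encodes with the ascii codec. A minimal reproduction and two common workarounds, written in Python 3 syntax for clarity (the submitter itself is Python 2 era); the "Comments" field name and the sample value are illustrative, not taken from the test data.

<pre>
"""Reproduce and work around the submitter's UnicodeEncodeError.

Encoding a string containing U+0142 with the ascii codec raises exactly the
kind of error seen in the submitter logs.
"""
import json

metadata = {"Comments": "contains \u0142 (LATIN SMALL LETTER L WITH STROKE)"}

# Reproduction: the ascii codec cannot represent U+0142.
try:
    metadata["Comments"].encode("ascii")
except UnicodeEncodeError as exc:
    print("reproduced:", exc)

# Workaround 1: encode explicitly as UTF-8 before handing bytes to HBase/HTTP.
utf8_bytes = metadata["Comments"].encode("utf-8")
print("utf-8 bytes:", utf8_bytes)

# Workaround 2: keep the serialized JSON ASCII-only; with json's default
# ensure_ascii=True, non-ASCII characters come out as \uXXXX escapes.
ascii_safe_json = json.dumps(metadata)
print("ascii-safe JSON:", ascii_safe_json)
</pre>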