TestEngineering/Services/LoadsToolsAndTesting1

* NOTE 1: Source: https://etherpad.mozilla.org/Loads-Current-Status-Aug2014
* NOTE 2: This is specifically for Loads V1
* NOTE 3: For Loads V2 information, please see https://wiki.mozilla.org/QA/Services/LoadsToolsAndTesting2


= Loads V1 and Vaurien =
Two tools: Loads (V1) and Vaurien.
* Most Stage deployment verification is partially handled through the use of the Loads tool for stress/load (and someday performance) testing.
* Vaurien is a TCP proxy which lets you simulate chaos between your application and a backend server.
** One active(?) POC for Vaurien is with GeoLocation (ichnaea).


== Loads Cluster Usage Rules ==
* Note: There are a number of open bugs and issues (see below) that require Loads use to be focused and specific per project:
** Do not overdo the load test - start with the default values in the config files (see the sketch below).
** Do not run more than two tests in parallel.
** Do not use more than 5 agents per load test unless you genuinely need more.
** Do not run a load test of more than 8-10 hours.
** There are more limitations/rules...
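As a rough sketch of what "start with the default values" means in practice (assuming you are already on the master box; the test FQN below is hypothetical, and the real option list should come from the runner's built-in help):

 $ cd /home/ubuntu/loads/bin
 $ ./loads-runner --help                                   # review the available options and their default values
 $ ./loads-runner myproject.tests.TestService.test_read    # hypothetical test FQN; no overrides, so the defaults apply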
== Loads V1 Cluster Environment/Stack ==
* URLs
** http://loads.services.mozilla.com/
** or http://ec2-54-212-44-143.us-west-2.compute.amazonaws.com/
* Versions
** Loads Cluster/Broker/Agents:
 $ cd /home/ubuntu/loads/bin
 $ ./loads-runner --version
* AWS in US West
** loads-master (broker and agent processes)
** loads-slave-1 (agent processes)
** loads-slave-2 (agent processes)
** NOTE: there is no stack or ELB for this cluster
* Files
** /home/ubuntu
*** loads
*** loads-aws
*** loads-web
* Processes
** Search for processes owned by ubuntu, loads, nginx, circus (see the sketch at the end of this section)
* Logs
** /var/log/redis
** /var/log/nginx
* QA access
** You need special access to be able to SSH into these devices
** You need to make some changes to your .ssh/config file (see the sketch at the end of this section)
* Links
** http://loads.readthedocs.org/en/latest/
** https://github.com/mozilla-services/loads
** https://github.com/mozilla-services/loads-aws
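A rough sketch of the process, log, and SSH checks referenced above, run from the master or a slave box. The key path and bastion settings in the .ssh/config stanza are placeholders - the real values come with the special access mentioned under "QA access":

 # processes: look for anything owned by ubuntu, loads, nginx, or circus
 $ ps aux | egrep 'ubuntu|loads|nginx|circus' | grep -v egrep
 # logs: redis and nginx write under /var/log (file names may differ per box)
 $ ls /var/log/redis /var/log/nginx
 $ sudo tail -f /var/log/nginx/error.log
 # illustrative ~/.ssh/config stanza (host alias and key name are hypothetical)
 Host loads-master
     HostName ec2-54-212-44-143.us-west-2.compute.amazonaws.com
     User ubuntu
     IdentityFile ~/.ssh/loads_qa_key
     # ...plus whatever ProxyCommand/bastion settings Ops provides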
== Loads V1 Cluster Monitoring ==
* Loads Dashboard
** http://loads.services.mozilla.com
* Stackdriver
** https://app.stackdriver.com/groups/6664/stage-loads-cluster
* Cluster status
** Check directly from the Loads Cluster dashboard: http://loads.services.mozilla.com
** Agents statuses
** Launch a health check on all agents
== Loads V1 Cluster Maintenance ==
* If things should go wrong...
* Checking the cluster dashboard
** TBD
* Checking the stack
** TBD
* Restarting the Master/Broker
** TBD
* Restarting the Slaves/Agents
** TBD
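The restart procedures above are still TBD. Since the cluster processes are managed by circus (see "Processes" above), a restart will most likely go through circusctl; the sketch below is an assumption, and the real watcher names should be taken from circusctl status on each box:

 $ circusctl status                  # list the watchers circus is managing and their current state
 $ circusctl restart loads-broker    # hypothetical watcher name - restart a single watcher
 $ circusctl restart                 # restart every watcher managed by this circus daemon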


== Repos ==
== Bugs ==
== Documentation ==
== Loads Cluster Dashboard ==
== Deployment and AWS Instances ==
* Master, two slaves in US West
** loads-master (broker and agent processes)
** loads-slave-1 (agent processes)
** loads-slave-2 (agent processes)
* Note: there is no CF stack or ELB for this cluster
* Note: the Loads cluster state/health can be checked directly from the dashboard (see above)
== Monitoring the cluster via Stackdriver ==
* StackDriver: https://app.stackdriver.com/groups/6664/stage-loads-cluster
== Monitoring the Loads Cluster via the Dashboard ==
* Dashboard: http://loads.services.mozilla.com/
* Check the Loads cluster state/health directly from the dashboard:
** Agents statuses
** Launch a health check on all agents
== Monitoring the Stage environment during Load Tests ==
* We have various dashboards created by Ops that capture and display all sorts of data via the Heka/ES/Kibana pipeline:
** Heka dashboard
** Kibana dashboard
** Stackdriver
== Load Test Results ==
* Load test results are always listed in the dashboard.
* A clean-up of the dashboard is a high priority for V2 - we want a much better/more accurate representation of the test(s) run, with the kind of human-readable results that provide additional meaning/context to the metrics in the various Ops dashboards.
== Reporting (or lack of it) ==
* There were plans to create some reporting in the style of what we had with the Funkload tool.
* There are bugs open about getting some reporting out of Loads.
* No action taken at this time, but a very good candidate for V2.
== QA Wikis ==
== Current projects using Loads ==
* Verifier: https://github.com/mozilla/browserid-verifier/issues/50
* Sync: https://github.com/mozilla-services/server-syncstorage/issues/19
== New/Planned projects using Loads ==
* SimplePush (probably)
* Tiles (maybe)
== Other projects doing load testing ==
== Vaurien ==
= Loads V2 =
* What is it?
* Changes for V2
* Overview/Slides: http://blog.ziade.org/slides/loadsv2/#/
* Initial Diagram: http://blog.ziade.org/loads.jpg
* Initial Look: https://etherpad.mozilla.org/Loadsv2
* Ben's design work: https://etherpad.mozilla.org/loadsv2-design
== Comparison of Load Test Tools ==
* Siege: http://www.joedog.org/siege-home/
* And some others in comparison: http://www.appdynamics.com/blog/devops/load-testing-tools-explained-the-server-side/
* Bees with Machine Guns: https://github.com/newsapps/beeswithmachineguns
* Some of these require large sums of money in order to run adequate load tests (size/time).
* Straight/"dumb" HTTP testing vs. the "smart", scripted tests that we are sending (see the sketch below).
* Some of the off-the-shelf tools are quite limited - we need to be able to use a programming language to define very specific tests/requirements.
* The Grinder, for example, is not really designed to be deployed on AWS.
* Tsunami is good at sending a lot of load at a web service, but it requires writing XML.
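To make the dumb-vs-smart distinction concrete, a hedged sketch (URLs and test names are placeholders): an off-the-shelf tool such as siege can only hammer fixed URLs, while a Loads test is a Python scenario driven through the runner, so it can log in, walk an API sequence, and assert on the responses.

 # "dumb" load: 25 concurrent clients hitting one URL for 5 minutes
 $ siege -c 25 -t 5M https://service.stage.example.com/__heartbeat__
 # "smart" load: a scripted scenario (hypothetical FQN) run through the Loads runner
 $ ./loads-runner myproject.tests.TestService.test_full_sync_flow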
== Tasks ==
* Tarek says:
** So what I was thinking: I can lead Loads v2 development with the help of QA, Ops, and Benjamin for SimplePush, and then slowly transition ownership to the QA and Ops teams - because at the end of the day those are the two teams that should benefit the most from this tool.
== New Repos ==
* https://github.com/loads
* https://github.com/loads/docs
* https://github.com/loads/loads-broker
* https://github.com/loads/loads-tester
* https://github.com/loads/old-loads-agent
* https://github.com/loads/old-loads-broker
* https://github.com/loads/old-loads-base
* https://github.com/loads/old-loads-web
* Note: naming is a bit strange right now because the architecture is in transition
== New Documentation ==
* TBD: for now see https://github.com/loads/docs
== Brown Bag and Info Session ==
* https://etherpad.mozilla.org/loads-brownbag
* Note: This will not take place in September, but could take place in December (2014)
= Brainstorming Loads and V2 =
* What we need going forward
* What we want going forward
* Some issues (generalized - see the GitHub issues for details):
** 1 - Very long runs (>10 hours) are not really working. This is a design problem.
** 2 - Spinning up new slaves for big tests has not yet been automated. We have 2 slave boxes that run 10 agents each, which has been enough for most of our needs so far.
** 3 - The dashboard is sparse. It'll tell you what's going on, but we don't have any real reporting features yet.
** 4 - Running a test in a language other than Python is a bit of a pain (you need to do some ZMQ messaging yourself).
* Stealing from Tarek's slide deck:
** Day-long runs don't really work
** Crappy dashboard
** No direct link to Logs/CPU/Mem usage of stressed servers
** No automatic slaves deployment yet
** Python client only really supported
** High bar to implement clients in Haskell/Go
* Also, we have a lot of open bugs that need to get fixed. Some prevent better use of the tool for newer projects/services.
** Get Loads "fixed" for Mac 10.9 and XCode 5.1.1: https://bugzilla.mozilla.org/show_bug.cgi?id=1010567
* Figure out how to run loads from personal AWS instances
* Monitoring
** What we currently have for Stage
** What do we want/need?
* Reporting
* Loads dashboard
** What about CPU/memory information (like from atop or top - see the sketch at the end of this section)?
** Links to some snapshotted graphs
** code version
** red/yellow/green states
** Deployment bug
** Bugs opened
** Bugs closed
* Scaling the cluster dynamically (see V2)
* Quarterly results/trending
* Targets
** PM targets
** expected targets
** actual targets
* Wiki design
** One per service?
** One per service per deployment?
* Weekly reporting
** What does the QE team want to see?
* Getting the right data/metrics requirements from PMs, then extracting that information and displaying it on the Loads dashboard and/or in the Ops-built dashboards
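As a rough illustration of the kind of CPU/memory snapshot asked for above (plain top/atop on the stressed boxes; how the output would be wired into the dashboard is still an open question):

 $ top -b -n 1 | head -20    # one batch-mode snapshot of load average and the busiest processes
 $ atop 10 6                 # sample system-wide and per-process usage every 10 seconds, 6 samples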
