TestEngineering/Services/FxALoadTesting: Difference between revisions

From MozillaWiki
 
Line 1: Line 1:
== Summary for FxA-Auth-Server ==
* Latest Results
** Link to loads cluster: https://loads.services.mozilla.com/
*** Note: this now requires login privileges and a password
** Snapshots from StackDriver - TBD
** Snapshots from Kibana - TBD
* Latest Deployments
** https://bugzilla.mozilla.org/show_bug.cgi?id=990393
* In Progress
** Load testing for scale and saturation
*** Focus is on https://bugzilla.mozilla.org/show_bug.cgi?id=982408
*** Focus is also on developing some RPS/QPS and some error/threshold % values by mid-Beta
** Scaling: Size and Shape of the FxA stacks
** Bug review and issue debug (see list below)
* Bugs To verify
** None at this time
* Planned
** Focus on monitoring and logging
** Scaling for traffic
* Blockers
** None at this time
* Completed
** First-round load testing of Stage environment
** Second-round load testing of Stage environment
** Link to previous results (short): http://loads.services.mozilla.com/
* Performance
** TBD
== Quick Verification Of Stage Deployments ==
== Quick Verification Of Stage Deployments ==
* This is a quick sanity test of the environment before getting started on load tests.
* This is a quick sanity test of the environment before getting started on load tests.
Line 37: Line 9:
* NOTE: Make sure to install and test from the same branch that is deployed to Stage (ie do not use Master for running the tests against Stage or Production).
* NOTE: Make sure to install and test from the same branch that is deployed to Stage (ie do not use Master for running the tests against Stage or Production).


== Load Test Tool Client/Host ==
* Using TPS
* It is always best to configure an AWS instance as the load test tool client/host.
** The TPS FxA/Sync automated tests can be used as well, but the following file will have to be edited to add Stage environment configuration parameters: https://github.com/mozilla/gecko-dev/blob/master/testing/tps/tps/testrunner.py
* The actual Load Test broker and agents run in the Load Test environment set up by rfkelly. See the following wiki:
** See the following wiki page for more information: https://wiki.mozilla.org/User_Services/Sync/Run_TPS
** https://wiki.mozilla.org/QA/Services/FxATestEnvironments#FxA_Load_Test_Environment
** See also: https://bugzilla.mozilla.org/show_bug.cgi?id=1006675


=== Creating a RHEL AWS instance ===
== Quick Verification Of Production Deployments ==
* Pick a Region then Create Instance > Launch Instance
* This is a quick sanity test of the environment after each new deployment. There are other verifications that can be run as well
* Follow the prompts to create a basic, RHEL-flavored instance
Install FxA-Auth-Server to a local host or an AWS instance (see below)
* Use of the QA/Dev key pairs that have been set up for this:
$ cd fxa-auth-server
** US East Key Pair: QA-Dev-Share (created by jbonacci) for general use
Run the integration tests against the remote Production server (load balancer)
** US West Key Pair: QA-dev-share (created by RaFromBRC) for general use
$ PUBLIC_URL=<FxA Prod> npm run test-remote
* Once the instance is running, log in as "ec2-user"
Current example:
$ PUBLIC_URL=https://api.accounts.firefox.com npm run test-remote
* NOTE: Make sure to install and test from the same branch that is deployed to Production.


* The following apps, tools, and libs will need to be installed for use with various Services applications:
== Load Test Tool Client/Host ==
** gcc, gcc-c++
* It is always best to configure an AWS instance as the host for all load testing.
** hg
* All load tests can now run on the localhost (the AWS instance) or against the new Loads Cluster. See the following links for more information:  
** git
** https://wiki.mozilla.org/QA/Services/LoadsV1ClientTestHost
** python-devel
** https://wiki.mozilla.org/QA/Services/LoadsToolsAndTesting1
** automake, autoconf, and libtool  (required for libzmq, for easy_install)
** pip
** virtualenv
** node/npm
** zeromq 3.X
** gmp, gmp-devel
 
* Also, general rhel updates:
$ sudo yum -y update
and/or
$ sudo yum -y upgrade
 
* Now, the instance should be ready for installing and using the Loads tool.
 
=== Creating an Ubuntu AWS instance ===
* Pick a Region then Create Instance > Launch Instance
* Follow the prompts to create a basic, Ubuntu-flavored instance
* Use of the QA/Dev key pairs that have been set up for this:
** US East Key Pair: QA-Dev-Share (created by jbonacci) for general use
** US West Key Pair: QA-dev-share (created by RaFromBRC) for general use
* Once the instance is running, log in as "ubuntu"
 
* The following apps, tools, and libs will need to be installed for use with various Services applications:
** gcc, g++
** mercurial
** git
** python-setuptools, python-virtualenv, and python-dev
** automake, autoconf, libtool
** m4
** node/npm
** libzmq and zeromq 3.X
** gmp-5.1.3 or newer
 
* Also, general Ubuntu updates:
$ sudo apt-get update
and/or
$ sudo apt-get upgrade
 
* Now, the instance should be ready for installing and using the Loads tool.


== Installing the Loads tool on the AWS instance via FxA-Auth-Server ==
== Installing FxA-Auth-Server and the Loads tool on Localhost or AWS ==
  Installation:
  Installation:
  $ git clone https://github.com/mozilla/fxa-auth-server.git
  $ git clone https://github.com/mozilla/fxa-auth-server.git
  $ cd ./fxa-auth-server
  $ cd ./fxa-auth-server
Note: You may want to install a specific branch for testing vs defaulting to Master
  $ npm install
  $ npm install
  $ npm test
  $ npm test
  $ cd ./loadtest
  $ cd ./test/load
  $ make build
  $ make build


* Note: 'npm install' may need to be run now as root.
* Note: This will install a local copy of the Loads tool for use with FxA-Auth-Server.
* Note: This will install a local copy of the Loads tool for use with FxA-Auth-Server.


Line 112: Line 49:
* The full, default load test can be run as follows
* The full, default load test can be run as follows
  $ make bench SERVER_URL=https://api-accounts.stage.mozaws.net
  $ make bench SERVER_URL=https://api-accounts.stage.mozaws.net
Note: the current version of 'make bench' tends to use a lot of CPU and Memory on the localhost.   
The recommendation is to use 'make test' and 'make megabench' instead (see below)...


* Configuring the bench load test - config folder:
* Configuring the bench load test - config folder:
Line 138: Line 78:
* The same optional configuration changes apply here.
* The same optional configuration changes apply here.


== Using the Loads Services Cluster ==
== Using the Loads V1 Services Cluster ==
* By using the Loads Services Cluster, we can offload the broker/agents processes and save client-side CPU and memory.
* By using the Loads Services Cluster, we can offload the broker/agents processes and save client-side CPU and memory.
* Changes were made to Makefile and the load test to use the cluster and some associated config files (for test, bench, megabench).  
* Changes were made to Makefile and the load test to use the cluster and some associated config files (for test, bench, megabench).  
Line 162: Line 102:
*** Observer (this can be email or irc - the default is irc #services-dev channel)
*** Observer (this can be email or irc - the default is irc #services-dev channel)


* REF: https://wiki.mozilla.org/QA/Services/FxATestEnvironments#Loads_Services_Cluster_Environment
* REF: https://wiki.mozilla.org/QA/Services/LoadsToolsAndTesting1
* REF: https://github.com/mozilla/fxa-auth-server/tree/master/loadtest/config
* REF: https://github.com/mozilla/fxa-auth-server/tree/master/loadtest/config
== Configuring The Load Tests ==
* Makefile
** The SERVER_URL constant can be changed.
* Config files
** For make test:
*** Number of hits
*** Number of concurrent users
** For make bench:
*** Number of concurrent users
*** Duration of test
** For make megabench:
*** Number of concurrent users
*** Duration of test
*** Include file (this is code dependent)
*** Python dependencies (this is code dependent)
*** Broker to use for testing (leave as defined for now; this is the broker in the Loads Cluster)
*** Agents to use for testing (default is 5, max is currently 20, but depends on the number of concurrent load tests running)
*** Detach mode (leave as defined for now to automatically detach from the load test once it starts on the localhost)
*** Observer (this can be email or irc - the default is irc #services-dev channel)
* Load Test code: loadtests.py
** The load test can be configured in the code - see the following lines:
** https://github.com/mozilla/fxa-auth-server/blob/master/test/load/loadtests.py#L17-L39
* General REFs:
** https://github.com/mozilla/fxa-auth-server/blob/master/test/load/loadtests.py


== Test Coverage and Stats ==
== Test Coverage and Stats ==
Line 213: Line 183:
** or http://loads.services.mozilla.com
** or http://loads.services.mozilla.com
* Cluster status
* Cluster status
** Check from any loadtest folder or loads install:
** Check directly from the Loads Cluster dashboard:
  ../bin/loads-runner --ping-broker --ssh=ubuntu@loads.services.mozilla.com
  Agents statuses
  ../bin/loads-runner --check-cluster --ssh=ubuntu@loads.services.mozilla.com
  Launch a health check on all agents
(or similar)
* and also on StackDriver: https://app.stackdriver.com/groups/6664/stage-loads-cluster
 
* OPs has set up Kibana dashboards for monitoring Stage:
** Kibana: https://kibana.fxa.us-east-1.stage.mozaws.net/#/dashboard
** Kibana: https://kibana.fxa.us-east-1.stage.mozaws.net/#/dashboard/file/weblogs.json
** Heka: https://heka.fxa.us-east-1.stage.mozaws.net/#health
** Note: Make sure to have the Mozilla Root Cert set up in your browser: https://wiki.mozilla.org/MozillaRootCertificate
** Note: We can, with proper privileges, SSH into the log aggregator and restart the elasticsearch and hekad processes.


* OPs is also working on StackDriver support for Stage...
* For all other monitoring, see the following section:
** https://app.stackdriver.com/
** https://wiki.mozilla.org/QA/Services/FxATestEnvironments#Monitoring_the_Stage_Environment
** https://app.stackdriver.com/groups/4393/stage-fxa


== Performance Testing Information ==
== Performance Testing Information ==
Line 235: Line 197:
* The documentation can be found here:
* The documentation can be found here:
** https://loads.readthedocs.org/en/latest
** https://loads.readthedocs.org/en/latest
** The most useful information here is running in detached mode and using an observer
*** https://loads.readthedocs.org/en/latest/distributed/#detach-mode
*** https://loads.readthedocs.org/en/latest/commands
*** and here: https://loads.readthedocs.org/en/latest/internals/?highlight=observer
* The repositories are here:
* The repositories are here:
** https://github.com/mozilla-services/loads
** https://github.com/mozilla-services/loads
** https://github.com/mozilla-services/loads.js
** https://github.com/mozilla-services/loads-aws
** https://github.com/mozilla-services/loads-web
* The Services cluster is here:
* The Services cluster is here:
** http://loads.services.mozilla.com
** http://loads.services.mozilla.com
Line 247: Line 206:
== Known Bugs, Issues, and Tasks ==
== Known Bugs, Issues, and Tasks ==
* FxA
* FxA
** https://github.com/mozilla/fxa-auth-server/issues/645
** https://github.com/mozilla/fxa-auth-server/issues
** https://github.com/mozilla/fxa-auth-server/issues/647
** https://github.com/mozilla/fxa-content-server/issues
** and several others


** Meta: https://bugzilla.mozilla.org/show_bug.cgi?id=907475
* Bugzilla
** Meta: https://bugzilla.mozilla.org/show_bug.cgi?id=907494
** No specific category
** Load Testing: https://bugzilla.mozilla.org/show_bug.cgi?id=982408
** https://bugzilla.mozilla.org/show_bug.cgi?id=982193


* Infrastructure
* Infrastructure
** https://github.com/mozilla-services/puppet-config/pull/223
** https://github.com/mozilla-services/puppet-config/issues
** https://github.com/mozilla-services/puppet-config/pull/283
** https://github.com/mozilla-services/svcops/issues


* Loads Tool and clusters
* Loads Tool and clusters
** https://github.com/mozilla-services/loads/issues/151
** https://github.com/mozilla-services/loads/issues
** https://github.com/mozilla-services/loads/issues/214
** https://github.com/mozilla-services/loads-web/issues
** https://github.com/mozilla-services/loads/issues/216
** https://github.com/mozilla-services/loads-aws/issues
** https://github.com/mozilla-services/loads/issues/217
** https://github.com/mozilla-services/loads/issues/222
** https://github.com/mozilla-services/loads/issues/225
** https://github.com/mozilla-services/loads/issues/229
** https://github.com/mozilla-services/loads/issues/233
** https://github.com/mozilla-services/loads/issues/234
** https://github.com/mozilla-services/loads/issues/235
** https://github.com/mozilla-services/loads/issues/236
** https://github.com/mozilla-services/loads/issues/244
** https://github.com/mozilla-services/loads/issues/246
** https://github.com/mozilla-services/loads/issues/247
** https://github.com/mozilla-services/loads/issues/248
** https://github.com/mozilla-services/loads-web/issues/22
** https://github.com/mozilla-services/loads-web/issues/24


== Capacity Planning Stage and Production ==
== Capacity Planning Stage and Production ==
Line 304: Line 248:
== References ==
== References ==
* Repository: https://github.com/mozilla/fxa-auth-server
* Repository: https://github.com/mozilla/fxa-auth-server
* The QA Test Environments: https://wiki.mozilla.org/QA/Services/FxATestEnvironments
* The QA Test Environments:
** https://wiki.mozilla.org/QA/Services/FxATestEnvironments
** https://wiki.mozilla.org/QA/Services/TSVerifierSyncTestEnvironments
* Deploying the FxA Load Test environment for broker/agents usage:
* Deploying the FxA Load Test environment for broker/agents usage:
** https://github.com/mozilla/fxa-deployment
** https://github.com/mozilla/fxa-deployment
* OPs pages for stats collection, logging, monitoring
* OPs pages for stats collection, logging, monitoring
** TBD
** TBD

Latest revision as of 20:01, 26 August 2016

Quick Verification Of Stage Deployments

  • This is a quick sanity test of the environment before getting started on load tests.
Install FxA-Auth-Server to a local host or an AWS instance (see below)
$ cd fxa-auth-server
Run the integration tests against the remote Stage server (load balancer)
$ PUBLIC_URL=<FxA Stage> npm run test-remote
Current example:
$ PUBLIC_URL=https://api-accounts.stage.mozaws.net npm run test-remote
  • NOTE: Make sure to install and test from the same branch that is deployed to Stage (i.e., do not use Master for running the tests against Stage or Production).

Quick Verification Of Production Deployments

  • This is a quick sanity test of the environment after each new deployment. Other verifications can be run as well.
Install FxA-Auth-Server to a local host or an AWS instance (see below)
$ cd fxa-auth-server
Run the integration tests against the remote Production server (load balancer)
$ PUBLIC_URL=<FxA Prod> npm run test-remote
Current example:
$ PUBLIC_URL=https://api.accounts.firefox.com npm run test-remote
  • NOTE: Make sure to install and test from the same branch that is deployed to Production.

Load Test Tool Client/Host

Installing FxA-Auth-Server and the Loads tool on Localhost or AWS

Installation:
$ git clone https://github.com/mozilla/fxa-auth-server.git
$ cd ./fxa-auth-server
Note: You may want to install a specific branch for testing vs defaulting to Master
$ npm install
$ npm test
$ cd ./test/load
$ make build
  • Note: 'npm install' may need to be run now as root.
  • Note: This will install a local copy of the Loads tool for use with FxA-Auth-Server.

Running the Loads tool against FxA Stage

  • The basic load test can be run as follows
$ make test SERVER_URL=https://api-accounts.stage.mozaws.net
  • The full, default load test can be run as follows
$ make bench SERVER_URL=https://api-accounts.stage.mozaws.net

Note: The current version of 'make bench' tends to use a lot of CPU and memory on the localhost.
We recommend using 'make test' and 'make megabench' instead (see below).
  • Configuring the bench load test - config folder:
    • The test.ini file (for make test) can be configured for the following:
      • Number of hits
      • Number of concurrent users
    • The bench.ini file (for make bench) can be configured for the following:
      • Number of concurrent users
      • Duration of test
  • For both tests, start with the defaults, then tweak the duration. Users and agents are optional tweaks. The bench load test can also be configured to run in detached mode, with appropriate detach and observer settings.

Running the Loads tool against FxA Development or Production

  • This can be done if we are comparing Stage vs. some other environment and have access to the AWS logs in Dev or Production:
  • Dev:
$ make test SERVER_URL=https://accounts.dev.lcip.org
$ make bench SERVER_URL=https://accounts.dev.lcip.org
  • Prod:
$ make test SERVER_URL=https://api.accounts.firefox.com
$ make bench SERVER_URL=https://api.accounts.firefox.com
  • The same optional configuration changes apply here.
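
The make invocations above differ only in the target and the SERVER_URL value, so a small wrapper can keep the URLs in one place. The following is a minimal sketch, not part of fxa-auth-server: the helper names fxa_server_url and fxa_load_cmd are hypothetical, and the URLs are the ones quoted in this section. It only prints the command, so nothing runs against a live environment by accident.

```shell
#!/bin/sh
# Hypothetical helpers (not part of fxa-auth-server).
# Map an environment name to the SERVER_URL quoted in this section.
fxa_server_url() {
  case "$1" in
    stage) echo "https://api-accounts.stage.mozaws.net" ;;
    dev)   echo "https://accounts.dev.lcip.org" ;;
    prod)  echo "https://api.accounts.firefox.com" ;;
    *)     echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) the make command for a target and an environment.
fxa_load_cmd() {
  url=$(fxa_server_url "$2") || return 1
  echo "make $1 SERVER_URL=$url"
}

fxa_load_cmd test stage
fxa_load_cmd bench dev
```

To actually start a run, the printed line can be pasted into a shell opened in the load test folder of the checkout.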

Using the Loads V1 Services Cluster

  • By using the Loads Services Cluster, we can offload the broker/agents processes and save client-side CPU and memory.
  • Changes were made to Makefile and the load test to use the cluster and some associated config files (for test, bench, megabench).
  • Testing against the Stage environment:
$ make megabench SERVER_URL=https://api-accounts.stage.mozaws.net
  • Testing against the Dev environment:
$ make megabench SERVER_URL=https://api-accounts.dev.lcip.org
  • Testing against the Prod environment:
$ make megabench SERVER_URL=https://api.accounts.firefox.com
  • Configuring the megabench load test - config folder:
    • The megabench.ini file (for make megabench) can be configured for the following:
      • Number of concurrent users
      • Duration of test
      • Include file (leave as defined for now)
      • Python dependencies (leave as defined for now)
      • Broker to use for testing (leave as defined for now; this is the broker in the Loads Cluster)
      • Agents to use for testing (default is 5, max is currently 20, but depends on the number of concurrent load tests running)
      • Detach mode (leave as defined for now to automatically detach from the load test once it starts on the localhost)
      • Observer (this can be email or irc - the default is irc #services-dev channel)

Configuring The Load Tests

  • Makefile
    • The SERVER_URL constant can be changed.
  • Config files
    • For make test:
      • Number of hits
      • Number of concurrent users
    • For make bench:
      • Number of concurrent users
      • Duration of test
    • For make megabench:
      • Number of concurrent users
      • Duration of test
      • Include file (this is code dependent)
      • Python dependencies (this is code dependent)
      • Broker to use for testing (leave as defined for now; this is the broker in the Loads Cluster)
      • Agents to use for testing (default is 5, max is currently 20, but depends on the number of concurrent load tests running)
      • Detach mode (leave as defined for now to automatically detach from the load test once it starts on the localhost)
      • Observer (this can be email or irc - the default is irc #services-dev channel)

Test Coverage and Stats

  • Basic tweakable values for all load tests
    • users = number of concurrent users/agent
    • agents = number of agents requested from the cluster; the run errors out if that many are not available
    • duration = in seconds
    • hits = 1 or X number of rounds/hits/iterations
  • Location: fxa-auth-server/loadtest/loadtests.py
  • The following items are covered in the load test
    • test_auth_server is the main entry point in the loadtests.py file
      • account creation
      • session creation
      • account deletion
      • session deletion
  • Integration tests
    • These are designed to cover the edge/error cases that are not applicable to the load test
    • The tests can be run against a remote server
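
Because "users" is counted per agent, the total concurrency of a cluster run is the product of the two values listed under the basic tweakable values. The sketch below works through that arithmetic. The helper name total_users is hypothetical, the value of 20 users is purely illustrative, and 5 agents is the default quoted elsewhere on this page.

```shell
#!/bin/sh
# Hypothetical helper: total concurrent users = users per agent * agents.
total_users() {
  echo $(( $1 * $2 ))
}

users=20    # illustrative value, not a documented default
agents=5    # default agent count quoted on this page
echo "total concurrent users: $(total_users "$users" "$agents")"
```

Doing this multiplication before raising either value helps avoid sending far more load at Stage than intended.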

Analyzing the Results

  • TBD

Debugging the Issues

  • There are several methods and tools for debugging the load test errors and other issues.
  • 1. Important logs for FxA-Auth-Server (per server)
    • /media/ephemeral0/fxa-auth-server/auth_err.log.*
    • /media/ephemeral0/fxa-auth-server/auth_out.log
    • /media/ephemeral0/heka/hekad_err.log
    • /media/ephemeral0/heka/hekad_out.log
    • /media/ephemeral0/nginx/logs/access.log
    • /media/ephemeral0/nginx/logs/error.log
  • Acceptable FxA-Auth-Server errors
    • 503s, especially on /v1/certificate/sign, are usually a sign that we are overloading the hosts.
    • 400s: in general we should not see these in the logs, especially with an "errno" value of 105. Check fxa-auth-server/auth_err.log.
    • 400s with "errno" values of 101 or 102 are OK. These can be expected during a load test.
    • ELB issues: we may see 503s and corresponding "err":"cannot enqueue work: maximum backlog exceeded (30)" messages if one or more of the hosts behind the ELB is receiving most of the load traffic.
    • REF: https://github.com/mozilla/fxa-auth-server/issues/647
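
The triage rules above can be written down as a small lookup, which is handy when skimming many log entries. This is a sketch only: classify_error is a hypothetical helper, and the labels it prints ("overload", "expected", "investigate", "unclassified") are made up for illustration. It takes an HTTP status and an optional errno value that have already been extracted from a log line; how to extract them depends on the log format, which is not assumed here.

```shell
#!/bin/sh
# Hypothetical triage helper following the notes above.
# Usage: classify_error HTTP_STATUS [ERRNO]
classify_error() {
  code=$1
  err_no=${2:-}
  case "$code" in
    503) echo "overload" ;;            # hosts (or one host behind the ELB) are saturated
    400)
      case "$err_no" in
        101|102) echo "expected" ;;    # acceptable during a load test
        *)       echo "investigate" ;; # includes errno 105: check auth_err.log
      esac ;;
    *) echo "unclassified" ;;
  esac
}

classify_error 400 101
classify_error 400 105
classify_error 503
```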

Monitoring FxA Stage

  • Cluster status: check directly from the Loads Cluster dashboard, using the "Agents statuses" and "Launch a health check on all agents" options.

Performance Testing Information

  • TBD

Details on the Load Test tool

Known Bugs, Issues, and Tasks

  • Bugzilla
    • No specific category

Capacity Planning Stage and Production

  • QA is tasked with providing some capacity requirements and constraints based on repeated load testing of the FxA-Auth-Server Stage environment.
  • The goal is to be able to work with OPs to develop a realistic plan for deploying and maintaining the production environment at a level expected for projected user traffic, etc.
  • Brainstorming the QA role:
    • QA needs to get some realistic numbers from the Product team. This could be as simple as traffic flow (number of users per day or per segments of the day - peaks and valleys) or more detailed:
      • Traffic flow - QPS/RPS
      • Average number of users per time segment
      • Average and peak latency
      • Error percentages and thresholds
      • etc
    • QA gets help from OPs to learn how to measure those required numbers/values using StackDriver or other tools (or to get data from OPsView). If those numbers cannot be measured, then we need to either
      • get a different set of data points from the Product team
      • enhance the current tools to track and measure the required data
    • QA does repeated, scheduled, well-defined load tests in Stage while actively monitoring the results, logs, data, etc.
    • QA finds a stable configuration that - when scaled - would
      • match the needs of Product when we release
      • match the realistic capacity planning that OPs normally does
  • Dependencies
    • Realistic traffic/user numbers from the FxA Product team
    • Timely training on monitoring tools from the OPs team
    • Regular and realistic scaling/testing of deployments to Stage by QA given our current pre-release and post-release schedules
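
The traffic and error figures mentioned in the brainstorming list reduce to simple ratios once a run has produced a request count, an error count, and a duration. The following is a minimal sketch of that arithmetic, assuming those three numbers are already available from the load test results or the monitoring tools. The helper names are hypothetical, and any limit passed to within_threshold is a placeholder until Product and OPs agree on real thresholds.

```shell
#!/bin/sh
# Hypothetical helpers for capacity-planning arithmetic.

# rps REQUESTS SECONDS: average requests per second, two decimals.
rps() {
  awk -v r="$1" -v s="$2" 'BEGIN { if (s == 0) exit 1; printf "%.2f\n", r / s }'
}

# error_pct ERRORS REQUESTS: errors as a percentage of all requests.
error_pct() {
  awk -v e="$1" -v r="$2" 'BEGIN { if (r == 0) exit 1; printf "%.2f\n", 100 * e / r }'
}

# within_threshold PCT LIMIT: succeeds when PCT is at or below LIMIT.
within_threshold() {
  awk -v p="$1" -v l="$2" 'BEGIN { exit !(p <= l) }'
}

# Illustrative numbers only: 9000 requests and 45 errors over 300 seconds.
echo "average RPS: $(rps 9000 300)"
echo "error rate:  $(error_pct 45 9000)%"
```

For example, `within_threshold "$(error_pct 45 9000)" 1 && echo ok` would print "ok" for a placeholder limit of 1 percent.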

References