TestEngineering/Services/FxALoadTesting


Quick Verification Of Stage Deployments

  • This is a quick sanity test of the environment before getting started on load tests.
Install FxA-Auth-Server to a local host or an AWS instance (see below)
$ cd fxa-auth-server
Run the integration tests against the remote Stage server (load balancer)
$ PUBLIC_URL=<FxA Stage> npm run test-remote
Current example:
$ PUBLIC_URL=https://api-accounts.stage.mozaws.net npm run test-remote
  • NOTE: Make sure to install and test from the same branch that is deployed to Stage (i.e., do not use master when running the tests against Stage or Production).

Quick Verification Of Production Deployments

  • This is a quick sanity test of the environment after each new deployment. There are other verifications that can be run as well
Install FxA-Auth-Server to a local host or an AWS instance (see below)
$ cd fxa-auth-server
Run the integration tests against the remote Production server (load balancer)
$ PUBLIC_URL=<FxA Prod> npm run test-remote
Current example:
$ PUBLIC_URL=https://api.accounts.firefox.com npm run test-remote
  • NOTE: Make sure to install and test from the same branch that is deployed to Production.

Load Test Tool Client/Host

Installing FxA-Auth-Server and the Loads tool on Localhost or AWS

Installation:
$ git clone https://github.com/mozilla/fxa-auth-server.git
$ cd ./fxa-auth-server
Note: You may want to install a specific branch for testing rather than defaulting to master
$ npm install
$ npm test
$ cd ./test/load
$ make build
  • Note: 'npm install' may need to be run now as root.
  • Note: This will install a local copy of the Loads tool for use with FxA-Auth-Server.

Running the Loads tool against FxA Stage

  • The basic load test can be run as follows
$ make test SERVER_URL=https://api-accounts.stage.mozaws.net
  • The full, default load test can be run as follows
$ make bench SERVER_URL=https://api-accounts.stage.mozaws.net

  • Note: the current version of 'make bench' tends to use a lot of CPU and memory on the localhost. The recommendation is to use 'make test' and 'make megabench' instead (see below).
  • Configuring the bench load test - config folder:
    • The test.ini file (for make test) can be configured for the following:
      • Number of hits
      • Number of concurrent users
    • The bench.ini file (for make bench) can be configured for the following:
      • Number of concurrent users
      • Duration of test
  • For both tests, start with the defaults, then tweak the duration; the users and agents values are optional tweaks. The bench load test can also be configured to run in detached mode with appropriate loads detach and observer settings.
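As an illustration, a hypothetical test.ini fragment (the key names below are assumptions based on the options listed above and the loads tool's ini format, not copied from the repo):

```ini
# Hypothetical test.ini fragment (for 'make test') - key names assumed
[loads]
hits = 10        ; number of hits (rounds) per user
users = 5        ; number of concurrent users

# Hypothetical bench.ini fragment (for 'make bench')
; [loads]
; users = 20     ; number of concurrent users
; duration = 600 ; duration of the test, in seconds
```

Start from the checked-in defaults and adjust these values incrementally between runs.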

Running the Loads tool against FxA Development or Production

  • This can be done if we are comparing Stage vs. some other environment and have access to the AWS logs in Dev or Production:
  • Dev:
$ make test SERVER_URL=https://accounts.dev.lcip.org
$ make bench SERVER_URL=https://accounts.dev.lcip.org
  • Prod:
$ make test SERVER_URL=https://api.accounts.firefox.com
$ make bench SERVER_URL=https://api.accounts.firefox.com
  • The same optional configuration changes apply here.

Using the Loads V1 Services Cluster

  • By using the Loads Services Cluster, we can offload the broker/agents processes and save client-side CPU and memory.
  • Changes were made to Makefile and the load test to use the cluster and some associated config files (for test, bench, megabench).
  • Testing against the Stage environment:
$ make megabench SERVER_URL=https://api-accounts.stage.mozaws.net
  • Testing against the Dev environment:
$ make megabench SERVER_URL=https://api-accounts.dev.lcip.org
  • Testing against the Prod environment:
$ make megabench SERVER_URL=https://api.accounts.firefox.com
  • Configuring the megabench load test - config folder:
    • The megabench.ini file (for make megabench) can be configured for the following:
      • Number of concurrent users
      • Duration of test
      • Include file (leave as defined for now)
      • Python dependencies (leave as defined for now)
      • Broker to use for testing (leave as defined for now - this is the broker in the Loads Cluster)
      • Agents to use for testing (default is 5, max is currently 20, but depends on the number of concurrent load tests running)
      • Detach mode (leave as defined for now to automatically detach from the load test once it starts on the localhost)
      • Observer (this can be email or irc - the default is irc #services-dev channel)
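Putting the options above together, a hypothetical megabench.ini fragment (key names are assumptions mirroring the list above and the loads tool's ini format, not copied from the repo):

```ini
# Hypothetical megabench.ini fragment (for 'make megabench') - key names assumed
[loads]
users = 20                ; concurrent users per agent
duration = 600            ; duration of the test, in seconds
include_file = loadtests.py  ; leave as defined for now
python_dep = PyBrowserID     ; leave as defined for now
broker = tcp://loads.cluster.example:5553  ; placeholder - use the Loads Cluster broker as defined
agents = 5                ; default 5, max currently 20
detach = true             ; detach from the load test once it starts
observer = irc            ; default reporting goes to the #services-dev channel
```

The broker address above is a placeholder; keep the value already defined in the repo's config.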

Configuring The Load Tests

  • Makefile
    • The SERVER_URL constant can be changed.
  • Config files
    • For make test:
      • Number of hits
      • Number of concurrent users
    • For make bench:
      • Number of concurrent users
      • Duration of test
    • For make megabench:
      • Number of concurrent users
      • Duration of test
      • Include file (this is code dependent)
      • Python dependencies (this is code dependent)
      • Broker to use for testing (leave as defined for now - this is the broker in the Loads Cluster)
      • Agents to use for testing (default is 5, max is currently 20, but depends on the number of concurrent load tests running)
      • Detach mode (leave as defined for now to automatically detach from the load test once it starts on the localhost)
      • Observer (this can be email or irc - the default is irc #services-dev channel)

Test Coverage and Stats

  • Basic tweakable values for all load tests
    • users = number of concurrent users/agent
    • agents = number of agents to request from the cluster (requesting more than are available errors out)
    • duration = in seconds
    • hits = 1 or X number of rounds/hits/iterations
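As a back-of-the-envelope check (a sketch for reasoning about the knobs, not part of the loads tool), the values combine roughly like this: each agent runs `users` concurrent users, so a hits-based run issues about users × agents × hits rounds:

```python
# Rough arithmetic for sizing a load test run. This is an illustrative
# helper, not part of the loads tool or loadtests.py.

def total_rounds(users: int, agents: int, hits: int) -> int:
    """Approximate number of test rounds for a hits-based run (make test)."""
    return users * agents * hits

def peak_concurrency(users: int, agents: int) -> int:
    """Approximate concurrent users for a duration-based run (make bench/megabench)."""
    return users * agents

print(total_rounds(users=5, agents=5, hits=10))   # 250 rounds
print(peak_concurrency(users=20, agents=5))       # 100 concurrent users
```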
  • Location: fxa-auth-server/loadtest/loadtests.py
  • The following items are covered in the load test
    • test_auth_server is the main entry point in the loadtests.py file
      • account creation
      • session creation
      • account deletion
      • session deletion
  • Integration tests
    • These are designed to cover the edge/error cases that are not applicable to the load test
    • The tests can be run against a remote server

Analyzing the Results

  • TBD

Debugging the Issues

  • There are several methods and tools for debugging the load test errors and other issues.
  • 1. Important logs for FxA-Auth-Server (per server)
    • /media/ephemeral0/fxa-auth-server/auth_err.log.*
    • /media/ephemeral0/fxa-auth-server/auth_out.log
    • /media/ephemeral0/heka/hekad_err.log
    • /media/ephemeral0/heka/hekad_out.log
    • /media/ephemeral0/nginx/logs/access.log
    • /media/ephemeral0/nginx/logs/error.log
  • Acceptable FxA-Auth-Server errors
    • 503s: especially of this type - /v1/certificate/sign - are usually a sign that we are overloading the hosts
    • 400s: we should never see these in the logs, especially if the "errno" value is 105. Check the fxa-auth-server/auth_err.log
    • 400s: "errno" values of 101, 102 are ok. These can be expected during a load test.
    • ELB issues: we may see 503s and corresponding "err":"cannot enqueue work: maximum backlog exceeded (30)" messages if one or more of the hosts behind the ELB is receiving most of the load traffic.
      • REF: https://github.com/mozilla/fxa-auth-server/issues/647
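A quick way to scan an auth_err.log-style JSON log for a given errno is grep; the sample lines below are made up to illustrate the assumed log format:

```shell
# Create a small sample in the assumed auth_err.log JSON style
cat > /tmp/auth_err_sample.log <<'EOF'
{"code":400,"errno":101,"message":"Account already exists"}
{"code":400,"errno":105,"message":"Invalid verification code"}
{"code":503,"errno":201,"message":"Service unavailable"}
EOF

# Count errno 105 entries - these should never appear during a load test
grep -c '"errno":105' /tmp/auth_err_sample.log
```

On a real host, point grep at /media/ephemeral0/fxa-auth-server/auth_err.log instead of the sample file.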

Monitoring FxA Stage

  • Agents statuses
  • Launch a health check on all agents

Performance Testing Information

  • TBD

Details on the Load Test tool

Known Bugs, Issues, and Tasks

  • Bugzilla
    • No specific category

Capacity Planning Stage and Production

  • QA is tasked with providing some capacity requirements and constraints based on repeated load testing of the FxA-Auth-Server Stage environment.
  • The goal is to be able to work with OPs to develop a realistic plan for deploying and maintaining the production environment at a level expected for projected user traffic, etc.
  • Brainstorming the QA role:
    • QA needs to get some realistic numbers from the Product team. This could be as simple as traffic flow (number of users per day or per segments of the day - peaks and valleys) or more detailed:
      • Traffic flow - QPS/RPS
      • Average number of users per time segment
      • Average and peak latency
      • Error percentages and thresholds
      • etc
    • QA gets help from OPs to learn how to measure those required numbers/values using StackDriver or other tools (or to get data from OPsView). If those numbers can not be measured then we either need to
      • get a different set of data points from the Product team
      • enhance the current tools to track and measure the required data
    • QA does repeated, scheduled, well-defined load tests in Stage while actively monitoring the results, logs, data, etc.
    • QA finds a stable configuration that - when scaled - would
      • match the needs of Product when we release
      • match the realistic capacity planning that OPs normally does
  • Dependencies
    • Realistic traffic/user numbers from the FxA Product team
    • Timely training on monitoring tools from the OPs team
    • Regular and realistic scaling/testing of deployments to Stage by QA given our current pre-release and post-release schedules

References