Notes

An issue from January 2015: infernyx went down, causing the data pipeline to stop working

warning notifications didn’t trigger

was it a process problem or a technical one?

a bit of both: the process wasn’t clear on what did and did not need to be monitored, and some systems were not monitored at all or not adequately monitored

what can we do to make sure that is tested/noticed?

adding more notifications and monitoring to all systems; will be done before the end of Q1 (if not sooner)
raised the need for integration/component testing for data flows

processes around releases, reliability

what happens when we get more than 16k rps (the current expected limitation of the system)?

SQS can’t work across multiple DCs

splice will become 2 apps (see arch. diagram)

red = admin
blue = public (sales, advertisers)

when we add the public ui, how do we test permissions?

various logins with permission ‘sets’ configured
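
A minimal sketch of how the permission 'sets' could be checked table-style. The role names follow the admin/sales/advertiser split above; the action names and the `is_allowed` stand-in are hypothetical, since the real account types and permissions are not defined yet.

```python
# Hypothetical permission sets; the action names are placeholders only.
PERMISSION_SETS = {
    "admin": {"view_campaign", "edit_campaign", "manage_users"},
    "advertiser": {"view_campaign", "edit_campaign"},
    "sales": {"view_campaign"},
}

ALL_ACTIONS = sorted(set().union(*PERMISSION_SETS.values()))


def is_allowed(role, action):
    """Stand-in for the application's real permission check (assumption)."""
    return action in PERMISSION_SETS.get(role, set())


def permission_matrix():
    """Return every (role, action, expected) triple, allowed or denied."""
    return [
        (role, action, action in allowed)
        for role, allowed in sorted(PERMISSION_SETS.items())
        for action in ALL_ACTIONS
    ]


def run_permission_checks(check=is_allowed):
    """Return the (role, action) pairs where `check` disagrees with the matrix."""
    return [
        (role, action)
        for role, action, expected in permission_matrix()
        if check(role, action) != expected
    ]
```

Checking the denied combinations as well as the allowed ones is the point here: a public login that can do too much matters more than one that can do too little.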

will soon be storing more campaign related data

start/end dates, advertisers, cost, etc.

Priority of upcoming changes:

1st. alerting, notifications, reliability
2nd. related tiles
3rd. sqs pipeline refactor
4th. admin and public facing splice interfaces

data policy

contains personally identifiable info (right now IPs and tile distribution)
possible search terms

ops and process?

travis -> daniel and benson
smooth out dev -> qa -> ops
bring staging closer to production
configuration differences?
how to determine that all modules work together and that dependencies are met
notifications? stackdriver and pagerduty? cloudwatch
should become one team or more tightly integrated?

no e2e or regression tests

test various requests with a variety of UAs, end to end
load tool to fire 100k impressions (required to trigger data processing)
requires 120 log files to spin up (size of each log file is not that important)

staging is not entirely set up
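
A rough sketch of how a load tool could plan that volume: 100k synthetic impressions, cycling through a few user agents, split across 120 batches (one per log file). The record fields are hypothetical and are not the real log format; only the counts come from the notes above.

```python
import itertools

# A few Firefox 35 user agent strings used to vary the synthetic requests.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:35.0) Gecko/20100101 Firefox/35.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0",
]


def split_evenly(total, parts):
    """Split `total` items into `parts` sizes that differ by at most one."""
    base, extra = divmod(total, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


def synthetic_impressions(total, user_agents=USER_AGENTS):
    """Yield `total` placeholder impression records, cycling the user agents."""
    agents = itertools.cycle(user_agents)
    for i in range(total):
        # Hypothetical record shape, for illustration only.
        yield {"id": i, "ua": next(agents), "event": "impression"}


def build_batches(total=100_000, n_files=120):
    """Group the impressions into one batch per log file."""
    records = synthetic_impressions(total)
    return [list(itertools.islice(records, size))
            for size in split_evenly(total, n_files)]
```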

uses disco - lightweight map reduce pipeline http://discoproject.org/

OPS chat:

What are you going to be looking at doing over the next quarter or two?

Ops and process?

travis -> daniel and benson
deploying from jenkins: dev -> stage, and then stage -> prod
general ops jenkins type work for the quarter

need to test beyond scale of loadsv1

need to test slow ramp up over time
refresh dns over that period
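
One way to plan such a ramp is a simple linear schedule with periodic DNS refresh points, presumably so the load generator picks up any new load balancer addresses as capacity is added. All parameter names and numbers below are illustrative assumptions, not loads settings.

```python
def ramp_schedule(start_rps, peak_rps, duration_s, step_s, dns_refresh_s):
    """Return a linear ramp as a list of steps.

    Each step holds the time offset in seconds, the target requests per
    second, and whether DNS should be re-resolved at that point.
    """
    steps = duration_s // step_s
    if steps < 1:
        raise ValueError("duration_s must be at least one step_s long")
    schedule = []
    for i in range(steps + 1):
        t = i * step_s
        # Integer interpolation between the start and peak rates.
        rps = start_rps + (peak_rps - start_rps) * i // steps
        schedule.append({"t": t, "rps": rps, "refresh_dns": t % dns_refresh_s == 0})
    return schedule
```

For example, `ramp_schedule(1000, 20000, 600, 60, 120)` ramps from 1k to 20k rps (past the 16k expected limit) over ten minutes, re-resolving DNS every two minutes.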

how can we smooth out dev -> qa -> ops
need to bring staging closer to production
are there any configuration differences? is it under puppet? yes it is, so it will be nearly identical in both environments

automated testing goal: need to determine that all modules work together and that dependencies are met

test that all the ports on all the hosts are reachable
upload a test file into mapreduce
can upload a log to onyx, verify that the log is added to queue list and that it ends up in mapreduce
run a job from scheduling against the test file
can connect to redshift and authenticate
can connect to postgres as each type of user (i assume there are admin, read/write, and read only users?)
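
The first item (port reachability) could be sketched with a plain TCP connect, as below. The host and port list would come from the real inventory; nothing here is specific to the tiles hosts, and the function names are made up for illustration.

```python
import socket


def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable(targets, timeout=2.0):
    """Return the (host, port) pairs from `targets` that could not be reached."""
    return [(host, port) for host, port in targets
            if not port_reachable(host, port, timeout)]
```

A TCP connect only proves that something is listening; the redshift and postgres items still need a real login to prove authentication works.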

some tests monitor the system in addition to testing functionality (e.g. can connect to redshift and authenticate); these tests could be run in production, with an alert triggered in stackdriver if one fails

cron job to run tests, upload result status and trigger alert if failure?
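
A possible shape for that cron job, as a sketch: run each named check, treat an exception as a failure, and call an alert hook if anything failed. The `alert` callable is a placeholder for whatever ends up posting to stackdriver or pagerduty; no real alerting API is used here.

```python
def run_checks(checks):
    """Run each named check and return {name: passed}.

    `checks` maps a name to a zero-argument callable. A check that raises
    is recorded as failed, so one broken check does not stop the run.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def report(results, alert):
    """Call `alert` with a summary if any check failed; return an exit code."""
    failed = sorted(name for name, ok in results.items() if not ok)
    if failed:
        alert("checks failed: " + ", ".join(failed))
    return 1 if failed else 0
```

Returning a non-zero exit code lets cron or jenkins flag the run as failed even if the alert hook itself is down.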

how are alerts and notifications done?

stackdriver, pagerduty, and cloudwatch
stackdriver: check that the node can get traffic, heartbeat
app level: load balancer should shut a node down if it's returning 500s

datadog coming online at some point

devs are thinking about redoing some existing servers and adding new servers

How are deployments done?

redoing pipeline to standard ops jenkins pipeline

Next Steps

Note: We should push for content services to hire their own QA person (whether under my team or directly managed by them). In the interim, Karl will be available as much as he can for this.

Related tiles acceptance testing

does it meet the specs or design docs (request from dev/product)
does it show and not show when appropriate, behave correctly (interactions), show the correct tile when it should, handle load/performance, etc.
load and regression testing for all of tiles

Admin and public facing UI

will need to create a full test plan for this, particularly as it is public facing, some items include:

test login/log out, account creation
permission validation for various account types (what are the account types, and what is the permission breakdown for each?)
asset pipeline functionality (can i upload images, image types, file size limits)
campaign management (start/end dates, cost, name, etc.)
security concerns (confer with Yvan Boily’s team)
Automated tests, monitoring, for the pipeline as outlined in the quality overview doc
Failure testing to test that a) the notifications fire and b) we can recover easily

Tiles/Testing Notes
