An issue from January 2015: infernyx dropped, causing the data pipeline to stop working
- warning notifications didn’t trigger
was it process or technical detail?
a bit of both: the process wasn’t clear on what did and did not need to be monitored, and some systems were not monitored or not adequately monitored
what can we do to make sure that is tested/noticed?
adding more notifications and monitoring to all systems; this will be done before the end of q1 (if not sooner). This also raised the need for integration/component testing for data flows
processes around releases, reliability
what happens when we get more than 16k rps (the current expected limitation of the system)?
sqs can’t work across multiple data centers
splice will become 2 apps (see arch. diagram)
- red = admin
- blue = public (sales, advertisers)
when we add the public ui, how do we test permissions?
- various logins with permission ‘sets’ configured
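One way to drive the permission ‘sets’ idea is to expand each set into an explicit list of expected allow/deny cases and run every login against it. The sketch below is only an illustration: the role names and actions are hypothetical, since the real account types and permissions are not defined in these notes.

```python
# Hypothetical permission sets; the real account types and actions
# are still an open question in these notes.
PERMISSION_SETS = {
    "admin": {"view_campaign", "edit_campaign", "manage_users"},
    "advertiser": {"view_campaign", "edit_campaign"},
    "sales": {"view_campaign"},
}

# Every action that appears in at least one set.
ALL_ACTIONS = sorted(set().union(*PERMISSION_SETS.values()))


def is_allowed(role, action):
    """Return True if the role's permission set contains the action."""
    return action in PERMISSION_SETS.get(role, set())


def permission_matrix():
    """Return every (role, action, expected_allowed) case for the login tests."""
    return [
        (role, action, is_allowed(role, action))
        for role in sorted(PERMISSION_SETS)
        for action in ALL_ACTIONS
    ]
```

Each tuple can then feed one test: log in as the role, attempt the action, and compare the outcome with the expected flag, so denied actions get tested as well as allowed ones.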
will soon be storing more campaign related data
- start end dates, advertisers, cost, etc.
Priority of upcoming changes:
- 1st. alerting, notifications, reliability
- 2nd. related tiles
- 3rd. sqs pipeline refactor
- 4th. admin and public facing splice interfaces
- contains personally identifiable info (right now IPs and tile distribution)
- possible search terms
ops and process?
- travis -> daniel and benson
- smooth out dev -> qa -> ops
- bring staging closer to production
- configuration differences?
- how to determine that all modules work together and that dependencies are met
- notifications? stackdriver and pagerduty? cloudwatch
- should become one team or more tightly integrated?
no e2e tests, regression
- test various requests with a variety of UAs, end to end
- load tool to fire 100k impressions (required to get data processing)
- required 120 log files to spin up (size of each log file is not that important)
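A minimal sketch of how a load tool could plan the numbers above, assuming the 100k impressions are spread evenly across the 120 log files (the notes do not say how the two requirements combine). The user agent strings, the `tile_id` range and the record shape are hypothetical.

```python
import random

# Hypothetical user agents, for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0",
    "Mozilla/5.0 (Windows NT 6.1; rv:35.0) Gecko/20100101 Firefox/35.0",
]


def plan_log_files(total_impressions=100_000, num_files=120):
    """Split the impressions as evenly as possible across the log files."""
    base, extra = divmod(total_impressions, num_files)
    # The first `extra` files take one additional impression each.
    return [base + (1 if i < extra else 0) for i in range(num_files)]


def fake_impressions(count, seed=0):
    """Yield `count` synthetic impression records with varied user agents."""
    rng = random.Random(seed)
    for _ in range(count):
        yield {"tile_id": rng.randint(1, 50), "ua": rng.choice(USER_AGENTS)}
```

With the defaults, `plan_log_files()` gives 40 files of 834 impressions and 80 files of 833, which adds up to 100,000.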
staging is not entirely set up
uses disco - lightweight map reduce pipeline http://discoproject.org/
What are you going to be looking at doing over the next quarter or two?
Ops and process?
- travis -> daniel and benson
- deploying with jenkins from dev -> stage and then stage -> prod
- general ops jenkins type work for the quarter
need to test beyond scale of loadsv1
- need to test slow ramp up over time
- refresh dns over that period
- how can we smooth out dev -> qa -> ops
- need to bring staging closer to production
- are there any configuration differences? is it under puppet? yes, it is under puppet, so it will be fairly identical in both environments
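For the slow ramp-up item above, a rough sketch of a ramp plan that climbs past the 16k rps expected limit, with a DNS re-resolve helper that a load tool could call at each stage. The function names, the linear shape and the example figures are assumptions, not part of loads v1.

```python
import socket


def ramp_schedule(start_rps, peak_rps, steps):
    """Linear ramp from start_rps to peak_rps (inclusive) over `steps` stages."""
    if steps < 2:
        raise ValueError("need at least two steps")
    delta = (peak_rps - start_rps) / (steps - 1)
    return [round(start_rps + i * delta) for i in range(steps)]


def resolve_targets(host, port=443):
    """Re-resolve the hostname; call once per stage to pick up DNS changes."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Each entry ends with a sockaddr tuple whose first element is the IP.
    return sorted({info[4][0] for info in infos})
```

For example, `ramp_schedule(1000, 20000, 5)` yields stages of 1000, 5750, 10500, 15250 and 20000 rps, so the last stage exercises the system beyond the 16k rps limit.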
automated testing goal: need to determine that all modules work together and that dependencies are met
- test that all the ports on all the hosts are reachable
- upload a test file into mapreduce
- can upload a log to onyx, verify that the log is added to queue list and that it ends up in mapreduce
- run a job from scheduling against the test file
- can connect to redshift and authenticate
- can connect to postgres as each type of user (i assume there are admin, read/write, and read only users?)
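The first check in the list (all ports on all hosts reachable) can be sketched with the standard library alone. The host names and ports in any real target list would come from the environment's configuration; none are given in these notes, so the caller supplies them.

```python
import socket


def port_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def check_ports(targets, timeout=2.0):
    """Map each (host, port) pair to whether it is reachable."""
    return {(host, port): port_is_reachable(host, port, timeout)
            for host, port in targets}
```

A TCP connect only proves that something is listening; the redshift and postgres items still need a real login with each type of user.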
some tests monitor the system in addition to testing functionality (like can connect to redshift and auth); these tests could be run in production, and an alert could be triggered in stackdriver if one fails
- cron job to run tests, upload result status and trigger alert if failure?
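A minimal sketch of the cron-driven runner idea, assuming each check is a plain callable that returns True on success. The upload and alerting steps are left out because the notes do not specify how results reach stackdriver; the names below are hypothetical.

```python
def run_checks(checks):
    """Run each named check; a raised exception counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def summarize(results):
    """Collapse per-check results into an overall status plus failed names."""
    failed = sorted(name for name, ok in results.items() if not ok)
    return {"status": "fail" if failed else "ok", "failed": failed}


def exit_code(summary):
    """Non-zero exit code on failure, so a cron wrapper can raise an alert."""
    return 0 if summary["status"] == "ok" else 1
```

The summary dict is small enough to upload as the result status, and the exit code gives the cron wrapper a simple signal for triggering the alert.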
how are alerts and notifications done?
- stackdriver and pagerduty, cloudwatch
- stackdriver, check that the node can get traffic, heartbeat
- app level: load balancer should shut that down if it’s returning 500s
datadog coming on at some point
devs are thinking about redoing some existing servers and adding new servers
How are deployments done?
- redoing pipeline to standard ops jenkins pipeline
Note: We should push for content services to hire their own QA person (whether under my team or directly managed by them). In the interim, Karl will be available as much as he can for this.
Related tiles acceptance testing
- does it meet the specs or design docs (request from dev/product)
- shows and doesn’t show when appropriate, does it behave correctly (interactions), show the correct tile when it should, handles load/performance, etc.
- load and regression testing for all of tiles
Admin and public facing UI
will need to create a full test plan for this, particularly as it is public facing, some items include:
- test login/log out, account creation
- permission validation for various account types (what are the account types, and what is the permission breakdown for each?)
- asset pipeline functionality (can i upload images, image types, file size limits)
- campaign management (start/end dates, cost, name, etc.)
- security concerns (confer with Yvan Boily’s team)
- Automated tests, monitoring, for the pipeline as outlined in the quality overview doc
- Failure testing to test that a) the notifications fire and b) we can recover easily