Auto-tools/Projects/Pulse/PulseGuardian

Latest revision as of 05:02, 8 September 2021

Status

PulseGuardian is available at https://pulseguardian.mozilla.org.

Code can be found at https://github.com/mozilla-services/pulseguardian.

Team

  • Owner: jwhitlock
  • Contributors: akachkach, mccricardo, Sherry Shi

Problem

Pulse uses RabbitMQ as a pub/sub service which formerly allowed anyone to subscribe to any exchange via a common user account. Some client applications use durable queues in case they crash; however, sometimes these queues are created by accident, and sometimes apps crash without admins noticing. In these cases, the queues continue to grow without bound, which can eventually result in the RabbitMQ host running out of memory. Our previous solution was to have Nagios monitor the queues and send alerts when any queue exceeded a certain number of unread or unacknowledged messages, at which point a RabbitMQ admin attempted to find the person responsible and/or delete the offending queue.

Goals & Considerations

First, a few definitions:

  • A PulseGuardian user is a human user, identified by an email address.
  • A Pulse user or RabbitMQ user is a user account in Pulse's RabbitMQ cluster. It is identified by a unique user ID.
  • The max_queue_length of any queue is the maximum permitted number of unread and/or unacknowledged messages in that queue. This value is either a single, static value for all queues or determined dynamically by some algorithm, possibly taking system state into account (e.g. when a queue is deleted may depend on how many messages are currently in other queues). Currently, only a single, static value is supported.
  • The warn_queue_length is a number between 0 and max_queue_length for a given queue at which point a queue-length warning is issued. It may be a single, static value, or determined dynamically by some algorithm, as with max_queue_length.

The primary goal is management of:

  • Pulse users. A Pulse user is owned by one or more PulseGuardian users (currently, only one is supported). PulseGuardian users can own multiple Pulse users and create new Pulse users and delete any Pulse users they own.
  • Queues. A queue should be associated with a Pulse user (the queue's creator). A PulseGuardian user can see the length of, and delete, any queue associated with a Pulse user it owns. If a queue's length moves from a value less than warn_queue_length to a value equal to or exceeding it, PulseGuardian will email a warning, with details on the offending queue, to the PulseGuardian user that owns the associated Pulse user. Similarly, if the length later drops from warn_queue_length or above to a value below it, PulseGuardian will email a notification. There may be some additional algorithm to prevent a flood of emails if a queue hovers around warn_queue_length or repeatedly spikes above it. If the queue's length exceeds max_queue_length, the queue is deleted and an email is sent to the PulseGuardian user owning the Pulse user associated with the queue.
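
The threshold rules in the Queues item can be sketched as a small state-transition check. This is only an illustration, not PulseGuardian's actual code: the names QueueEvent and classify_transition are hypothetical, and it assumes that "exceeds max_queue_length" means strictly greater than, and that a recovery notification fires when the length drops from warn_queue_length or above to below it.

```python
from enum import Enum
from typing import Optional


class QueueEvent(Enum):
    """Hypothetical event names; not taken from the PulseGuardian source."""
    WARN = "warn"            # length rose to or above warn_queue_length
    RECOVERED = "recovered"  # length fell back below warn_queue_length
    DELETE = "delete"        # length exceeded max_queue_length


def classify_transition(previous: int, current: int,
                        warn_queue_length: int,
                        max_queue_length: int) -> Optional[QueueEvent]:
    """Compare two consecutive length readings and return the event, if any.

    Only a crossing of warn_queue_length produces a WARN or RECOVERED event,
    so a queue that stays above the threshold does not trigger repeated emails.
    """
    if current > max_queue_length:
        return QueueEvent.DELETE
    if previous < warn_queue_length <= current:
        return QueueEvent.WARN
    if current < warn_queue_length <= previous:
        return QueueEvent.RECOVERED
    return None
```

Because the check depends on the previous reading, a caller would need to store the last observed length of each queue between polls. A queue that hovers right at the threshold would still alternate between WARN and RECOVERED, which is the case the "additional algorithm" mentioned above would have to address.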

There are other RabbitMQ-management functions we can put into PulseGuardian as well, depending on their benefit to users, including extra notification email addresses and exchange management.

Design and Approach

Although the data currently in Pulse is not confidential, for accountability and to prevent possible abuse, PulseGuardian should be restricted to vouched Mozillians. Logging in should be performed via Persona, authenticating with mozillians.org, to obtain an email address. A new PulseGuardian user is created if none is associated with the given email address. After logging in, users can then create a RabbitMQ user account (see the Pulse security model for default permissions), which will be linked to the associated PulseGuardian user. A password will need to be entered, but it should not be saved in PulseGuardian. We may want to provide the ability to use, or even require, a randomly generated password (a sort of API key).

The second part, alongside the web app described above, is a process that polls RabbitMQ, looking for queues that have grown above warn_queue_length. If the queue belongs to a Pulse user associated with a PulseGuardian user account (ideally all should, but it is not absolutely required), a warning email is sent containing the queue name and current queue length. If max_queue_length is reached, the queue is deleted, and another email is sent. If the Pulse user is not associated with a PulseGuardian user, that is, it was created directly in RabbitMQ, or if the queue is not associated with a Pulse user, the queue is deleted without a user notification when max_queue_length is reached (no action is performed at warn_queue_length).

Optionally, we can have admin email addresses that are also sent all notifications, including when there is no owner.
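
The notification rules above (owner present or absent, plus optional admin addresses) can be summarized as a recipient-selection function. This is a minimal sketch under stated assumptions: the function name recipients_for and the event strings "warn" and "delete" are hypothetical, and it assumes that admins receive a copy of every notification that is actually sent, while an ownerless queue still triggers nothing at warn_queue_length.

```python
from typing import List, Optional, Sequence


def recipients_for(event: str,
                   owner_email: Optional[str],
                   admin_emails: Sequence[str] = ()) -> List[str]:
    """Return the addresses to notify for a "warn" or "delete" event.

    owner_email is None when the queue has no associated PulseGuardian user.
    """
    if event not in ("warn", "delete"):
        raise ValueError("unknown event: {!r}".format(event))

    # No action is performed at warn_queue_length for ownerless queues.
    if event == "warn" and owner_email is None:
        return []

    recipients = [] if owner_email is None else [owner_email]
    # Admins are copied on everything, including deletions with no owner.
    for address in admin_emails:
        if address not in recipients:
            recipients.append(address)
    return recipients
```

For example, recipients_for("delete", None, ["admin@example.com"]) yields only the admin address, matching the case where a queue is deleted without a user notification.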

Notes

The Pulse user associated with a queue can be determined from the queue's name, since queue names follow a set format enforced by RabbitMQ user permissions. However, given the coarse granularity of RabbitMQ permissions, a user can technically create a queue in the exchange namespace and vice versa. We could have PulseGuardian immediately delete these.
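
As an illustration of deriving the owner from a queue name, the sketch below assumes a naming format of queue/<pulse user>/<anything>. This page does not spell out the format, so treat the prefix and the helper name pulse_user_from_queue_name as assumptions; the Pulse security model is the authoritative source.

```python
from typing import Optional

# Assumed naming convention; the exact format is defined by the Pulse
# security model, not by this sketch.
QUEUE_PREFIX = "queue/"


def pulse_user_from_queue_name(queue_name: str) -> Optional[str]:
    """Return the Pulse user encoded in a queue name, or None if the name
    does not match the assumed queue/<user>/<rest> format."""
    if not queue_name.startswith(QUEUE_PREFIX):
        return None
    user, separator, rest = queue_name[len(QUEUE_PREFIX):].partition("/")
    if not user or not separator or not rest:
        return None
    return user
```

A queue whose name yields None here, such as one created in the exchange namespace, would be a candidate for the immediate deletion suggested above.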

Implementation

PulseGuardian uses Flask for the user management app and SQLAlchemy + PostgreSQL to store user data.

Communication with RabbitMQ is done via the RabbitMQ management plugin's REST API.
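
To give a feel for the management API, the sketch below builds the URL of a single queue resource and extracts queue lengths from the JSON returned by GET /api/queues. It is not PulseGuardian's own client code: the helper names are hypothetical, the host and port in the usage note are placeholders, and no HTTP request is made, so authentication and error handling are left out.

```python
import json
from typing import Dict
from urllib.parse import quote


def queue_url(base_url: str, vhost: str, queue_name: str) -> str:
    """Build the management API URL for one queue.

    The vhost and queue name are percent-encoded in full, so the default
    vhost "/" becomes "%2F" and slashes in queue names are escaped too.
    """
    return "{}/api/queues/{}/{}".format(
        base_url.rstrip("/"),
        quote(vhost, safe=""),
        quote(queue_name, safe=""),
    )


def queue_lengths(api_response: str) -> Dict[str, int]:
    """Map queue name to total message count from a GET /api/queues body.

    The "messages" field can be missing for a queue that has no statistics
    yet, so it defaults to 0 here.
    """
    queues = json.loads(api_response)
    return {queue["name"]: queue.get("messages", 0) for queue in queues}
```

With a placeholder base URL such as http://localhost:15672, a DELETE request to queue_url(base, "/", name) is how a queue would be removed through the same API.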

The production app is deployed via Heroku.