Auto-tools/Projects/Pulse/PulseGuardian: Difference between revisions

From MozillaWiki

Latest revision as of 05:02, 8 September 2021

Status

PulseGuardian is available at https://pulseguardian.mozilla.org.

Code can be found at https://github.com/mozilla-services/pulseguardian.

Team

  • Owner: jwhitlock
  • Contributors: akachkach, mccricardo, Sherry Shi

Problem

Pulse uses RabbitMQ as a pub/sub service which formerly allowed anyone to subscribe to any exchange via a common user account. Some client applications use durable queues in case they crash; however, sometimes these queues are created by accident, and sometimes apps crash without admins noticing. In these cases, the queues continue to grow without bound, which can eventually result in the RabbitMQ host running out of memory. Our previous solution was to have Nagios monitor the queues and send alerts when any queues exceed a certain number of unread or unacknowledged messages, at which point a RabbitMQ admin attempted to find the person responsible and/or delete the offending queue.

Goals & Considerations

First, some definitions:

  • A PulseGuardian user is a human user, identified by an email address.
  • A Pulse user or RabbitMQ user is a user account in Pulse's RabbitMQ cluster. It is identified by a unique user ID.
  • The max_queue_length of any queue is the maximum permitted number of unread and/or unacknowledged messages in that queue. This value is either a single, static value for all queues or is determined dynamically by some algorithm, possibly taking system state into account (e.g. whether a queue is deleted may depend on how many messages are currently in other queues). Currently, only a single, static value is supported.
  • The warn_queue_length is a number between 0 and max_queue_length for a given queue at which point a queue-length warning is issued. It may be a single, static value, or determined dynamically by some algorithm, as with max_queue_length.

The primary goal is management of:

  • Pulse users. A Pulse user is owned by one or more PulseGuardian users (currently, only one is supported). PulseGuardian users can own multiple Pulse users and create new Pulse users and delete any Pulse users they own.
  • Queues. A queue should be associated with a Pulse user (the queue's creator). A PulseGuardian user can see the length of, and delete, any queue associated with a Pulse user it owns. If a queue's length ever reaches or exceeds warn_queue_length, that is, moves from a value less than warn_queue_length to a value equal to or exceeding it, PulseGuardian will email a warning, with details on the offending queue, to the PulseGuardian user that owns the Pulse user associated with the queue. Similarly, if a queue's length moves from a value at or above warn_queue_length to a value below it, PulseGuardian will email a notification. There may be some additional algorithm to prevent a large number of emails from being sent if a queue hovers around warn_queue_length or continually spikes above it. If the queue's length exceeds max_queue_length, the queue is deleted and an email is sent to the PulseGuardian user owning the Pulse user associated with the queue.
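The threshold rules for queues can be illustrated with a small sketch. It is not taken from the PulseGuardian code base; the names Action and classify_transition are hypothetical. It assumes that reaching max_queue_length is enough to trigger deletion, and that the recovery notice fires when the length drops from at or above warn_queue_length to below it.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"            # no threshold was crossed
    WARN = "warn"            # length rose to or past warn_queue_length
    RECOVERED = "recovered"  # length fell back below warn_queue_length
    DELETE = "delete"        # length reached max_queue_length


def classify_transition(previous, current, warn_queue_length, max_queue_length):
    """Decide what to do when a queue's length moves from previous to current.

    Hypothetical helper: only threshold crossings produce an action, so a
    queue that stays above warn_queue_length does not trigger repeat emails.
    """
    # Assumption: reaching the maximum counts, not only exceeding it.
    if current >= max_queue_length:
        return Action.DELETE
    if previous < warn_queue_length <= current:
        return Action.WARN
    if current < warn_queue_length <= previous:
        return Action.RECOVERED
    return Action.NONE
```

Comparing the previous and current lengths, rather than the current length alone, is what keeps a queue that sits above the warning level from generating an email on every check. It does not by itself address a queue that hovers around the threshold; that would need the additional algorithm mentioned above.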

There are other RabbitMQ-management functions we can put into PulseGuardian as well, depending on their benefit to users, including extra notification email addresses and exchange management.

Design and Approach

Although the data currently in Pulse is not confidential, for accountability and to prevent possible abuse, PulseGuardian should be restricted to vouched Mozillians. Logging in should be performed via Persona, authenticating with mozillians.org, to obtain an email address. A new user is created if there is none associated with the given email address. After logging in, users can then create a RabbitMQ user account (see the Pulse security model for default permissions), which will be linked to the associated PulseGuardian user. A password will need to be entered, but it should not be saved in PulseGuardian. We may want to provide the ability to, or even require, a randomly generated password (a sort of API key).
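As a rough illustration of the randomly generated password idea, the sketch below uses Python's standard secrets module. The function name and the default length are assumptions, not part of PulseGuardian; the value would be handed to RabbitMQ and shown to the user once rather than stored.

```python
import secrets


def generate_pulse_password(num_bytes=32):
    """Return a random, URL-safe password for a new Pulse user.

    Hypothetical helper: num_bytes is the amount of randomness, not the
    length of the returned string (which is somewhat longer).
    """
    return secrets.token_urlsafe(num_bytes)
```

Because the password is never saved in PulseGuardian, a user who loses it would need to set a new one rather than recover the old one.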

The second part of PulseGuardian is a monitoring process that polls RabbitMQ, looking for queues that have grown above warn_queue_length. If the queue belongs to a Pulse user associated with a PulseGuardian user account (ideally all should, but it is not absolutely required), a warning email is sent containing the queue name and current queue length. If max_queue_length is reached, the queue is deleted, and another email is sent. If the Pulse user is not associated with a PulseGuardian user (that is, it was created directly in RabbitMQ), or if the queue is not associated with a Pulse user at all, the queue is deleted without a user notification when max_queue_length is reached (no action is performed at warn_queue_length).
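The ownership rules in the monitoring process could look roughly like the following sketch. QueueState and plan_actions are hypothetical names, the comparisons treat "reached" as "greater than or equal to", and transition tracking (to avoid repeated warnings) is left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class QueueState:
    name: str
    length: int
    # Email of the owning PulseGuardian user, or None when the queue has no
    # Pulse user or the Pulse user was created directly in RabbitMQ.
    owner_email: Optional[str]


def plan_actions(queue: QueueState, warn_queue_length: int,
                 max_queue_length: int) -> List[Tuple[str, str]]:
    """Return the (verb, target) steps the monitor would take for one queue."""
    if queue.length >= max_queue_length:
        steps = [("delete", queue.name)]
        if queue.owner_email:
            # Owned queues get a notification after deletion.
            steps.append(("email", queue.owner_email))
        return steps
    if queue.length >= warn_queue_length and queue.owner_email:
        return [("email", queue.owner_email)]
    # Unowned queues are left alone until they reach max_queue_length.
    return []
```

Returning a plan instead of performing the deletion and sending mail directly keeps the decision logic easy to test without a RabbitMQ server or mail relay.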

Optionally, we can have admin email addresses that are also sent all notifications, including when there is no owner.

Notes

The Pulse user associated with a queue can be determined from the queue's name, since queue names follow a set format enforced by RabbitMQ user permissions. However, given the coarse granularity of RabbitMQ permissions, a user can technically create a queue in the exchange namespace and vice versa. We could have PulseGuardian immediately delete these.
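A minimal sketch of deriving the owner from a queue name follows. The exact name format is not given here, so the "queue/<pulse user>/<anything>" layout and the function name are assumptions made only for illustration.

```python
from typing import Optional

# Assumed naming convention; the actual format is set by the Pulse
# security model and is not spelled out in this page.
QUEUE_PREFIX = "queue/"


def pulse_user_from_queue_name(queue_name: str) -> Optional[str]:
    """Return the Pulse user encoded in a queue name, or None if it does not fit."""
    if not queue_name.startswith(QUEUE_PREFIX):
        # For example, a queue created in the exchange namespace.
        return None
    remainder = queue_name[len(QUEUE_PREFIX):]
    owner, separator, rest = remainder.partition("/")
    if not owner or not separator or not rest:
        return None
    return owner
```

A None result marks a queue with no recognisable owner, which is the case the monitor would treat as unowned or, for misplaced names, as a candidate for immediate deletion.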

Implementation

PulseGuardian uses Flask for the user management app and SQLAlchemy + PostgreSQL to store user data.

Communication with RabbitMQ is done via the RabbitMQ management plugin's REST API.
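For illustration, requests to the management API can be built with the standard library alone. The sketch below only constructs the requests and does not send them; the helper names, host, and credentials are placeholders, and this is not the client code PulseGuardian itself uses. The GET /api/queues and DELETE /api/queues/<vhost>/<name> endpoints are part of the RabbitMQ management plugin's HTTP API.

```python
import base64
import urllib.parse
import urllib.request


def management_request(base_url, path, user, password, method="GET"):
    """Build an HTTP Basic authenticated request for the management API."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        method=method,
        headers={"Authorization": f"Basic {token}"},
    )


def list_queues_request(base_url, user, password):
    """Request for all queues; each JSON entry includes a 'messages' count."""
    return management_request(base_url, "/api/queues", user, password)


def delete_queue_request(base_url, vhost, queue_name, user, password):
    """Request that deletes one queue; vhost and name must be percent-encoded."""
    path = "/api/queues/{}/{}".format(
        urllib.parse.quote(vhost, safe=""),
        urllib.parse.quote(queue_name, safe=""),
    )
    return management_request(base_url, path, user, password, method="DELETE")
```

Passing a built request to urllib.request.urlopen would perform the call. Note that the default vhost "/" becomes "%2F" in the path, and slashes inside a queue name are encoded the same way.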

The production app is deployed via Heroku.