Socorro:OperationalMetrics

From MozillaWiki

This is a list of the major components of Socorro 2.x. Each component has two sections:

  • Status holds metrics that reflect the current point in time or the most recent event. These are the metrics you would consult when diagnosing a Nagios alert or investigating a possible problem. Ideally it would also offer some interactivity, such as letting a user log a comment about an error, to support collaborative maintenance.
  • Trend holds periodic snapshots of metrics (e.g. per minute) that give a longer-term view of the health and performance of the system. Marking events such as config changes and code pushes on the trendlines makes it possible to quickly correlate changes in system health with those events.
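
As an illustration of the Trend idea, here is a minimal Python sketch of a series of periodic snapshots with event markers attached. The `Trend` class and its method names are hypothetical and are not part of Socorro; the sketch only shows how a dip in a metric could be matched against recent config or code changes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Trend:
    """Periodic snapshots of one metric plus markers for operational events."""
    samples: List[Tuple[int, float]] = field(default_factory=list)  # (unix_ts, value)
    markers: List[Tuple[int, str]] = field(default_factory=list)    # (unix_ts, label)

    def record(self, ts: int, value: float) -> None:
        """Store one snapshot of the metric."""
        self.samples.append((ts, value))

    def mark(self, ts: int, label: str) -> None:
        """Store an event marker such as a config change or code push."""
        self.markers.append((ts, label))

    def markers_before(self, ts: int, window: int) -> List[str]:
        """Labels of events in the `window` seconds leading up to `ts`."""
        return [label for t, label in self.markers if ts - window <= t <= ts]


trend = Trend()
trend.record(0, 120.0)                         # reports collected in minute 0
trend.mark(30, "config change: throttle rate")
trend.record(60, 45.0)                         # noticeable drop one minute later
recent = trend.markers_before(60, window=60)   # events that may explain the drop
```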

Please add ideas for new metrics, or add comments about potential problems or changes for existing metrics.

Existing metrics sources

Components

Collector

Status

  • Number of nodes
  • Build/Release label
  • Config info
  • Total reports collected
  • Total throttled reports collected
  • List of node info
    • Uptime
    • Last failure
      • time
      • stacktrace
      • comments
    • Pct reports collected
    • Pct throttled reports collected
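
The per-node "Pct" figures could be derived from per-node counts roughly as follows. This is a sketch only: the function name and the node names are made up for illustration, and it assumes each node reports a simple running count.

```python
def node_shares(counts):
    """Return each node's share of the total as a percentage.

    `counts` maps a node name to the number of reports it handled.
    """
    total = sum(counts.values())
    if total == 0:
        # No reports yet: avoid dividing by zero and report 0% everywhere.
        return {node: 0.0 for node in counts}
    return {node: 100.0 * n / total for node, n in counts.items()}


shares = node_shares({"collector1": 300, "collector2": 100})
```

The same calculation would apply to the Processor and DBFeeder node lists.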

Trend

  • Number of nodes
  • Config change events
  • Code change events
  • Errors
  • Reports collected [4]
  • Throttled reports collected


Processor

Status

  • Number of nodes [5]
  • Build/Release label
  • Config info
  • Total reports processed
  • Total throttled reports processed [5]
  • List of node info
    • Uptime
    • Last failure
      • time
      • stacktrace
      • comments
    • Pct reports processed
    • Pct throttled reports processed

Trend

  • Number of nodes [5]
  • Config change events
  • Code change events
  • Errors
  • Reports processed
  • Throttled reports processed [5]
  • Reports processed with warnings
  • Report processing failures


DBFeeder

Status

  • Number of nodes
  • Build/Release label
  • Config info
  • Total reports processed
  • Total priority reports processed
  • Total throttled reports processed
  • List of node info
    • Uptime
    • Last failure
      • time
      • stacktrace
      • comments
    • Pct reports processed
    • Pct throttled reports processed

Trend

  • Number of nodes [5]
  • Config change events
  • Code change events
  • Errors
  • Reports processed
  • Throttled reports processed [5]
  • Reports processed with warnings
  • Report processing failures


Stackwalk Symbol Server

Status

  • Build/Release label
  • Config info
  • Uptime
  • Last failure
    • time
    • stacktrace
    • comments
  • Number of symbols loaded
  • Time since oldest symbol was used

Trend

  • Config change events
  • Code change events
  • Errors
  • Symbols loaded
  • Symbols dropped
  • Symbol cache hits
  • Symbol cache misses
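
The cache hit and miss counts are often easiest to read as a ratio. A minimal sketch, assuming both are plain counters sampled over the same interval (the function name is hypothetical):

```python
def cache_hit_ratio(hits, misses):
    """Fraction of symbol lookups served from the cache, or None if idle."""
    lookups = hits + misses
    if lookups == 0:
        return None  # no lookups in this interval, so the ratio is undefined
    return hits / lookups


ratio = cache_hit_ratio(hits=90, misses=10)
```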

HBase Cluster

Status

  • Cluster uptime
  • Number of nodes [1]
  • Number of regions [1]
  • Avg regions per node [1]
  • RegionServer with Min regions [1]
  • RegionServer with Max regions [1]
  • Youngest RegionServer uptime
  • Oldest RegionServer uptime
  • Build/Release label
  • Config info
  • Uptime
  • Last failure
    • time
    • stacktrace
    • comments
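
The region distribution metrics above (average, min, max) can be computed from a single mapping of RegionServer to region count. The sketch below assumes such a mapping is already available; how it is fetched from HBase is out of scope, and the server names are invented.

```python
def region_summary(regions_by_server):
    """Summarize how regions are spread across RegionServers."""
    if not regions_by_server:
        raise ValueError("no RegionServers reported")
    total = sum(regions_by_server.values())
    return {
        "nodes": len(regions_by_server),
        "regions": total,
        "avg_regions_per_node": total / len(regions_by_server),
        # Servers holding the fewest and the most regions.
        "min_server": min(regions_by_server, key=regions_by_server.get),
        "max_server": max(regions_by_server, key=regions_by_server.get),
    }


summary = region_summary({"rs1": 40, "rs2": 55, "rs3": 25})
```

A large gap between the min and max servers would suggest the cluster needs rebalancing.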

Trend

  • Number of nodes [2]
  • Number of regions [2]
  • Config change events
  • Code change events
  • Errors
  • RegionServer down event [2]
  • RegionServer up event [2]


Hadoop Cluster

Status

  • Cluster uptime
  • Number of nodes [6]
    • Live [6]
    • Dead [6]
    • Decommissioning [6]
  • Number of files [6]
  • Number of blocks [6]
    • Under-replicated blocks [6]
  • Heap size [6]
  • Capacity [6]
  • DFS Used [6]
  • Non-DFS Used [6]
  • DFS Remaining [6]
  • Build/Release label [6]
  • Config info
  • Uptime [6]
  • Last failure
    • time
    • stacktrace
    • comments

Trend

  • Number of nodes [6]
    • Live [6]
    • Dead [6]
    • Decommissioning [6]
  • Number of files [6]
  • Number of blocks [6]
    • Under-replicated blocks [6]
  • Heap size [6]
  • Capacity [6]
  • DFS Used [6]
  • Non-DFS Used [6]
  • DFS Remaining [6]
  • Config change events
  • Code change events
  • Errors


Zookeeper Cluster

Status

  • Cluster uptime
  • Number of members
  • Number of nodes
  • Build/Release label
  • Config info
  • Uptime
  • Last failure
    • time
    • stacktrace
    • comments

Trend

  • Number of members
  • Number of nodes
  • Number of regions
  • Config change events
  • Code change events
  • Errors


Postgres DB

Status

  • PostgreSQL master up
  • PostgreSQL master accepting connections
  • PostgreSQL standby up
  • PostgreSQL standby accepting connections
  • pgBouncer up
  • pgBouncer accepting connections
  • Replication running

Resource low-point warnings:

  • 90% of connections used
  • 90% of disk space used
  • FS cache space below 40GB
  • Idle-in-transaction (IIT) connections > 30
  • Swapping
  • Too many PostgreSQL log files
  • Too many archive log files
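
The warnings with explicit thresholds could be evaluated with a check along these lines. This is a sketch under assumptions: the function and parameter names are hypothetical, the inputs are assumed to be collected elsewhere, and the two "too many log files" warnings are left out because the list gives no threshold for them.

```python
def resource_warnings(conn_used, conn_max, disk_used_pct,
                      fs_cache_gb, iit_connections, swapping):
    """Return the names of the resource warnings that currently apply."""
    warnings = []
    # Integer comparison avoids floating-point rounding at exactly 90%.
    if conn_used * 100 >= 90 * conn_max:
        warnings.append("connections")
    if disk_used_pct >= 90:
        warnings.append("disk")
    if fs_cache_gb < 40:
        warnings.append("fs_cache")
    if iit_connections > 30:
        warnings.append("iit")
    if swapping:
        warnings.append("swapping")
    return warnings


active = resource_warnings(conn_used=95, conn_max=100, disk_used_pct=50.0,
                           fs_cache_gb=64.0, iit_connections=31,
                           swapping=False)
```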

Trend

Slow query logging (pgfouine) -- not part of Ganglia

  • Database size
  • TCBS (top crashes by signature) size
  • Reports partition size
  • Replication delay
  • # of pooled connections
  • # of DB connections
  • Length and number of IIT connections
  • Memory usage for: postgres processes, fs cache
  • I/O metrics
  • CPU metrics
  • Query spill-to-disk
  • Response time for a preset query or set of queries
  • Database bloat

Middleware Layer

Status

Trend

UI

Status

  • Number of nodes (?)

Trend

  • Number of nodes (?)


Jobs

Status

  • List of jobs scheduled
    • name
    • time
    • description
    • owner
    • link to results
  • Recent failures
    • name
    • time
    • reason
    • logs
    • blame (e.g. cvs/svn blame?)
    • comments

Trend

  • Executions
  • Execution durations
  • Failure times