This is a list of the major components of Socorro 2.x. Each component has two sections:

Status is a place for metrics that reflect the current point in time or the most recent event. These are the metrics that would be useful for diagnosing a Nagios alert or investigating a possible problem. Ideally, some interactivity with the user, such as being able to log a comment about an error, would provide a useful mechanism for collaborative maintenance.
Trend is a place for periodic snapshots of metrics (e.g. per minute) that provide a longer-term view of the health and performance of the system. Markers in the trendlines for events such as config changes would make it possible to quickly correlate health changes in the system with code pushes or config changes.
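
The difference between the two sections can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TrendSeries` and its methods are hypothetical names, not part of Socorro, and timestamps are plain Unix seconds. The status view is simply the most recent sample, while the trend view keeps every periodic sample together with event markers.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TrendSeries:
    """Hypothetical container: periodic samples of one metric plus event markers."""
    name: str
    samples: List[Tuple[int, float]] = field(default_factory=list)  # (unix_ts, value)
    events: List[Tuple[int, str]] = field(default_factory=list)     # (unix_ts, label)

    def record(self, ts: int, value: float) -> None:
        """Append one periodic snapshot (for example, one per minute)."""
        self.samples.append((ts, value))

    def mark_event(self, ts: int, label: str) -> None:
        """Add a marker such as a config change or code push."""
        self.events.append((ts, label))

    def status(self) -> Optional[float]:
        """Status view: the most recent sample, or None if nothing was recorded."""
        return self.samples[-1][1] if self.samples else None

    def events_between(self, start: int, end: int) -> List[str]:
        """Event labels with start <= ts <= end, for correlating with a trend change."""
        return [label for ts, label in self.events if start <= ts <= end]


series = TrendSeries("collector.reports_collected")
series.record(60, 1200.0)
series.record(120, 1180.0)
series.mark_event(150, "config change")
series.record(180, 640.0)
```

Here `series.status()` returns the latest value, and `series.events_between(120, 180)` lists the markers that fall between the last two samples, which is where the drop in this made-up data occurs.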

Please add ideas for new metrics, or add comments about potential problems or changes for existing metrics.

Existing metrics sources

Bracketed numbers next to a metric below, such as [4], refer to the sources in this list.

1. HBase Master Status UI
2. Production Hadoop Cluster Ganglia UI
3. Production Hadoop Cluster Ganglia UI, Socorro Stats (scroll to the very bottom)
4. Metrics Dashboard - raw crash submission
5. Crash Stats Status
6. Hadoop DFS Health

Components

Collector

Status

Number of nodes
Build/Release label
Config info
Total reports collected
Total throttled reports collected
List of node info
- Uptime
- Last failure
  - time
  - stacktrace
  - comments
- Pct reports collected
- Pct throttled reports collected
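
The per-node percentages above could be derived from per-node counts roughly as follows. This is a sketch only: `pct_per_node` and the node names are hypothetical, and it assumes "Pct" means each node's share of the total across all nodes.

```python
from typing import Dict


def pct_per_node(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each node's share of the total across all nodes, as a percentage."""
    total = sum(counts.values())
    if total == 0:
        # No reports at all: report 0% everywhere rather than dividing by zero.
        return {node: 0.0 for node in counts}
    return {node: 100.0 * n / total for node, n in counts.items()}


# Hypothetical per-node counts of collected reports.
collected = {"collector01": 300, "collector02": 100}
shares = pct_per_node(collected)
```

The same function would apply to the throttled counts and to the Processor and DBFeeder node lists, since only the input counts change.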

Trend

Number of nodes
Config change events
Code change events
Errors
Reports collected [4]
Throttled reports collected

Processor

Status

Number of nodes [5]
Build/Release label
Config info
Total reports processed
Total throttled reports processed [5]
List of node info
- Uptime
- Last failure
  - time
  - stacktrace
  - comments
- Pct reports processed
- Pct throttled reports processed

Trend

Number of nodes [5]
Config change events
Code change events
Errors
Reports processed
Throttled reports processed [5]
Reports processed with warnings
Report processing failures

DBFeeder

Status

Number of nodes
Build/Release label
Config info
Total reports processed
Total priority reports processed
Total throttled reports processed
List of node info
- Uptime
- Last failure
  - time
  - stacktrace
  - comments
- Pct reports processed
- Pct throttled reports processed

Trend

Number of nodes [5]
Config change events
Code change events
Errors
Reports processed
Throttled reports processed [5]
Reports processed with warnings
Report processing failures

Stackwalk Symbol Server

Status

Build/Release label
Config info
Uptime
Last failure
- time
- stacktrace
- comments
Number of symbols loaded
Time since oldest symbol was used

Trend

Config change events
Code change events
Errors
Symbol loaded
Symbol dropped
Symbol cache hit
Symbol cache miss
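
The cache hit and miss counters are usually most readable as a ratio. A minimal sketch, assuming the two counters cover the same sampling interval (`cache_hit_ratio` is a hypothetical helper name):

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of symbol lookups served from cache; 0.0 when there were no lookups."""
    lookups = hits + misses
    return hits / lookups if lookups else 0.0


# Hypothetical counts for one sampling interval.
ratio = cache_hit_ratio(hits=3, misses=1)
```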

HBase Cluster

Status

Cluster uptime
Number of nodes [1]
Number of regions [1]
Avg regions per node [1]
RegionServer with Min regions [1]
RegionServer with Max regions [1]
Youngest RegionServer uptime
Oldest RegionServer uptime
Build/Release label
Config info
Uptime
Last failure
- time
- stacktrace
- comments
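
The region figures in the status list above (number of regions, average per node, and the RegionServers with the fewest and most regions) can all be derived from one mapping of RegionServer name to region count. The sketch below assumes that mapping has already been obtained somehow, for example from the HBase Master Status UI; `summarize_regions`, `RegionSummary`, and the server names are hypothetical.

```python
from typing import Dict, NamedTuple


class RegionSummary(NamedTuple):
    nodes: int
    regions: int
    avg_per_node: float
    min_server: str
    max_server: str


def summarize_regions(regions_by_server: Dict[str, int]) -> RegionSummary:
    """Summarize region distribution across RegionServers."""
    if not regions_by_server:
        raise ValueError("no RegionServers reported")
    total = sum(regions_by_server.values())
    return RegionSummary(
        nodes=len(regions_by_server),
        regions=total,
        avg_per_node=total / len(regions_by_server),
        # Ties resolve to the first server encountered in the mapping.
        min_server=min(regions_by_server, key=regions_by_server.get),
        max_server=max(regions_by_server, key=regions_by_server.get),
    )


# Hypothetical region counts per RegionServer.
summary = summarize_regions({"rs1": 40, "rs2": 25, "rs3": 55})
```

A large gap between the minimum and maximum counts relative to the average would suggest the regions are unevenly balanced.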

Trend

Number of nodes [2]
Number of regions [2]
Config change events
Code change events
Errors
RegionServer down event [2]
RegionServer up event [2]

Hadoop Cluster

Status

Cluster uptime
Number of nodes [6]
- Live [6]
- Dead [6]
- Decommissioning [6]
Number of files [6]
Number of blocks [6]
- Under-replicated blocks [6]
Heap size [6]
Capacity [6]
DFS Used [6]
Non-DFS Used [6]
DFS Remaining [6]
Build/Release label [6]
Config info
Uptime [6]
Last failure
- time
- stacktrace
- comments

Trend

Number of nodes [6]
- Live [6]
- Dead [6]
- Decommissioning [6]
Number of files [6]
Number of blocks [6]
- Under-replicated blocks [6]
Heap size [6]
Capacity [6]
DFS Used [6]
Non-DFS Used [6]
DFS Remaining [6]
Config change events
Code change events
Errors

Zookeeper Cluster

Status

Cluster uptime
Number of members
Number of nodes
Build/Release label
Config info
Uptime
Last failure
- time
- stacktrace
- comments

Trend

Number of members
Number of nodes
Number of regions
Config change events
Code change events
Errors

Postgres DB

Status

PostgreSQL master up
PostgreSQL master accepting connections
PostgreSQL standby up
PostgreSQL standby accepting connections
pgBouncer up
pgBouncer accepting connections
Replication running

Resource low-point warnings:

90% of connections used
90% of disk space used
FS cache space below 40GB
Idle-in-transaction (IIT) connections > 30
swapping
too many postgresql log files
too many archive log files
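
The numeric thresholds in this list could be evaluated along these lines. This is a sketch under several assumptions: the metric key names are hypothetical, "90% used" is read as "at or above 90%", 40GB is taken as 40 GiB, and swapping is detected from nonzero swap-in or swap-out rates. The two "too many ... log files" warnings are left out because the list gives no threshold for them.

```python
from typing import Dict, List

FS_CACHE_MIN_BYTES = 40 * 1024 ** 3  # assumption: "40GB" interpreted as 40 GiB


def resource_warnings(m: Dict[str, int]) -> List[str]:
    """Return the warnings from the list above that the given metrics trigger."""
    warnings = []
    # Integer arithmetic avoids floating-point edge cases at exactly 90%.
    if m["connections_used"] * 100 >= 90 * m["max_connections"]:
        warnings.append("90% of connections used")
    if m["disk_used_bytes"] * 100 >= 90 * m["disk_total_bytes"]:
        warnings.append("90% of disk space used")
    if m["fs_cache_bytes"] < FS_CACHE_MIN_BYTES:
        warnings.append("FS cache space below 40GB")
    if m["iit_connections"] > 30:
        warnings.append("IIT connections > 30")
    if m["swap_in_per_sec"] > 0 or m["swap_out_per_sec"] > 0:
        warnings.append("swapping")
    return warnings


# Hypothetical sample of current metrics.
sample = {
    "connections_used": 95,
    "max_connections": 100,
    "disk_used_bytes": 500,
    "disk_total_bytes": 1000,
    "fs_cache_bytes": 64 * 1024 ** 3,
    "iit_connections": 31,
    "swap_in_per_sec": 0,
    "swap_out_per_sec": 0,
}
active = resource_warnings(sample)
```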

Trend

Slow query logging (pgfouine) -- not part of Ganglia

Database size
TCBS size
Reports partition size
Replication delay
# of pooled connections
# of DB connections
Length and number of idle-in-transaction (IIT) connections
Memory usage for: postgres processes, fs cache
I/O metrics
CPU metrics
query spill-to-disk
response time for a preset query or set of queries
database bloat

Middleware Layer

Status

Trend

UI

Status

Number of nodes (?)

Trend

Number of nodes (?)

Jobs

Status

List of jobs scheduled
- name
- time
- description
- owner
- link to results
Recent failures
- name
- time
- reason
- logs
- blame (e.g. cvs/svn blame?)
- comments

Trend

Executions
Execution durations
Failure times

Socorro:OperationalMetrics
