Flume ElasticSearch WOO Maintenance Page


The Flume WOO project ingests buildbot logs in real time into HDFS/Hive and ElasticSearch via Flume. This page lists the machines involved, the installed software, and the steps to restart services.

ElasticSearch cluster:
elasticsearch1.metrics.sjc1.mozilla.com (master)
elasticsearch2.metrics.sjc1.mozilla.com (slave)
elasticsearch3.metrics.sjc1.mozilla.com (slave)

Symptom: Nagios ElasticSearch alert indicates one or more machines are down.
Fix: Log in to the affected machine(s) and kill all running elasticsearch processes (find them with ps ax|grep elasticsearch).
If the problem persists, stop and start the elasticsearch service across the entire cluster: log in to each machine and kill all running elasticsearch processes.
Restart the services in the following order: elasticsearch1, elasticsearch2, elasticsearch3.
Restart command: /usr/lib/es/bin/elasticsearch
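
The process lookup and restart order above can be sketched in shell. This is a minimal sketch, not the team's tooling: the function names pids_matching and es_restart_plan are hypothetical, and the script only prints information (it kills and starts nothing). One reason for the helper is that a plain ps ax|grep elasticsearch also lists the grep process itself, whose PID is not worth killing.

```shell
#!/bin/sh
# Read `ps ax` output on stdin and print the PIDs of lines that mention
# the given name, skipping the grep process that the pipeline itself adds.
# (pids_matching is a hypothetical helper name.)
pids_matching() {
    grep -- "$1" | grep -v 'grep' | awk '{print $1}'
}

# Print the cluster restart order with the command to run on each host.
# (es_restart_plan is a hypothetical helper name; it runs nothing.)
es_restart_plan() {
    for n in 1 2 3; do
        echo "elasticsearch$n.metrics.sjc1.mozilla.com: /usr/lib/es/bin/elasticsearch"
    done
}

# Example on the local machine (prints PIDs only):
#   ps ax | pids_matching elasticsearch
```

Review the printed PIDs before passing any of them to kill.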

Please email aphadke@mozilla.com or deinspanjer@mozilla.com if the problem persists.


Flume cluster:
elasticsearch3.metrics.sjc1.mozilla.com (master)
elasticsearch4.metrics.sjc1.mozilla.com (node-collector)
elasticsearch5.metrics.sjc1.mozilla.com (node-agent)

Symptom: Nagios Flume alert indicates a given machine is down.
Hostname: elasticsearch4.metrics.sjc1.mozilla.com
Stop Flume: /usr/lib/flume/bin/flume-daemon.sh stop
Confirm Flume has stopped (ps ax|grep flume); if it has not, kill -9 the pid.
Start Flume: /usr/lib/flume/bin/flume-daemon.sh start node_nowatch -n elasticsearch4.metrics.sjc1.mozilla.com
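
The stop, confirm, and start sequence above can be sketched as one shell function. This is a sketch under stated assumptions rather than an official script: restart_flume, run, the DRY_RUN switch, and the 5-second wait are all hypothetical additions; only the flume-daemon.sh path and its arguments come from this page.

```shell
#!/bin/sh
# Path taken from the steps above; override only for testing.
FLUME_DAEMON="${FLUME_DAEMON:-/usr/lib/flume/bin/flume-daemon.sh}"

# Run a command, or just print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop Flume, force-kill anything left over, then start it again.
# Arguments are passed to "flume-daemon.sh start" unchanged.
# Caution: save this under a file name that does not contain "flume",
# otherwise the process match below would include the script itself.
restart_flume() {
    run "$FLUME_DAEMON" stop
    # Assumed grace period; the runbook does not specify one.
    [ "${DRY_RUN:-0}" = "1" ] || sleep 5
    # The [f] keeps grep from matching its own command line.
    for pid in $(ps ax | grep '[f]lume' | awk '{print $1}'); do
        run kill -9 "$pid"
    done
    run "$FLUME_DAEMON" start "$@"
}

# Collector node, as in the steps above (drop DRY_RUN=1 to execute):
#   DRY_RUN=1 restart_flume node_nowatch -n elasticsearch4.metrics.sjc1.mozilla.com
```

Running with DRY_RUN=1 first shows exactly which commands would be issued, which is a cheap check before touching a live node.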

Symptom: Nagios Flume alert indicates a given machine is down.
Hostname: elasticsearch5.metrics.sjc1.mozilla.com
Stop Flume: /usr/lib/flume/bin/flume-daemon.sh stop
Confirm Flume has stopped (ps ax|grep flume); if it has not, kill -9 the pid.
Start Flume: /usr/lib/flume/bin/flume-daemon.sh start node -n elasticsearch5.metrics.sjc1.mozilla.com


Symptom: Nagios Flume alert indicates a given machine is down.
Hostname: elasticsearch3.metrics.sjc1.mozilla.com
Resolution: Please email aphadke@mozilla.com (213-509-0575) or deinspanjer@mozilla.com. While we can restart the Flume master, a master going down might indicate deeper problems. Given how immature Flume still is, it's best to investigate further rather than simply restarting it.
Restart instructions, for reference:
Stop Flume: /usr/lib/flume/bin/flume-daemon.sh stop
Confirm Flume has stopped (ps ax|grep flume); if it has not, kill -9 the pid.
Start Flume: /usr/lib/flume/bin/flume-daemon.sh start master
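
The three Flume hosts differ only in the arguments passed to flume-daemon.sh start, which can be summarized as a lookup. This is an illustrative sketch: flume_start_args is a hypothetical helper name, and the mapping simply restates the commands listed on this page.

```shell
#!/bin/sh
# Print the "flume-daemon.sh start" arguments for a Flume host,
# using the roles and commands listed on this page.
# (flume_start_args is a hypothetical helper name.)
flume_start_args() {
    case "$1" in
        elasticsearch3.metrics.sjc1.mozilla.com) echo "master" ;;
        elasticsearch4.metrics.sjc1.mozilla.com) echo "node_nowatch -n $1" ;;
        elasticsearch5.metrics.sjc1.mozilla.com) echo "node -n $1" ;;
        *) echo "unknown Flume host: $1" >&2; return 1 ;;
    esac
}

# Example: show the full start command for the local machine.
#   echo /usr/lib/flume/bin/flume-daemon.sh start $(flume_start_args "$(hostname -f)")
```

For the master, remember the guidance above: contact the owners before restarting rather than running the command straight away.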