- Triage bugs
HBase server load after the 1.7 push was significantly higher than pre-1.7 levels, high enough that we were getting region timeouts a few times a week, which caused processors to fail. There are two main reasons for the high load on the cluster:
- Too many regions per server -- This is due in part to the initial settings we had in production prior to 1.7. We have since improved those settings and are now creating larger regions for new data, but we still have a lot of old data that must be migrated or manipulated to bring the total region count down. While HBase can certainly handle this volume of data, we don't currently have enough nodes to do so comfortably.
- The WAL (write-ahead log) HLog files, which store new data before it is flushed out to the cluster and replicated, were rolling over too quickly. These files were capped at 2MB, which meant we were rolling the logs several times a second. On Monday, we restarted the cluster with the HLog file size increased to 16MB, which had a large beneficial impact on system load. The tradeoff is a slightly higher chance of losing a small number of crash reports (worst case, around 400 raw reports) in the event of a region server crash. This risk will go away with the next major version of HBase, which provides better append support.
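The 2MB-to-16MB change above is made in hbase-site.xml. A minimal sketch, assuming the property name used by HBase in this era (`hbase.regionserver.hlog.blocksize` sets the target WAL file size, and logs roll at a fraction of it set by `hbase.regionserver.logroll.multiplier`); the exact property names and values should be verified against the deployed version:

```xml
<!-- hbase-site.xml (sketch; reflects the change described above) -->
<property>
  <name>hbase.regionserver.hlog.blocksize</name>
  <!-- 16MB WAL files instead of the old 2MB, so logs roll far less often -->
  <value>16777216</value>
</property>
```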
On Tuesday, we brought the cluster down again to upgrade the Hadoop and HBase libraries to the latest official stable versions. This also restored our ability to stop a single region server node if needed. Unfortunately, there were two regressions in the updated configuration files and one missed step in the upgrade process. This increased the downtime from an estimated 30 minutes to about 50, and it also caused a couple of region servers to fail this morning. The missing config setting was restored and we did a rolling restart of the cluster, with only a very short interruption in processing (two or three minutes) while one of the region servers restarted.
The next big step is to reduce the total number of regions in the cluster by migrating/merging our old data into the new format. We will work on that this week, testing to find a method that has minimal impact on the live system.
We have a bug open to improve our configuration management process so that setting regressions do not impact future upgrades.
We are also updating our operations-notes wiki pages to give IT better descriptions of known problems and the actions that correct them, as well as contact information for escalating to the Metrics team when needed.
- As we all now know, this was a git lock issue
- We will have a separate postmortem - Friday?