Closed Bug 1265003 Opened 8 years ago Closed 8 years ago

Run zookeeper/kafka on hgweb[11-14]

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: gps)

We're currently running ZooKeeper and Kafka on hgweb{1,9,10} and hgssh{2,3}. We should rebalance onto hgweb[11-14] plus hgssh3 (the current master).
This is done.

Unfortunately, I caused an outage in the process. I followed the same steps I used when I added hgssh3 (see https://kafka.apache.org/documentation.html#operations for info about balancing replicas). I'm not sure what happened, but the cluster got into a really bad state. I attempted recovery, but after a few minutes I gave up, nuked /var/lib/{kafka,zookeeper}, and started things from scratch. This is far from the ideal solution.
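For reference, here is a rough sketch of how a replica reassignment plan for this kind of move could be generated. It is not the procedure that was actually run here. It builds the JSON document that Kafka's kafka-reassign-partitions.sh tool accepts via --reassignment-json-file. The broker IDs, topic name, partition count, and replication factor are all made-up values for illustration, and build_reassignment is a hypothetical helper.

```python
import json
from itertools import cycle, islice


def build_reassignment(topic, partition_count, broker_ids, replication_factor):
    """Build a reassignment plan that spreads replicas round-robin over broker_ids."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor exceeds number of brokers")
    partitions = []
    for p in range(partition_count):
        # Start each partition at a different broker so that the first
        # (preferred) replica rotates through the broker list.
        replicas = list(islice(cycle(broker_ids), p, p + replication_factor))
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


# Hypothetical values: none of these IDs, names, or counts come from this bug.
plan = build_reassignment("example-topic", 8, [11, 12, 13, 14, 3], 3)
print(json.dumps(plan, indent=2))
```

Writing the plan to a file and reviewing it by hand before passing it to the tool makes it easier to see exactly which partitions will move to which brokers.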

According to the Kafka docs, there are some new-ish settings we could use to make recovery easier in the future. Hopefully we won't have to do a major rebalance of the cluster any time soon.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED