Closed Bug 773981 Opened 12 years ago Closed 12 years ago

BIND crashes with assertion failure

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

x86_64
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dumitru, Assigned: dumitru)

Details

(Whiteboard: [buildduty][outage])

named on ns2.private.scl3 hit a bug last night:

14-Jul-2012 02:02:25.131 general: critical: rbtdb.c:1619: INSIST(!((void *)((node)->deadlink.prev) != (void *)(-1))) failed
14-Jul-2012 02:02:25.131 general: critical: exiting (due to assertion failure)
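
For anyone who wants to check other resolvers for the same crash, here is a minimal sketch that scans named log lines for assertion failures. It assumes the log format matches the two lines above (timestamp, category, severity, message); the function name and regexes are illustrative, not part of BIND.

```python
import re

# Assumed layout, taken from the excerpt above:
#   <dd-Mon-yyyy hh:mm:ss.mmm> <category>: <severity>: <message>
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<category>[\w-]+): (?P<severity>\w+): (?P<message>.*)$"
)

# Matches messages such as "rbtdb.c:1619: INSIST(...) failed".
ASSERTION = re.compile(
    r"^(?P<file>\S+):(?P<line>\d+): (?P<kind>INSIST|REQUIRE)\(.*\) failed"
)


def find_assertion_failures(lines):
    """Return one dict per critical assertion failure found in the log lines."""
    hits = []
    for raw in lines:
        entry = LOG_LINE.match(raw.strip())
        if not entry or entry["severity"] != "critical":
            continue
        failure = ASSERTION.match(entry["message"])
        if failure:
            hits.append({
                "timestamp": entry["timestamp"],
                "file": failure["file"],
                "line": int(failure["line"]),
                "kind": failure["kind"],
            })
    return hits
```

Feeding it the two log lines above should yield a single hit for rbtdb.c line 1619; the "exiting" line is skipped because it carries no file and line reference.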

bind version is 9.8.2, release 0.10.rc1.el6, OS RHEL 6.3.

First reported on an ISC mailing list:
https://lists.isc.org/pipermail/bind-users/2012-February/086793.html

ISC patched it in bind 9.8.2rc2, per https://lists.isc.org/pipermail/bind-announce/2012-March/000766.html [RT #27738]

Red Hat hasn't shipped an update yet, although it's been 4 months since ISC patched this:

https://bugzilla.redhat.com/show_bug.cgi?id=837165
named crashed again and was restarted.

Just for clarification, running OS version is RHEL 6.2 (x86_64)
(In reply to Adrian Fernandez [:Aj] from comment #1)

> Just for clarification, running OS version is RHEL 6.2 (x86_64)

Yeah, the OS version doesn't matter much; all RHEL 6 flavors that use that named package are affected.
This has happened again in scl1 and has caused some build jobs to fail.
Whiteboard: [buildduty][outage]
Just happened on admin1a.infra.scl1 too:

17-Jul-2012 15:25:46.262 general: critical: rbtdb.c:1619: INSIST(!((void *)((node)->deadlink.prev) != (void *)(-1))) failed
17-Jul-2012 15:25:46.262 general: critical: exiting (due to assertion failure)

Is this worth building a patched RPM?
For buildduty's benefit:

The first nagios alerts fired at 15:28:

[15:28]  <nagios-releng-scl1> [09] buildbot-master06.build.scl1:MySQL connectivity is WARNING: Unknown MySQL server host buildbot-rw-vip.db.scl3.mozilla.com (1)
[15:30]  <nagios-releng-scl1> [10] buildbot-master15.build.scl1:MySQL connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.

and continued until 15:46:

[15:46]  <nagios-releng-scl1> buildbot-master21.build.scl1 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
named restarted on ns2.private.scl3 (again).

However, besides the known bug, it seems odd that this is only occurring on ns2 and not on ns1 as well.
What do you think about either adding monitoring for named processes to nagios, or (better) monitoring the named daemon in keepalived so that the VIP fails over when this occurs? Apologies if I'm not being helpful.
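
A rough sketch of what such a check could look like, assuming a plain UDP A-record query is a good enough liveness test. It uses only the Python standard library; the function names and the probe hostname are hypothetical. Nagios plugins report OK with exit status 0 and CRITICAL with 2, and a keepalived track script treats any non-zero exit as a failure, so the same script could in principle serve both.

```python
import socket
import struct
import sys


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet (A record, class IN, recursion desired)."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.strip(".").split(".")
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def named_is_answering(server, name, timeout=2.0, port=53):
    """Return True if the server sends back any DNS response to our query."""
    query = build_query(name)
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
    except OSError:  # covers timeouts and network errors
        return False
    # Same transaction ID, and the QR bit marks the packet as a response.
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)


def nagios_status(answering):
    """Map the probe result to a Nagios-style (exit code, message) pair."""
    if answering:
        return 0, "OK: named is answering"
    return 2, "CRITICAL: named is not answering"


if __name__ == "__main__":
    # Hypothetical defaults: probe the local named for a name it should resolve.
    code, message = nagios_status(named_is_answering("127.0.0.1", "example.com"))
    print(message)
    sys.exit(code)
```

Querying the daemon rather than just checking for a named process would also catch the case where named is running but not responding.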
I filed a case with Red Hat to address this.
http://rhn.redhat.com/errata/RHBA-2012-1107.html
Assignee: server-ops-infra → dgherman
It seems that puppet upgraded this across our infra.
I verified a couple of hosts, and they have the new package.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: Infrastructure → Infrastructure: Other
Product: mozilla.org → Infrastructure & Operations