Closed Bug 666487 Opened 13 years ago Closed 13 years ago

DNS resolution in scl1 is slow

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: dustin, Assigned: arich)

Details

I'm saying this based on IRC conversations:
 - ganglia1.build.scl1 wasn't talking to NTP; solution: nscd!
 - ns1 and ns2 are forwarding to mradm01

This has some other fallout I'm working through, too.
I'm blaming bug 666486 (slavealloc slowness) on this, as predicted by Amy.
I suspect bug 570192 is related, too, but I think that the right fix for that bug is to fix the tests and merge that fix everywhere, so I'll keep my mouth shut.
The slowness issues that I noted yesterday no longer exist today (or are at least below the threshold at which I would notice them).  Dustin said he didn't have a way to test whether other services have recovered similarly.


In checking the config on the ns servers, the mradm01 forward is limited to the build VPN IP (which I'm not sure gets used anywhere in scl1, so I'm not sure what the point of this ACL is).

acl "build-vpn" {
    10.2.71.4/32;
};

view "vpn1.build" in {
    match-clients { build-vpn; };
    recursion yes;
...
    notify explicit;

    forwarders {
        10.2.74.123;
    };
...
};

view "default" in {
    match-clients { any; };
    recursion yes;
...

    forwarders {
        10.2.74.125;
        10.2.74.127;
    };
...
};
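
To make the ACL question concrete: BIND checks views in the order they appear and uses the first one whose match-clients accepts the client address. Below is a minimal Python sketch of that selection. It assumes the two views shown above are the only ones (the real config is elided), uses 0.0.0.0/0 as a stand-in for `any`, and the names VIEWS and select_view are made up for illustration.

```python
import ipaddress

# Views in the order they appear in the excerpt above: (name, client networks, forwarders).
# "0.0.0.0/0" stands in for BIND's "any" (an assumption; "any" also covers IPv6).
VIEWS = [
    ("vpn1.build", ["10.2.71.4/32"], ["10.2.74.123"]),
    ("default", ["0.0.0.0/0"], ["10.2.74.125", "10.2.74.127"]),
]


def select_view(client_ip):
    """Return (view name, forwarders) for the first view matching client_ip, else None."""
    addr = ipaddress.ip_address(client_ip)
    for name, networks, forwarders in VIEWS:
        if any(addr in ipaddress.ip_network(net) for net in networks):
            return name, forwarders
    return None


# The build VPN address is the only client that lands in the vpn1.build view.
print(select_view("10.2.71.4"))
# Any other client (this address is hypothetical) falls through to the default view.
print(select_view("10.12.48.10"))
```

Under those assumptions, only queries arriving from 10.2.71.4 would use the 10.2.74.123 forwarder; everything else goes through the default view's forwarders.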

I still think it makes sense to have full slaves in each datacenter to avoid doing forwarded lookups across inter-datacenter links if possible.
As a severity yardstick: aside from slavealloc and the ganglia server, this did not seem to cause failures in production, although it likely caused a performance degradation that we just didn't look closely enough to see.
Long-shot question: could this be the source of the hg clone tools issue I saw earlier today?
I doubt it, but I can't find that bug now - what was the #?
Based on the errors you saw, it doesn't look at all like the same problem.  Instead of getting timeouts, you were getting corrupted file errors.

Based on the fact that I can no longer reproduce the behavior I was seeing (and no one else has provided a test that shows the problem still exists), I'm going to close this bug and address the overall DNS architecture in scl1 in other project-related bugs instead.
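
For anyone who wants to re-test later, here is a minimal sketch of a lookup timer in Python. It is not something that was run for this bug. It goes through the system resolver via socket.getaddrinfo, so a local cache such as nscd may hide slow upstream lookups. The function names and the default of five attempts are arbitrary choices for illustration.

```python
import socket
import statistics
import time


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Time one lookup; return (seconds elapsed, whether it succeeded)."""
    start = time.perf_counter()
    try:
        resolver(hostname, None)
        ok = True
    except socket.gaierror:
        # Resolution failures are counted rather than raised.
        ok = False
    return time.perf_counter() - start, ok


def summarize(hostname, attempts=5, resolver=socket.getaddrinfo):
    """Run several lookups and report failure count plus median and worst-case time."""
    samples = [time_lookup(hostname, resolver) for _ in range(attempts)]
    durations = [duration for duration, _ in samples]
    return {
        "host": hostname,
        "attempts": attempts,
        "failures": sum(1 for _, ok in samples if not ok),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }


# Example, run from a host inside scl1:
#   print(summarize("ganglia1.build.scl1"))
```

Comparing median_s and max_s from a host in scl1 against a host in another datacenter would give a rough number to attach to "slow".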
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations