Closed Bug 792079 Opened 13 years ago Closed 13 years ago

replication failing for ad.mozilla.com

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: q)

Error log messages on dc3 regarding being unable to communicate with dc1 and dc2.
Assignee: server-ops-releng → mlarrain
12:17 < MaRu> dcdiag on dc1 everything passes
12:21 < MaRu> from dc3 everything works for replication BUT it doesn't see dc6/7
12:21 < MaRu> just dc1,2,9
12:22 < MaRu> it doesn't see the ones in SCL1
12:22 < MaRu> sounds like it could be a firewall issue
12:24 < MaRu> and I stand on that ground because the rest of the replication is working
If we get time, let's use this next week as an example of how to diagnose and repair replication. I'll take notes and put 'em in mana, then you can use those notes in an oncall meeting.
This needs to be fixed and documented early next week.
Severity: normal → critical
Well, this seems to still be the case, and is even a little bit worse now.

C:\Users\dmitchell>repadmin /replsum
Replication Summary Start Time: 2012-11-26 19:34:02

Beginning data collection for replication summary, this may take awhile:
  ..........

Source DSA          largest delta    fails/total %%   error
 DC1                      43m:20s    0 /  17    0
 DC2                      44m:57s    0 /  14    0
 DC3                      44m:57s    0 /  11    0
 DC6                      43m:19s    0 /  14    0
 DC7                      44m:57s    0 /  14    0
 DC8                      44m:57s    0 /  15    0
 DC9              17d.00h:49m:12s    4 /  11   36  (8524) The DSA operation is unable to proceed because of a DNS lookup failure.

Destination DSA     largest delta    fails/total %%   error
 DC1                      44m:58s    0 /  17    0
 DC2                      43m:21s    0 /  14    0
 DC3                      30m:30s    0 /  11    0
 DC6              17d.00h:49m:13s    4 /  18   22  (8524) The DSA operation is unable to proceed because of a DNS lookup failure.
 DC7                      42m:09s    0 /  14    0
 DC8                      42m:14s    0 /  11    0
 DC9                      41m:13s    0 /  11    0
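For anyone who wants to watch for this condition automatically, here is a minimal Python sketch that picks the failing rows out of `repadmin /replsum` text. It assumes the one-row-per-DSA layout shown in this bug, with any error message on the same line as the row; the names `ReplRow`, `parse_replsum`, `failing`, and `SAMPLE` are made up for this example and are not part of repadmin.

```python
import re
from typing import List, NamedTuple


class ReplRow(NamedTuple):
    section: str        # "Source" or "Destination"
    dsa: str            # domain controller name, e.g. "DC9"
    largest_delta: str  # kept as text, e.g. "17d.00h:49m:12s"
    fails: int
    total: int
    error: str          # empty when the row reports no error


# Header lines such as "Source DSA  largest delta  fails/total %%  error"
_HEADER = re.compile(r"^\s*(Source|Destination) DSA\b")
# Data rows such as "DC9  17d.00h:49m:12s  4 / 11  36  (8524) The DSA ..."
_ROW = re.compile(
    r"^\s*(?P<dsa>\S+)\s+(?P<delta>\S+)\s+"
    r"(?P<fails>\d+)\s*/\s*(?P<total>\d+)\s+(?P<pct>\d+)\s*(?P<error>.*)$"
)


def parse_replsum(text: str) -> List[ReplRow]:
    """Parse the Source/Destination tables out of replsum output."""
    rows = []
    section = None
    for line in text.splitlines():
        header = _HEADER.match(line)
        if header:
            section = header.group(1)
            continue
        row = _ROW.match(line)
        if section and row:
            rows.append(ReplRow(
                section, row["dsa"], row["delta"],
                int(row["fails"]), int(row["total"]), row["error"].strip(),
            ))
    return rows


def failing(rows: List[ReplRow]) -> List[ReplRow]:
    """Return only the rows that report at least one failed link."""
    return [r for r in rows if r.fails > 0]


# A few rows copied from the output in this bug.
SAMPLE = """\
Source DSA          largest delta    fails/total %%   error
 DC1                      43m:20s    0 /  17    0
 DC9              17d.00h:49m:12s    4 /  11   36  (8524) The DSA operation is unable to proceed because of a DNS lookup failure.

Destination DSA     largest delta    fails/total %%   error
 DC6              17d.00h:49m:13s    4 /  18   22  (8524) The DSA operation is unable to proceed because of a DNS lookup failure.
 DC9                      41m:13s    0 /  11    0
"""

if __name__ == "__main__":
    for r in failing(parse_replsum(SAMPLE)):
        print(r.section, r.dsa, f"{r.fails}/{r.total}", r.error)
```

Feeding it the full output above would flag DC9 as a source and DC6 as a destination, which is the pair that stopped replicating.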
We failed the releng.ad.mozilla.com domain-specific FSMO roles over to DC9 (they were on DC6). Now they won't fail back. I suspect this is related to the above error.
Summary: DFS replication failing for ad.mozilla.com → replication failing for ad.mozilla.com
Here's the failure to fail back:

fsmo maintenance: transfer pdc
ldap_modify_sW error 0x34(52 (Unavailable).
Ldap extended error message is 000020AF: SvcErr: DSID-03210581, problem 5002 (UNAVAILABLE), data 8524
Win32 error returned is 0x20af(The requested FSMO operation failed. The current FSMO holder could not be contacted.)
)
Depending on the error code this may indicate a connection, ldap, or role transfer error.
Server "dc6" knows about 5 roles
Schema - CN=NTDS Settings,CN=DC1,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=ad,DC=mozilla,DC=com
Naming Master - CN=NTDS Settings,CN=DC1,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=ad,DC=mozilla,DC=com
PDC - CN=NTDS Settings,CN=DC9,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=ad,DC=mozilla,DC=com
RID - CN=NTDS Settings,CN=DC9,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=ad,DC=mozilla,DC=com
Infrastructure - CN=NTDS Settings,CN=DC9,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=ad,DC=mozilla,DC=com
fsmo maintenance: c
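As a quick way to read the role listing, a small Python sketch can map each FSMO role to the DC named in its NTDS Settings DN. It assumes the `Role - CN=NTDS Settings,CN=<server>,...` shape seen in the output here; `fsmo_holders` and `SAMPLE_ROLES` are illustrative names, not part of ntdsutil.

```python
import re
from typing import Dict

# Matches e.g. "PDC - CN=NTDS Settings,CN=DC9," and captures role and server.
_ROLE = re.compile(
    r"(Schema|Naming Master|PDC|RID|Infrastructure)"
    r" - CN=NTDS Settings,CN=([^,]+),"
)


def fsmo_holders(text: str) -> Dict[str, str]:
    """Return {role name: server name} for every role line found in text."""
    return {role: server for role, server in _ROLE.findall(text)}


# Role lines as reported by dc6 in this bug (DN tails shortened with a
# Python line continuation only; the values are unchanged).
_TAIL = ",CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=ad,DC=mozilla,DC=com"
SAMPLE_ROLES = "\n".join([
    "Schema - CN=NTDS Settings,CN=DC1" + _TAIL,
    "Naming Master - CN=NTDS Settings,CN=DC1" + _TAIL,
    "PDC - CN=NTDS Settings,CN=DC9" + _TAIL,
    "RID - CN=NTDS Settings,CN=DC9" + _TAIL,
    "Infrastructure - CN=NTDS Settings,CN=DC9" + _TAIL,
])

if __name__ == "__main__":
    for role, server in fsmo_holders(SAMPLE_ROLES).items():
        print(f"{role}: {server}")
```

On the listing in this comment it shows the three domain-specific roles (PDC, RID, Infrastructure) still sitting on DC9, which is the state the failed transfer was meant to change.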
The above is after I fixed DNS to point the PDC role toward DC9:

(moz)dustin@euclid ~/code/moz/t/dnsconfig $ svn diff -c 53222
Index: zones/mozilla.com/ad/SOA
===================================================================
--- zones/mozilla.com/ad/SOA (revision 53221)
+++ zones/mozilla.com/ad/SOA (revision 53222)
@@ -1,7 +1,7 @@
 $TTL 300
 @ IN SOA ns.mozilla.org. noc.mozilla.com. (
- 2012092502
+ 2012112601
 10800
 3600
 604800
Index: zones/mozilla.com/ad/private
===================================================================
--- zones/mozilla.com/ad/private (revision 53221)
+++ zones/mozilla.com/ad/private (revision 53222)
@@ -110,7 +110,6 @@
 releng.ad.mozilla.com. 600 IN A 10.22.69.18
 _ldap._tcp.releng.ad.mozilla.com. 600 IN SRV 0 100 389 dc6.releng.ad.mozilla.com.
 _ldap._tcp.Default-First-Site-Name._sites.releng.ad.mozilla.com. 600 IN SRV 0 100 389 dc6.releng.ad.mozilla.com.
-_ldap._tcp.pdc._msdcs.releng.ad.mozilla.com. 600 IN SRV 0 100 389 dc6.releng.ad.mozilla.com.
 _ldap._tcp.c98bae82-3d1c-42d8-b8a7-668e672d88a6.domains._msdcs.ad.mozilla.com. 600 IN SRV 0 100 389 dc6.releng.ad.mozilla.com.
 d7b21ce7-b91d-492f-952e-2a34682ed9a8._msdcs.ad.mozilla.com. 600 IN CNAME dc6.releng.ad.mozilla.com.
 _kerberos._tcp.dc._msdcs.releng.ad.mozilla.com. 600 IN SRV 0 100 88 dc6.releng.ad.mozilla.com.
@@ -205,6 +204,7 @@
 _kerberos._udp.releng.ad.mozilla.com. 600 IN SRV 0 100 88 DC9.releng.ad.mozilla.com.
 _kpasswd._tcp.releng.ad.mozilla.com. 600 IN SRV 0 100 464 DC9.releng.ad.mozilla.com.
 _kpasswd._udp.releng.ad.mozilla.com. 600 IN SRV 0 100 464 DC9.releng.ad.mozilla.com.
+_ldap._tcp.pdc._msdcs.releng.ad.mozilla.com. 600 IN SRV 0 100 389 DC9.releng.ad.mozilla.com.
 ;
 ; non-dc services
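Since these SRV records are maintained by hand, it may help to check the zone data before committing. The following Python sketch parses `name ttl IN SRV priority weight port target` lines in the form used in this zone file and reports which host the `_ldap._tcp.pdc._msdcs.<domain>` record points at. `SrvRecord`, `parse_srv`, `pdc_targets`, and `ZONE_LINES` are hypothetical names for this example, and the parser only handles the single-line record layout shown in the diff.

```python
from typing import Iterable, List, NamedTuple


class SrvRecord(NamedTuple):
    name: str
    ttl: int
    priority: int
    weight: int
    port: int
    target: str


def parse_srv(lines: Iterable[str]) -> List[SrvRecord]:
    """Collect SRV records; other record types and comments are skipped."""
    records = []
    for line in lines:
        # Drop ';' comments, then split on whitespace.
        fields = line.split(";", 1)[0].split()
        if len(fields) == 8 and fields[2].upper() == "IN" and fields[3].upper() == "SRV":
            name, ttl, _, _, priority, weight, port, target = fields
            # DNS names are case-insensitive, so normalise to lower case.
            records.append(SrvRecord(
                name.lower(), int(ttl), int(priority),
                int(weight), int(port), target.lower(),
            ))
    return records


def pdc_targets(records: Iterable[SrvRecord], domain: str) -> List[str]:
    """Return the hosts advertised as PDC emulator for the given domain."""
    wanted = f"_ldap._tcp.pdc._msdcs.{domain.rstrip('.').lower()}."
    return [r.target for r in records if r.name == wanted]


# A few lines from zones/mozilla.com/ad/private as of revision 53222.
ZONE_LINES = [
    "releng.ad.mozilla.com. 600 IN A 10.22.69.18",
    "_ldap._tcp.releng.ad.mozilla.com. 600 IN SRV 0 100 389 dc6.releng.ad.mozilla.com.",
    "_kpasswd._udp.releng.ad.mozilla.com. 600 IN SRV 0 100 464 DC9.releng.ad.mozilla.com.",
    "_ldap._tcp.pdc._msdcs.releng.ad.mozilla.com. 600 IN SRV 0 100 389 DC9.releng.ad.mozilla.com.",
    ";",
    "; non-dc services",
]

if __name__ == "__main__":
    print(pdc_targets(parse_srv(ZONE_LINES), "releng.ad.mozilla.com"))
```

A check like this could compare the reported target against the PDC holder that the DCs themselves report, so the zone file and the actual role placement do not drift apart.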
Matt, can you please start an etherpad with all of the research you've done on this and the solutions you've tried so we can pick up on that without much overlap at a later point?
Assignee: mlarrain → dustin
I suspect that this is also preventing me from adding a DFS root target on DC6. So, I'll need to do that when the replication is fixed.
I had similar problems trying to create the 'apps' link in \\ad\data. If DC2 is one of the root targets, creation fails. If I remove DC2, creation works.
Blocks: 827868
Assignee: dustin → server-ops-releng
Blocks: 798590
Assignee: server-ops-releng → qfortier
All working now. Notes to follow shortly.

C:\Windows\system32>repadmin /replsum
Replication Summary Start Time: 2013-01-24 17:02:47

Beginning data collection for replication summary, this may take awhile:
  ..........

Source DSA          largest delta    fails/total %%   error
 DC1                      11m:23s    0 /  17    0
 DC2                      12m:06s    0 /  14    0
 DC3                      11m:21s    0 /  11    0
 DC6                      06m:31s    0 /  14    0
 DC7                      12m:06s    0 /  14    0
 DC8                      11m:21s    0 /  11    0
 DC9                      12m:06s    0 /  11    0

Destination DSA     largest delta    fails/total %%   error
 DC1                      11m:21s    0 /  17    0
 DC2                      06m:31s    0 /  14    0
 DC3                      11m:23s    0 /  11    0
 DC6                      12m:06s    0 /  14    0
 DC7                      06m:26s    0 /  14    0
 DC8                      11m:08s    0 /  11    0
 DC9                      05m:37s    0 /  11    0
Issue: DFS failing to replicate on DC1 - 3
Root cause: Issues with the DFS setup on DC2, time drift on DC2, and DC1 DNS settings
Symptoms: DFS errors in the event viewer, dfsdiag (CLI tool), and repadmin (CLI tool), and files and accounts not appearing on DC2.

I started checking on DC2 and was unable to log in with my new AD credentials. The admin account was reporting that the system time was off. After resetting the time I was able to log in. After logging in I noticed DFS errors in the event viewer. I removed the DFS namespace from DC2 and rebuilt it, and the DC started replicating. I checked all three servers in the event viewer and via UNC path (\\dc1.mozilla.com\data\apps, etc). Things looked clean for the ad.mozilla.com domain.
Issue: DFS failing to replicate on DC6 - 9
Root cause: DNS issues on DC6 and, via forwarders, on DC1
Symptoms: DFS errors in the event viewer, dfsdiag (CLI tool), and repadmin (CLI tool), and files and accounts not appearing on DC8 or DC9.

I started checking on DC6, and dfsdiag /TestDcs reported errors on DC8 and DC9. I was unable to ping either DC8 or DC9 from the command line. However, nslookup directly against ns1 and ns2 worked, and other machines resolved fine. After some discussion in IRC with (and helpful troubleshooting from) arr and dustin, I noticed that ipconfig showed the IPv6 settings had a hard-coded DNS server of ::1, which refers to localhost. The local DNS server was forwarding to DC1, which referred to itself for the SOA of releng.ad.mozilla.com, and DC1 had no entries for DC8 or DC9, causing replication to fail. I set the DNS settings to automatic in the IPv6 settings of the "adapter settings" control panel. I then forced a sync on all DCs and reran repadmin and dfsdiag with no errors. After a few hours the event viewer showed no new errors, only informational messages about functioning replication.
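The telling symptom here was that ping failed while nslookup against ns1 and ns2 worked: ping uses the system resolver (which was pointed at ::1), while nslookup was querying specific servers. A minimal Python sketch of the same check, run on the affected DC, asks the system resolver for each DC name and lists the ones it cannot resolve. `unresolvable` is a made-up helper name, and `dc8.releng.ad.mozilla.com` is an assumed FQDN (only the dc6 and DC9 names appear in this bug).

```python
import socket
from typing import Callable, Dict, Iterable


def unresolvable(
    hostnames: Iterable[str],
    resolve: Callable = socket.getaddrinfo,
) -> Dict[str, str]:
    """Return {hostname: error text} for names the resolver cannot look up.

    By default this goes through the operating system's resolver, the same
    path ping uses, so it reflects the adapter's configured DNS servers.
    The resolve parameter exists so the function can be tested offline.
    """
    failed = {}
    for host in hostnames:
        try:
            resolve(host, None)
        except socket.gaierror as exc:
            failed[host] = str(exc)
    return failed


# Example use on a domain controller (performs real DNS lookups):
#
#   bad = unresolvable([
#       "dc6.releng.ad.mozilla.com",
#       "dc8.releng.ad.mozilla.com",   # assumed name, not taken from the bug
#       "dc9.releng.ad.mozilla.com",
#   ])
#   for host, err in bad.items():
#       print(host, "->", err)
```

If this reports failures for names that nslookup can resolve against the authoritative servers, the adapter's DNS server settings (IPv4 and IPv6) are the next thing to inspect.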
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations