Closed Bug 713366 Opened 13 years ago Closed 13 years ago

Some MDN urls not working through pm-dekiwiki02 and pm-dekiwiki03

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cshields, Assigned: nmaul)

I'm sure there are other problematic pages, but one good test has been https://developer.mozilla.org/en/Mozilla_CSS_support_chart

With 02 and 03 in the pool, this page is unreachable. Oddly enough, it works through 01, and the rest of MDN appears to work fine on 02 and 03.
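
For anyone who wants to reproduce this per node, here is a minimal sketch of how one might request the test page from each backend directly while still sending the public Host header. It assumes the backends answer plain HTTP on port 80 under the short names used in this bug; the hostnames, the port, and the helper names `http_fetch` and `probe_backends` are illustrative assumptions, not the actual setup.

```python
import urllib.request
import urllib.error

# Assumed backend names, taken from the short names used in this bug.
BACKENDS = ["pm-dekiwiki01", "pm-dekiwiki02", "pm-dekiwiki03"]
SITE_HOST = "developer.mozilla.org"
TEST_PATH = "/en/Mozilla_CSS_support_chart"


def http_fetch(backend, path, host=SITE_HOST, timeout=10):
    """Request `path` from one backend, sending the public Host header.

    Returns the HTTP status code, or None on timeout or connection failure.
    """
    req = urllib.request.Request(f"http://{backend}{path}", headers={"Host": host})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 500, 504).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return None


def probe_backends(path, backends=BACKENDS, fetch=http_fetch):
    """Map each backend name to True if it served `path` with a 200."""
    return {name: fetch(name, path) == 200 for name in backends}
```

Calling `probe_backends(TEST_PATH)` from a host that can reach the nodes would show which of them serve the page; `fetch` is injectable so the logic can be exercised without touching the network.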

To get through the weekend I've set 01 as authoritative for MDN, and 02/03 are failover only.

Not sure if this is a query timing out or something else, but since MDN is in a working state with regard to this issue, and this has already cost people one holiday, I'm happy to leave it until Tuesday.
I've turned 2 and 3 back on, and set 1 to draining. This seems to be working again.

I'm guessing the problem is once again tied to Bugzilla<->MDN interaction (Bug 712237)... this page does use the {{bugzilla}} template. I'm guessing it worked at one point only because for some reason 01 was working then but 02 and 03 weren't.

Right now none of them are (MDN folks changed the template to not actually call Bugzilla), so calling this fixed is a bit disingenuous. Once that other bug is fixed, I suspect this will be fixed (for real) as well.


The other possibility is that there is something wrong with the Lucene indexing service running on 01, which 02 and 03 query. I'm skeptical of this because it's working now and that hasn't changed, and other pages seem to work fine as well. On top of that, if Lucene were somehow broken, then 01 should be problematic too, since it simply queries itself. The only way this could be at fault would be if the nodes could somehow no longer talk to each other, which seems pretty unlikely. None of our recent issues have affected SJC1 nodes directly... only PHX1 nodes (of which MDN has zero).

Given that we have a known issue in MDN->Bugzilla linkage, and this page uses Bugzilla queries, I think our best bet is to eliminate that first.
Depends on: 712237
02 and 03 were disabled again earlier today. Leaving them that way for now.

The MDN-Bugzilla issue is likely resolved now. In my testing, for some reason pm-dekiwiki03 was affected much more severely than the other two, and that may have contributed to problems with it. I'm not sure if or how this translated to problems on 02, as in my testing it was much less affected.

Tomorrow we'll try them again, so that if they fail I'll hopefully be around to catch it myself.
Are 02 and 03 still having trouble talking to bugzilla, or is their problem something else? I'm concerned about that, of course.

Also, simply tweaking the bug template not to talk to bugzilla isn't a great long-term solution. Is the actual connectivity problem fixed, or is this "fixed" only because we've temporarily disabled the connection attempts?
All 3 are talking to Bugzilla fine now. It's okay to revert the template back to normal if you'd like.

I'm going to implement a deki check, like the one django has, before putting 02 and 03 back in the LB.
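
To illustrate the kind of check meant here (this is not the actual check, just a rough sketch), a load balancer health probe could require that a node serves a couple of canary pages quickly before it receives traffic. The canary paths, the 5-second limit, and the `node_is_healthy` name are all assumptions made for the example; the second path is included only because it was the page that failed in this bug.

```python
import time

# Hypothetical canary pages: one plain page and the page from this bug,
# which exercises the {{bugzilla}} template.
CANARY_PATHS = ["/en/", "/en/Mozilla_CSS_support_chart"]


def node_is_healthy(fetch, paths=CANARY_PATHS, max_seconds=5.0, clock=time.monotonic):
    """Return True only if every canary path returns 200 within max_seconds.

    `fetch(path)` is a caller-supplied function that returns an HTTP
    status code, or None if the request failed outright.
    """
    for path in paths:
        started = clock()
        status = fetch(path)
        elapsed = clock() - started
        # A wrong status or a slow answer both count as unhealthy.
        if status != 200 or elapsed > max_seconds:
            return False
    return True
```

Checking a template-heavy page as well as a plain one matters in a case like this one, where a node served most of the site but not specific pages.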
So can anyone explain what caused devmo to stop being able to talk to bugzilla in the first place? :)
Sure, that's easy. I think it was covered in other bugs or IRC, but in any case:

When our most recent Zeus trouble started (early December, IIRC), we moved Bugzilla over to SJC1. This was easy; it's already set up for that. This didn't cause you any problems... Bugzilla was originally in SJC1, and all of those rules/routes were still in effect.

While Bugzilla was in SJC1, we moved the load balancers in PHX1 out from behind the PHX1 firewall, in an attempt to improve throughput to them... we switched to local iptables rules instead of a dedicated appliance. When this happened, the rules allowing MDN to access Bugzilla "fell off". There were certain routing things being done by the firewall that now needed to be done by the core routers. Those were overlooked and didn't get implemented. However, with Bugzilla still in SJC1, this didn't result in any immediate problems.

When Bugzilla was migrated back to PHX1, the connection stopped working. This affected more than just MDN, as at least one other app made use of the same rules.



I still plan on enabling 02 and 03 again today, but haven't done it yet.
2 and 3 are turned back up now. The deki check is in place. Closing this bug in favor of Bug 714048.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard