Closed Bug 891906 Opened 11 years ago Closed 11 years ago

mailing-list / newsgroup mirroring is broken

Categories

(Infrastructure & Operations :: Infrastructure: Mail, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dbaron, Assigned: justdave)

Details

(Whiteboard: affected lists listed in comment 12)

https://groups.google.com/forum/#!topic/mozilla.dev.platform/UCio5fB4VJo was posted to dev.platform nearly 24 hours ago but has not appeared to subscribers who read dev-platform as a mailing list.

This is a blocker for communication within the project -- it prevents people from sending and receiving information and knowing who has received information, and needs to be fixed immediately (or at the very least announced as an unexpected outage to all@, etc.)
Assignee: server-ops → infra
Severity: blocker → normal
Component: Server Operations → Infrastructure: Mail
Product: mozilla.org → Infrastructure & Operations
QA Contact: shyam → limed
David,

Thanks for bringing this to our notice. We'll take a look and let you know what we find out.
This appears to be specific to dev-platform: I continue to receive email from dev-planning and a few other lists correctly. As moderator of dev-platform I have checked the mailman subscription info for myself and several others who are not receiving mail, and everyone is subscribed correctly.
That link goes to an entire thread. Was there a specific message that didn't go through, or was it all messages in that thread?
All messages in that thread.
I think it's all messages to dev-platform since 9-July.
OK, I've determined that the news gateway process is skipping this mailing list when checking for new newsgroup messages for some reason.  I haven't yet found any configuration differences between it and any of the other lists to determine why, and it's not logging any errors (it's just not doing it to begin with).

I'm continuing to play with it...
Assignee: infra → justdave
fwiw, the fact that it's outright skipping the group when checking for messages gives me high hopes that the missing messages will all go through at once when we get it fixed...
OK, this appears to be fixed now.  The root cause sickens me. :(

The mozilla.community.hungary group has corrupted pointers on Giganews' servers and, as best as I can tell, has had them since May 4th, 2013 (that's when this appears to have broken).  Giganews is claiming there are 2.15 billion new messages in that newsgroup, and mailman was running out of memory trying to create a data structure to hold the headers for that many messages.  That caused it to crash, so it never synced any of the newsgroups that came after it in the run order of the news gateway script.

There are a *LOT* of incoming messages from the news side getting pulled in and re-sent to the mailing lists right now.  The script is still running.  I'll post back here with a complete list of the affected mailing lists as soon as it's done (there were more than just this one).

The only way we would have caught this is by monitoring mailman's crash logs. This has typically been an unpalatable thing to monitor, because it crashes a lot, and 99% of the crashes are completely innocuous things that we wouldn't actually care about. Alerting on them would only cause us to start ignoring the alerts anyway.
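
To illustrate the failure mode described above, here is a minimal sketch of a gateway loop in which one broken group cannot block the groups after it in the run order. It is not mailman's actual gateway code: `GroupState`, `run_gateway`, `sync_one` and the `MAX_BACKLOG` limit are hypothetical names and values chosen for this example.

```python
from typing import Callable, List, NamedTuple, Tuple

# Assumed sanity limit for illustration; not a real mailman setting.
MAX_BACKLOG = 100_000


class GroupState(NamedTuple):
    name: str        # newsgroup name
    first: int       # lowest article number the server reports
    last: int        # highest article number the server reports
    watermark: int   # last article number already gated to the list


def run_gateway(
    groups: List[GroupState],
    sync_one: Callable[[str, int, int], None],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Sync each group in order; one bad group must not block the rest."""
    synced: List[str] = []
    skipped: List[Tuple[str, str]] = []
    for g in groups:
        start = max(g.watermark + 1, g.first)
        pending = max(0, g.last - start + 1)
        # Refuse an implausible backlog before allocating anything for it.
        if pending > MAX_BACKLOG:
            skipped.append((g.name, f"implausible backlog: {pending} articles"))
            continue
        try:
            if pending:
                sync_one(g.name, start, g.last)
        except Exception as exc:
            # Record the failure and carry on with the next group.
            skipped.append((g.name, repr(exc)))
            continue
        synced.append(g.name)
    return synced, skipped
```

The backlog check runs before any fetch because recovering from an out-of-memory condition after the fact is unreliable; the `try`/`except` covers the remaining per-group errors. The `skipped` list is also something a monitor could alert on.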
FWIW, this was fixed by telling mailman to perform a one-time mass catchup on community-hungary, telling it to ignore those 2.15 billion pending new messages in that group.
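
Conceptually, a catchup moves the gateway's stored position for a group up to whatever the news server currently reports as its newest article, so that nothing older is fetched. A rough sketch of that idea follows; the `watermarks` dictionary and the `catch_up` function are hypothetical stand-ins, not mailman's real data structures.

```python
from typing import Dict


def catch_up(watermarks: Dict[str, int], group: str, server_last: int) -> int:
    """Mark every article up to server_last as already seen.

    Returns how many pending articles were skipped by doing so.
    """
    previous = watermarks.get(group, 0)
    skipped = max(0, server_last - previous)
    # After this, the next gateway run sees no backlog for the group.
    watermarks[group] = server_last
    return skipped
```

Anything genuinely posted between the old position and the new one is never delivered to the list, which is why this suits a group whose reported backlog is known to be bogus.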
It's still running (a lot of catching up to do).

In the meantime we are brainstorming on IRC about ways to detect if this starts failing again in the future.
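
One possible detection approach, sketched here purely as an illustration and not as what was eventually adopted, is to record when each list last synced successfully and flag any list whose record is older than some threshold. The `stale_lists` function, the `last_success` mapping and the six-hour default are all assumptions made for this example.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_lists(
    last_success: Dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=6),  # assumed threshold
) -> List[str]:
    """Return, sorted by name, the lists whose last good sync is too old."""
    return sorted(
        name for name, ts in last_success.items() if now - ts > max_age
    )
```

Unlike alerting on crash logs, this check only fires when a list has actually stopped syncing, so innocuous crashes that do not interrupt the gateway stay quiet.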
OK, it's done.  Of the 214 total mailing lists we have gatewayed to news.mozilla.org, the following 74 lists were affected by this issue:

bugmasters
community-games
community-india
community-ireland
community-mexico
community-switzerland
community-tunisia
community-turkey
dev-apps-bugzilla
dev-apps-calendar
dev-apps-chatzilla
dev-apps-firefox
dev-apps-seamonkey
dev-apps-thunderbird
dev-b2g
dev-builds
dev-gaia
dev-identity
dev-js-sourcemap
dev-l10n
dev-l10n-de
dev-l10n-fa
dev-l10n-in
dev-l10n-new-locales
dev-l10n-pt-br
dev-l10n-sr
dev-l10n-ta
dev-l10n-vi
dev-l10n-web
dev-mdc
dev-mdc-es
dev-mdn
dev-mozilla-org
dev-pdf-js
dev-platform
dev-popcorn
dev-ports-os2
dev-privacy
dev-security-policy
dev-shumway
dev-static-analysis
dev-tech-crypto
dev-tech-dom
dev-tech-js-engine
dev-tech-js-engine-internals
dev-tech-layout
dev-tech-plugins
dev-tech-svg
dev-tech-xml
dev-tech-xpcom
dev-tech-xul
dev-tree-management
dev-webapi
dev-webapps
dev-webdev
general
governance
mozillians
privacy
reps-general
reps-mentors
reps-webdev
support-bugzilla
support-other
support-seamonkey
support-thunderbird
support-webtools
test
tools
tools-l10n
webapps
webmaker
webmaker-canada-bc
wishlist

The missed messages have all been downloaded from the news server and are now spooling out to the mailing lists.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: affected lists listed in comment 12
bug 892051 has been filed to track our progress on coming up with a way to monitor for this in the future.
Could you send an unexpected downtime notice explaining this, so that people understand what happened?  It's important both for the folks on the mailing list side receiving a flood of messages, and for the folks on the newsgroup side who need to understand that everything they've posted to these lists for the past few months has only been read by a part of the expected audience.
I filed a ticket with Giganews this afternoon about the mozilla.community.hungary newsgroup pointers.  They replied that they were unable to resolve the issue without deleting and re-creating the newsgroup from scratch.  As best as I could tell before they did so (from using an NNTP reader client), there was only one real message on that newsgroup anyway (and the pointer issue may have been preventing people from using it).
shyam/justdave: this issue was reported in bug 877134 on 29th May. dbaron reported it again 20 hours ago, and it is now fixed. For future reference, what special magic did he apply to get such prompt and excellent service, that all the people CCed on bug 877134 could use next time they have a discussion forum problem? :-)

Gerv
While I'm not them, I'd note two factors:

 (1) using a bug summary that was a reasonably accurate description of the actual problem.  Also see http://dbaron.org/log/20100426-bug-summary .

 (2) mentioning that the problem was an issue related to newsgroup -> mailing list mirroring rather than a purely mailing list issue (which was never explicitly mentioned in bug 877134, although I'd think it should have been considered)