Closed
Bug 706854
Opened 14 years ago
Closed 14 years ago
email down for many employees
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: akeybl, Assigned: justdave)
Details
(Whiteboard: [Zimbra #00080041])
+++ This bug was initially created as a clone of Bug #701441 +++
Sending/receiving email is again down for at least a few employees (jbonacci, myself).
The last email I received was at 8:51 AM.
Comment 1•14 years ago
Work in progress. Zimbra is backlogged by heavy user activity and we're working on getting things back up ASAP.
Assignee: server-ops → justdave
Updated (Assignee)•14 years ago
Whiteboard: [Zimbra #00080041]
Comment 2 (Assignee)•14 years ago
Current status: it *appears* that the initial slowdown was caused by multiple people mass-deleting a few thousand messages at once, cleaning out mailboxes or something. Inbound mail was blocked up while these deletes were happening, and for the last hour or so it's mostly been buried under mail processing trying to handle the backed up inbound email.
At least part of the problem is caused by us running such an old version of Zimbra. The old version we're running has issues handling some of these operations that have been fixed in newer versions. There's an upgrade in the works; it can't happen soon enough :|
Comment 3•14 years ago
Any updates here? It's been several hours and Zimbra is still insanely slow...
Comment 4 (Assignee)•14 years ago
The mail queue has been catching up slowly. It's still a few hours behind, but is making progress. Just made more changes on the advice of Zimbra's tech support, and am kicking the server to pick them up.
Comment 5 (Assignee)•14 years ago
Reposting here what I just put on the company forum:
We had a large number of "normal" activities happen in a short time period this morning that all converged to bury our Zimbra server. One person did a "run filters on folder" that resulted in 60,000 messages being moved to another folder, while two other people deleted 1800 to 3000 emails from a mailbox. These are normal activities that in general should not bother the system, but we're running an older version of Zimbra which has a few bugs that keep it from handling these situations efficiently, and having several of these operations happen at once just made things worse.
While the mail store was blocked up handling those message move/delete operations, inbound mail was massively slowed down, and the inbound mail queue backed up to about 14,000 messages before things got moving again. Since all of the moves/deletes got cleared out, it's been processing the queue, but it's been very slow going, and 4 hours after the fact the queue has only gotten down to 12,000.
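As a rough sanity check on those numbers, the net drain rate and a naive time-to-empty can be worked out as below. This is only a back-of-the-envelope sketch: it assumes the queue keeps shrinking at the same average net rate it managed over those 4 hours (new inbound mail included), and the function name `hours_to_drain` is made up for illustration, not anything from Zimbra.

```python
def hours_to_drain(remaining: int, drained: int, elapsed_hours: float) -> float:
    """Estimate hours until the queue is empty at the observed net drain rate.

    remaining     -- messages still in the queue
    drained       -- net drop in queue size over the observation window
    elapsed_hours -- length of the observation window, in hours
    """
    if drained <= 0 or elapsed_hours <= 0:
        raise ValueError("queue is not shrinking; no finite estimate")
    rate_per_hour = drained / elapsed_hours
    return remaining / rate_per_hour


# Figures from the status update: ~14,000 at peak, 12,000 after 4 hours.
peak, current, elapsed = 14_000, 12_000, 4
rate = (peak - current) / elapsed
eta = hours_to_drain(current, peak - current, elapsed)
print(f"net drain rate: {rate:.0f} msgs/hour, naive ETA: {eta:.0f} hours")
```

At that pace the backlog would take roughly a full day to clear, which is why the slow progress pointed at something beyond the original burst of moves and deletes.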
Our mail server is already very near capacity, and we already have a second server to add onto it in the works, which will hopefully be ready to go next week sometime. As close to capacity as we are, this morning's convergence of unfortunate situations just all piled up on it and was enough to push it over.
Also planned for the very near future is an upgrade to a newer version of Zimbra, which has fewer of these performance issues with queue handling. This upgrade has been stalled for a while waiting on an additional storage blade to arrive for our staging server, because we want to make sure it works correctly before upgrading production, and the staging server didn't have enough disk space to duplicate the current production environment. The storage blade in question arrived yesterday and got racked this afternoon, and we should have a trial upgrade done this weekend, after which we'll be able to schedule one in production.
Comment 6 (Assignee)•14 years ago
Additional things we're investigating:
- Two of the drives in the primary disk array on the mail server (which contains 12 drives) appear to be underperforming. They haven't failed yet, but automatic corrective measures to prevent failure seem to be kicking in more often than they should be.
- Memory access seems slow, so there's a possibility there might be bad RAM. Still testing that.
Comment 7 (Assignee)•14 years ago
So the disk issues turned out to be at fault. Phong went to the colo and pre-emptively replaced one of the drives that we thought looked like it was acting funny, and over 9000 emails were delivered within the next 5 minutes. We're now entirely caught up. If anyone has issues with email again, please contact us via https://bugzilla.mozilla.org/form.itrequest
And no, the DBZ reference was not intentional. But it's cool anyway! :)
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → mozilla.org Graveyard