Bug 540250 — Synchronization problems with ams-zeus AMO logs between im-log02 and im-log03
Opened 16 years ago · Closed 15 years ago
Product: mozilla.org Graveyard :: Server Operations (task)
Tracking: not tracked
Status: RESOLVED FIXED
Reporter: dre; Assignee: mrz
We are missing a chunk of data for 2010-01-16 between the hours of 05 and 12 PDT.
It looks like logs for zlb01 and zlb02 were not being synced for several hours. Some of them showed up late, but we still seem to be missing some.
I can re-process against im-log02 which appears to have all of the data, but we need to know why it happened and make sure it isn't still happening.
[deinspanjer@cm-metricsetl02 ~]$ (cd /mnt/stats_im-log03/stats/logs/im-log01/addons.mozilla.org; gzip -l access_2010-01-16*.ams-zlb*.gz | awk -f ~/bin/histogram_sum.awk scale=150000000)
01-16-00 - 3.37GB: *************************
01-16-01 - 4.28GB: *******************************
01-16-02 - 4.82GB: ***********************************
01-16-03 - 4.97GB: ************************************
01-16-04 - 5.17GB: **************************************
01-16-05 - 1.84GB: **************
01-16-06 - 5.49GB: ****************************************
01-16-07 - 5.54GB: ****************************************
01-16-08 - 5.66GB: *****************************************
01-16-09 - 5.55GB: ****************************************
01-16-10 - 5.30GB: **************************************
01-16-11 - 1.74GB: *************
01-16-12 - 1.58GB: ************
01-16-13 - 4.19GB: *******************************
01-16-14 - 3.55GB: **************************
01-16-15 - 2.79GB: ********************
01-16-16 - 2.07GB: ***************
01-16-17 - 1.64GB: ************
01-16-18 - 1.40GB: ***********
01-16-19 - 1.24GB: *********
01-16-20 - 1.20GB: *********
01-16-21 - 1.31GB: **********
01-16-22 - 1.62GB: ************
Total counts = 76.34GB
[deinspanjer@cm-metricsetl02 ~]$ (cd /mnt/stats_im-log02/stats/logs/im-log01/addons.mozilla.org; gzip -l access_2010-01-16*.ams-zlb*.gz | awk -f ~/bin/histogram_sum.awk scale=150000000)
01-16-00 - 3.37GB: *************************
01-16-01 - 4.28GB: *******************************
01-16-02 - 4.82GB: ***********************************
01-16-03 - 4.97GB: ************************************
01-16-04 - 5.17GB: **************************************
01-16-05 - 5.39GB: ***************************************
01-16-06 - 5.49GB: ****************************************
01-16-07 - 5.54GB: ****************************************
01-16-08 - 5.66GB: *****************************************
01-16-09 - 5.55GB: ****************************************
01-16-10 - 5.30GB: **************************************
01-16-11 - 4.92GB: ************************************
01-16-12 - 4.61GB: **********************************
01-16-13 - 4.19GB: *******************************
01-16-14 - 3.55GB: **************************
01-16-15 - 2.79GB: ********************
01-16-16 - 2.07GB: ***************
01-16-17 - 1.64GB: ************
01-16-18 - 1.40GB: ***********
01-16-19 - 1.24GB: *********
01-16-20 - 1.20GB: *********
01-16-21 - 1.31GB: **********
01-16-22 - 561.78MB: ****
Total counts = 85.02GB
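For reference, the bucketing done by the awk one-liner above can be sketched in Python. This is a hypothetical reimplementation of what a script like `histogram_sum.awk` appears to do (the actual `~/bin/histogram_sum.awk` is not shown in this bug, and the function and parameter names here are invented): take (filename, uncompressed-size) pairs such as `gzip -l` reports, bucket sizes by the hour embedded in the filename, and print one scaled bar per hour.

```python
# Hypothetical sketch of the hourly histogram shown above. The real
# ~/bin/histogram_sum.awk is not available; this only illustrates the
# bucketing and scaling that its output implies.
import re
from collections import defaultdict

def histogram(files, scale=150_000_000):
    """files: iterable of (filename, uncompressed_bytes) pairs,
    e.g. parsed from `gzip -l` output."""
    buckets = defaultdict(int)
    for name, size in files:
        # access_2010-01-16-05.ams-zlb01... -> bucket key "01-16-05"
        m = re.search(r"access_\d{4}-(\d{2}-\d{2}-\d{2})", name)
        if m:
            buckets[m.group(1)] += size
    lines = []
    total = 0
    for hour in sorted(buckets):
        size = buckets[hour]
        total += size
        bar = "*" * (size // scale)          # one star per `scale` bytes
        lines.append(f"{hour} - {size / 2**30:.2f}GB: {bar}")
    lines.append(f"Total counts = {total / 2**30:.2f}GB")
    return "\n".join(lines)

print(histogram([
    ("access_2010-01-16-05.ams-zlb01.nl.mozilla.com_1.gz", 1_000_000_000),
    ("access_2010-01-16-05.ams-zlb02.nl.mozilla.com_1.gz", 975_000_000),
    ("access_2010-01-16-06.ams-zlb01.nl.mozilla.com_1.gz", 3_000_000_000),
]))
```

Comparing the two runs this way is what makes the gaps obvious: the 05, 11, and 12 buckets on im-log03 are a fraction of the same buckets on im-log02.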
Updated • 16 years ago
Assignee: server-ops → thardcastle
Comment 1 (Assignee) • 16 years ago
im-log03 is missing the following files:
access_2010-01-16-05.ams-zlb01.nl.mozilla.com_1.gz
access_2010-01-16-11.ams-zlb01.nl.mozilla.com_1.gz
access_2010-01-16-12.ams-zlb01.nl.mozilla.com_1.gz
access_2010-01-16-05.ams-zlb02.nl.mozilla.com_1.gz
access_2010-01-16-11.ams-zlb02.nl.mozilla.com_1.gz
access_2010-01-16-12.ams-zlb02.nl.mozilla.com_1.gz
This bug is more about "why didn't it work" than "get logs over to log03" right? Want to make sure I understand the urgency.
Comment 2 • 16 years ago
logs missing on either host should be a blocker, period. caused a bunch of trouble with bad data going out.
Severity: critical → blocker
Comment 3 (Reporter) • 16 years ago
Yes to comment #1 and yes to comment #2.
I reran the old data from im-log02 so there is no need to move files over to im-log03. We only process the most recent files from log03.
I marked it as critical because we currently don't have good detection for this kind of problem, where we have logs but are just missing a few from a particular server.
We have other bugs on queue to enhance our processing to better detect that, but for the moment, we need to get to the bottom of why it happened so it doesn't trip us up again. This exact same problem happened with the ams servers back at the end of November.
Comment 4 (Assignee) • 16 years ago
chizu - looks like ams-zlb03 is the only one consistently getting logs over.
ams-zlb01 & ams-zlb02 haven't sent anything since the access_2010-01-17-00 log.
Comment 5 (Assignee) • 16 years ago
I pinged Daniel and asked him to switch processing over to log02 in the interim.
dmoore will look at this when he's back at a computer.
We hit this before, where two of the boxes were somehow limited in throughput to log03 (which is odd because one isn't and they all go over the same pipe and none of them appear to be limited to log02).
This isn't "we don't have enough bandwidth" - it's some other throughput issue. Neither side has flat lined to indicate less bandwidth.
dmoore, wonder what would happen if you statically routed log03 from AMS out ISC, in case the issue is with Mzima?
Comment 6 (Assignee) • 16 years ago
Since I'm around I went and added a static route.
chizu, how do I run that iperf test?
Comment 7 • 16 years ago
(In reply to comment #6)
> Since I'm around I went and added a static route.
>
> chizu, how do I run that iperf test?
I ran 'iperf -s' on dm-nagios01 and connected to it from each of the ams-zlb hosts with 'iperf -c dm-nagios01'. It's showing a large improvement. 9Mbit/s up from 200Kbit/s earlier.
Assignee: thardcastle → dmoore
Comment 8 • 16 years ago
omg what a difference. Almost a blocker to fix mzima? Such an improvement - perhaps it could point at other issues like we have with Toronto, which probably also goes over mzima? How did we miss this until now?
Comment 9 (Assignee) • 16 years ago
Toronto goes over Level3 (both directions).
I only changed one direction too - SJC to AMS is still over Mzima. AMS to SJC is ISC/he.net. We probably missed it because it's not normally a problem and doesn't get noticed.
Should be fun to escalate with Mzima.
Updated (Assignee) • 16 years ago
Group: infra
Comment 10 (Assignee) • 16 years ago
Amsterdam src is different (63.245.213.6-8) but destination IP is the same - 63.245.208.171 for both im-log02 & im-log03 (just different ports, 3022 & 3023).
Comment 11 (Assignee) • 16 years ago
Okay, the root issue has been resolved. It was link congestion through Mzima. We need a better way to monitor that and pinpoint it as the cause. But this bug's summary is resolved.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Comment 12 (Reporter) • 16 years ago
This issue cropped up again starting at 2010-02-01-13 through 2010-02-01-21. For the first few hours, im-log03 never received the logs. Then, they started appearing, but hours later than expected.
Assignee: dmoore → server-ops
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 13 • 16 years ago
dropping sev to stop the pages for now...
06:37:39 < deinspanjer> That said, I can work around it for now, so I'm fine with leaving it open for dmoore or someone to look at when they get in
Severity: blocker → major
Updated • 16 years ago
Severity: major → critical
Comment 14 • 16 years ago
Everyone who can deal with this is in a car at the moment, and I'm the one getting the pages in the meantime. Per comment 13 this was waiting for the relevant folks to get to the office.
Severity: critical → major
Comment 15 (Reporter) • 16 years ago
That's fine. It likely won't happen again until noon or 1 PST which is the peak traffic time for the Amsterdam cluster.
Comment 16 • 16 years ago
We're investigating. Although mzima is looking a little slow at the moment, the previous workaround is still in place and ISC appears to be performing normally.
Comment 17 • 16 years ago
At this time, we're seeing improved throughput... up to the 4-7Mbps range.
Comment 18 (Reporter) • 16 years ago
Looks like we made it through peak traffic of the ams cluster without a problem today.
I'm working on putting together a consistency script that you guys can call via a nagios plugin, but as far as this specific bug goes, is there some simple Nagios check you can put in that would warn of this particular problem if it happened again?
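The shape of such a check can be sketched as follows. Everything here is a hypothetical illustration, not the consistency script or Nagios plugin that was actually written: the function name, message format, and the decision to treat any missing file as CRITICAL (per comment 2, missing logs are a blocker) are assumptions.

```python
# Hypothetical Nagios-style consistency check: given the log-file listings
# visible on two sync targets, flag files the primary has but the secondary
# is missing. Exit codes follow the Nagios plugin convention (0/1/2).
OK, WARNING, CRITICAL = 0, 1, 2

def check_log_sync(primary_files, secondary_files):
    """Return (nagios_exit_code, message) for two sets of log filenames."""
    missing = sorted(set(primary_files) - set(secondary_files))
    if not missing:
        return OK, "OK: log listings match"
    # Treating any missing log as CRITICAL, per the severity in comment 2.
    return CRITICAL, (
        f"CRITICAL: {len(missing)} file(s) missing: " + ", ".join(missing)
    )

# Example with two of the files comment 1 reports missing on im-log03:
code, msg = check_log_sync(
    {"access_2010-01-16-05.ams-zlb01.nl.mozilla.com_1.gz",
     "access_2010-01-16-11.ams-zlb01.nl.mozilla.com_1.gz",
     "access_2010-01-16-12.ams-zlb01.nl.mozilla.com_1.gz"},
    {"access_2010-01-16-11.ams-zlb01.nl.mozilla.com_1.gz"},
)
print(code, msg)  # code == 2 (CRITICAL), 2 files missing
```

In practice the two listings would come from the mount points shown earlier (e.g. `/mnt/stats_im-log02/...` and `/mnt/stats_im-log03/...`), restricted to hours old enough that the sync should have completed.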
Updated (Assignee) • 16 years ago
Assignee: server-ops → mrz
Comment 19 (Reporter) • 15 years ago
I believe this can be closed again now that we have better monitoring in place.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 15 years ago
Resolution: --- → FIXED
Updated • 10 years ago
Product: mozilla.org → mozilla.org Graveyard