Closed Bug 567847 Opened 10 years ago Closed 10 years ago

[tracking] intermittent caching problems in build proxy

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
All
task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aki, Assigned: jabba)

References

Details

(Whiteboard: [needs releng downtime])

We occasionally hit issues with certain files getting a corrupt version cached in the build proxy, causing breakage for all tests that reference that file.

e.g.

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1274399232.1274399865.15943.gz

C:\Windows\system32\cmd.exe /c unzip -o firefox-3.7a5pre.en-US.win32.zip

...

  inflating: firefox/xul.dll          bad CRC ff3b34a5  (should be f3a57fd9)
program finished with exit code 2
elapsedTime=0.985000

Once we clear the cache of that file, it starts working again.

We'd like a more consistent solution, but we're not sure what that would be. I wouldn't be surprised if we don't fully fix this until we move things into PHX, but if we do find a solution before then, that would be awesome.

Not sure how to represent the relationship with bug 555794 (mpt<->mv timeouts); I'll mark it as a dependency for the moment.
There isn't any plan right now to move build infra to phoenix.  More likely to a nearby colo facility.
k.

s,PHX,other location with big pipe to hg+stage, :)
Assignee: server-ops → jdow
Raising to critical as this is breaking builds in production.
Severity: normal → critical
jabba wants to try upgrading squid during downtime this week, and see if that helps.
Adding dmoore to CC. Will need a small amount of coordination with him to flip WCCP routing to squid on and off during upgrade.
Squid has been updated to 2.7. I think this issue will be resolved by that, so I'm changing this to be a tracking bug and lowering priority. If we don't have any more caching problems in a week, this bug shall also be closed.
Severity: critical → normal
Summary: intermittent caching problems in build proxy → [tracking] intermittent caching problems in build proxy
Whiteboard: [close bug on 6/16 if no more problems]
Depends on: 567992
Looks like the upgrade worked. Haven't heard anything go wrong since last week, so closing the bug.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Is this the only failure we've had since the update? If we only see one failure in a two-week timespan, I'm not sure if there is much troubleshooting that can be done. Is this something we can live with until a future move to a new colo?
No, it's not just one in two weeks - I'd say more like three in the last two days, and that's just the ones I know about. They don't show up on tinderboxpushlog, and nobody uses tinderbox (or if they do, they don't ever treat ten burning boxes as something worth mentioning), so the only way these ever get seen is when I'm watching one of the IRC channels where firebot mentions a dozen tests all burning at once.

You *might* be able to get a better picture of the frequency by scanning through the tail of http://kuix.de/mozilla/tinderboxstat/dump.txt looking for times when the unstarred red count jumps up to ten. The one last night was against pattern, though: rather than hitting all of the (opt or debug) tests for a platform, it only hit half of them, which really shouldn't happen with the way I thought the cache was supposed to work.
Philor just ran into this with http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx-debug/1278547113/firefox-4.0b2pre.en-US.mac.tests.zip . I'm looking for the command to clear the cache.

But yeah, one file in two weeks does seem relatively minor?
Do you have any reason, any reason at all, to believe it's one in two weeks?

Isn't the only reason you know about that one that I was looking at #firebot right at the time the bot announced ten things had started burning all at once? If I hadn't pinged you about it, would you have looked at tinderbox (remember, not tbpl, tinderbox) sometime in the next hour or so, before the next push turned everything green? Would anyone else? Do you have some other monitoring, besides me, that would alert you to most or all of the tests against a particular changeset for a particular OS going red?
(In reply to comment #12)
> Do you have any reason, any reason at all, to believe it's one in two weeks?

True.

(In reply to comment #11)
> I'm looking for the command to clear the cache.

Got it, just had the machine name wrong (bc-proxy01); once I had that, history gives me the command |/usr/sbin/squidclient -m purge <URL>| :)
http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64-debug/1278605245/firefox-4.0b2pre.en-US.linux-x86_64.tests.zip

And while I should have put it more politely, I should also have put it more strongly and certainly: we do not know what the frequency is, the only thing that we do know is that it is certainly more than one in two weeks.
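
For reference, evicting a single cached object looks something like this, run on the proxy itself (the URL is just the linux64 example above; -h bc-proxy01 assumes the proxy accepts PURGE requests from wherever you run this):

  /usr/sbin/squidclient -h bc-proxy01 -m purge http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64-debug/1278605245/firefox-4.0b2pre.en-US.linux-x86_64.tests.zip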
So, I'm not sure what the next step is here. If Squid is causing more problems than it is solving, perhaps we should look towards other solutions? It will likely still be a while before we are able to migrate the minis to a dedicated data center.

Would something like clearing the cache on a nightly basis help out? Or maybe only using the cache during office hours and using direct bandwidth at night when nobody is in the office?
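
For illustration, a nightly wipe would probably be a cron entry along these lines; the cache directory, service name, and schedule are all assumptions, not what's actually configured on bc-proxy01:

  # /etc/cron.d/squid-cache-wipe (sketch): stop squid, drop the cache, re-init the swap dirs, restart
  0 3 * * * root service squid stop && rm -rf /var/spool/squid/* && squid -z && service squid start

That trades the corruption risk for an empty (cold) cache every morning, so the first downloads of the day wouldn't see any bandwidth savings.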
Ops+RelEng discussed this productively and have ideas; joduinn has the comprehensive list.
Step 1 is generating checksums + comparing on download, on RelEng.
Assigning to RelEng for the checksum implementation.
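Roughly, the idea is something like the following (just a sketch; the hash algorithm and the .checksums naming are assumptions, the actual implementation is up to RelEng):

  # on the machine producing the package, before upload
  sha512sum firefox-3.7a5pre.en-US.win32.zip > firefox-3.7a5pre.en-US.win32.checksums
  # upload the .checksums file alongside the zip

  # on the slave, after fetching both through the proxy
  sha512sum -c firefox-3.7a5pre.en-US.win32.checksums || echo "corrupt download (probably a bad cached copy)"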
Assignee: jdow → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Whiteboard: [close bug on 6/16 if no more problems]
Assignee: nobody → jhford
I am working on generating the checksums file in bug 578393.  I am going to make this bug dependent on that bug, as I feel it is important to at least know when and how frequently this corruption happens while we debug this problem.
(In reply to comment #16)
> Ops+RelEng discussed this productively and have ideas; joduinn has the
> comprehensive list.
> Step 1 is generating checksums + comparing on download, on RelEng.

bug#578393 is tracking the "generate checksum + compare on upload/download" work.
That does not fix the underlying proxy/cache problems; it is just an attempt to
detect problems quickly.

(In reply to comment #17)
> Assigning to RelEng for the checksum implementation.
Nope, this bug lives in ServerOps, to track work by IT to fix the squid proxy. In
our meeting last week there were several suggestions, including one to turn
off pipelining in the squid configs. When can we try these squid config changes?
Assignee: jhford → jdow
Component: Release Engineering → Server Operations
QA Contact: release → mrz
(In reply to comment #18)
> I am working on generating the checksums file in bug 578393.  I am going to
> make this bug dependent on that bug as I feel that it is important to at least
> know when and how frequent this corruption is while we debug this problem.

It's useful to have checksums to help detect problems more quickly. However, checksums do not prevent IT from changing proxy configs, so bug#578393 is not a blocker imho.
(In reply to comment #19)
> bug#578393 is tracking the "generate checksum + compare on upload/download".
> That does not fix the underlying proxy/cache problems, it is just an attempt to
> detect problems quickly. 

One of my goals is to clear the proxy cache whenever a corrupt build is hit and try to download the build again.  This is in addition to getting notified that there was cache corruption.
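
In rough terms, the download step would do something like the following; the .checksums file, the wget usage, and the assumption that the proxy accepts PURGE from the slave are all illustrative, not the actual implementation:

  URL=http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx-debug/1278547113/firefox-4.0b2pre.en-US.mac.tests.zip
  FILE=firefox-4.0b2pre.en-US.mac.tests.zip
  wget -O "$FILE" "$URL"
  if ! sha512sum -c "$FILE.checksums"; then
      # evict the corrupt copy from the proxy and fetch once more
      /usr/sbin/squidclient -h bc-proxy01 -m purge "$URL"
      wget -O "$FILE" "$URL"
      sha512sum -c "$FILE.checksums" || exit 2
  fi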
Sorry, I didn't realize there was a separate bug for generating checksums. We can modify the pipeline directive (actually called collapsed forwarding) anytime, but we should probably do it during a downtime, since I don't know what would happen to downloads in progress. When would be a good time to do this?
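For reference, the config change itself would look something like this on the proxy (the squid.conf path is an assumption; squid -k reconfigure makes the running process re-read its config):

  # turn off collapsed forwarding, so concurrent requests for the same object are fetched independently
  sed -i 's/^collapsed_forwarding on/collapsed_forwarding off/' /etc/squid/squid.conf
  squid -k reconfigure

The open question is what a reconfigure does to transfers that are mid-flight, which is why doing it during a downtime is safer.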
Bug 575969 tracks a downtime for this week.  Maybe check with Lukas to see if this is acceptable then.
Something is wrong with squid. It doesn't start cleanly and there are core dumps. It is currently off.
It looks like squid got restarted somehow and has indeed been running the last week or so with the collapsed forwarding directive turned off. How are downloads going? Are there any more or less corruption issues? Also, are we still seeing any bandwidth benefits despite the collapsed forwarding being off?
I didn't monitor the bandwidth over the last week but I did see one or two instances of a corrupted file. Not sure if that's more or less than before, though.
Ok, the next step then is to rebuild bc-proxy01 with Ubuntu 10.04, which ships with squid 3.0 or 2.7. The version of 2.7 we are running on RHEL right now isn't officially supported, so this should help rule out any problems there. I'll get that box up this week.
Or maybe Fedora 13, since it might have an even more up-to-date version of squid.
I've got a new proxy configured. This one is running Ubuntu 10.04, which comes with Squid 3.0. I'd like to start using it instead of the current one. The new proxy is on mv-buildproxy01. I'm currently configuring a DNS cache on that box as well, so that we can then turn off the old bc-proxy01.

Theoretically moving this proxy over shouldn't need a downtime, but I think I'd prefer to do the switch during a downtime anyway just to play it safe. I'll need myself, dmoore and whoever is on buildduty to agree on a time.
Depends on: 584410
Still need to get the wccp interface up on mv-buildproxy01. I filed bug 584410 to track that.
Ok, wccp0 interface is up and everything appears ready to go for the switchover.
Whiteboard: [needs releng downtime]
Blocks: 586793
The new squid proxy is in operation. Closing this bug for now. If any more problems arise, then re-open.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard