Closed Bug 567992 Opened 9 years ago Closed 9 years ago

bc-proxy01 should always serve the latest version of a file

Categories

(mozilla.org Graveyard :: Server Operations, task, critical)

x86
macOS
task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: jabba)

References

Details

(Whiteboard: [Thursday 6/10 6am])

Filing based on https://bugzilla.mozilla.org/show_bug.cgi?id=567393#c2:

(In reply to comment #2)
> Is the proxy not HEADing stage and checking Last-Modified-By before serving
> from its cache ?


We sometimes have files which get overwritten on stage, and the proxy should be able to deal with this without great delay. I suspect doing something like the above would reduce the painfulness of corrupted files in the cache.
According to this: http://www.comfsm.fm/computing/squid/FAQ-12.html#ss12.20 , that is part of Squid's algorithm for cache freshness. I'm not sure how to troubleshoot further, but I'll start looking through the configuration directives to see if there is something that can be toggled to make these checks more accurate.
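For illustration only, the check being asked about (HEAD the origin, compare Last-Modified with what the cache recorded, refetch on a mismatch) might look like the sketch below. This is not how Squid implements it; the function names and the idea of storing the cached header value as a string are assumptions made for the example. Note that a check like this only catches files that were overwritten on stage; it would not notice a cached copy that was damaged while the origin stayed unchanged.

```python
from email.utils import parsedate_to_datetime
import urllib.request


def origin_last_modified(url, timeout=10):
    """Send a HEAD request to the origin and return its Last-Modified header (or None)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")


def cache_is_stale(cached_last_modified, current_last_modified):
    """Return True when the cached copy should be refetched.

    Both arguments are HTTP date strings. A missing header on either
    side is treated as stale, so the proxy errs toward refetching.
    """
    if not cached_last_modified or not current_last_modified:
        return True
    cached = parsedate_to_datetime(cached_last_modified)
    current = parsedate_to_datetime(current_last_modified)
    # Any change counts, including an overwrite with an older timestamp.
    return current != cached
```

A proxy-side caller would pass the header value it stored when it cached the object, plus the result of origin_last_modified() for the same URL.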
This is becoming a more serious issue. A bunch of files related to the 3.6.4 release were corrupted today, which I had to purge one by one. For releases, this is a very serious problem that can take hours to recover from, because it can take a long time to find all the corrupted files.
Severity: normal → critical
Assignee: server-ops → jdow
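Finding the corrupted files one by one is the slow part. A rough sketch of how that search could be automated, assuming a list of known-good MD5 sums for the release files is available (that list and the fetch callable are hypothetical; nothing in this bug says such a manifest exists):

```python
import hashlib


def md5_of(data):
    """Return the hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def find_corrupted(expected_md5, fetch):
    """Return the paths whose proxied content does not match the expected sum.

    expected_md5: dict mapping path -> known-good MD5 hex digest
    fetch:        callable taking a path and returning the bytes that
                  the proxy currently serves for it (hypothetical)
    """
    corrupted = []
    for path, expected in sorted(expected_md5.items()):
        if md5_of(fetch(path)) != expected.lower():
            corrupted.append(path)
    return corrupted
```

The returned list would then be the set of objects to purge from the cache, instead of hunting for them by hand.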
Here's a bit of investigation I did:
In the latest case, the files were exactly the same size, but with different sums:
-bash-3.2$ ls -l Firefox\ Setup\ 3.6.3.exe
-rwxr-xr-x 1 cltbld mozilla 8187568 May 28 09:10 Firefox Setup 3.6.3.exe
-bash-3.2$ ls -l Firefox\ Setup\ 3.6.3.exe.2
-rwxr-xr-x 1 cltbld mozilla 8187568 May 28 09:10 Firefox Setup 3.6.3.exe.2
-bash-3.2$ md5sum Firefox\ Setup\ 3.6.3.exe
c87fbc334798abba9573087da1570c50  Firefox Setup 3.6.3.exe
-bash-3.2$ md5sum Firefox\ Setup\ 3.6.3.exe.2
eea3826485efe8c992e5b6c959b0b1f7  Firefox Setup 3.6.3.exe.2


A hexdump shows that only 2 nibbles are different:
--- corrupt.hex	2010-05-28 09:11:22.000000000 -0700
+++ good.hex	2010-05-28 09:11:36.000000000 -0700
@@ -424310,11 +424310,11 @@
 0679cf0 12b7 9e13 68ca ad93 fba6 8996 bbdd 9c0a
 0679d00 0769 9ce6 154f b300 0adc ec17 7f01 99bd
 0679d10 e1fc 976d a83f 0027 dfaf d75e 6dc2 1fc4
-0679d20 5c8a 2e0e 7dc2 cfcb 5353 e92f a7b3 3771
+0679d20 5c8a 2e0e 7dc2 cfcb 5355 e92f a7b3 3771
 0679d30 9c54 21eb 988d 206d 6f6d d1f4 5aaf 8ec1
 0679d40 0ae8 4447 50e7 52c0 959a f6bc e1cd 15fc
 0679d50 0cb4 ba6f ab63 a8ff 39f3 bc38 960e 9f49
-0679d60 e9f4 ccb7 3482 57c8 6157 0303 5b1f cbd7
+0679d60 e9f4 ccb7 3482 57c8 6155 0303 5b1f cbd7
 0679d70 f39a 0108 5345 6616 ddc9 7447 9766 3127
 0679d80 02a6 072c e511 2ece c633 ec78 c7e0 9801
 0679d90 661b 6483 f4c4 893f 5e72 aa15 713a 3e2e
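The same comparison can be done without going through hexdump and diff. A minimal sketch, assuming both files are the same size as they were here (any extra tail on a longer file is ignored); the function names are made up for the example:

```python
def differing_offsets(path_a, path_b, chunk_size=65536):
    """Yield (offset, byte_a, byte_b) for each position where the files differ.

    Assumes both files have the same length; bytes past the end of the
    shorter file are not reported.
    """
    offset = 0
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if not chunk_a and not chunk_b:
                return
            for index, (a, b) in enumerate(zip(chunk_a, chunk_b)):
                if a != b:
                    yield offset + index, a, b
            offset += max(len(chunk_a), len(chunk_b))


def flipped_bits(byte_a, byte_b):
    """Count the bits that differ between two byte values."""
    return bin(byte_a ^ byte_b).count("1")
```

Running differing_offsets() on the corrupt and good installers would list each differing byte with its offset, and flipped_bits() shows how many bits changed in each one.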
We have confirmed that the origin file is still valid, right? That is, once you flush the cache the correct file is pulled down... no need to fix it on the web server?
We can try using disk cache exclusively (instead of memory cache) to rule out possible memory errors. The performance difference isn't important for this application.
(In reply to comment #5)
> We can try using disk cache exclusively (instead of memory cache) to rule out
> possible memory errors. The performance difference isn't important for this
> application.

Sounds like it's worth a try, then
I disabled caching in memory.
We had some more corruption this morning.
I think our only option is to upgrade to a current (2.7 or 3.1) release. There are a lot of cache corruption fixes which haven't been backported to the 2.6 release we're running, and we could be hitting any one of them.
(In reply to comment #9)
> I think our only option is to upgrade to a current (2.7 or 3.1) release. There
> are a lot of cache corruption fixes which haven't been backported to the 2.6
> release we're running, and we could be hitting any one of them.

Is that a big deal to do? Is there much risk? Sounds like something we should do in a downtime, if we do it.
I'm not sure there is much risk. Also, in theory, if we take the proxy down, traffic should still flow normally unproxied. Assuming our current pipe to MPT can handle the traffic, this shouldn't require any downtime, as long as dmoore can tell the firewall to gracefully stop forwarding traffic to squid (i.e. don't kill current connections, just stop sending new ones).

What are your thoughts, Derek?
I think the only concern is our lack of ability to gracefully transition connections which are currently open. Shutting down the proxy or disabling it in the router will both result in termination of active connections. This is an instantaneous, one-time hit but I'm not sure how build will want to approach it.
I'm more concerned about what happens when we bring the new version up and potential new issues with it.
Then we should do it during a downtime, so that we leave ourselves enough time to roll back if needed.

When would the next available RelEng downtime be, where we can do the upgrade?
Whiteboard: [needs downtime]
Blocks: 567910
Whiteboard: [needs downtime] → [Thursday 6/10 6am]
jabba and I will be doing this on Thursday morning, 6AM-9AM PDT.
I will send a note the day before.
Squid has been updated to version 2.7 using Red Hat's unofficial RPM. So far it is working fine. Please file a new bug for any future issues, since the version bump changes everything.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard