Closed
Bug 430737
Opened 16 years ago
Closed 16 years ago
Update snippets sometimes don't propagate to the entire cluster
Categories
(Release Engineering :: General, defect, P2)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: samuel.sidler+old, Assigned: nthomas)
Details
(Whiteboard: Waiting on a release to test)
Attachments
(3 files)
481 bytes, patch | nthomas: review+
699 bytes, patch | bhearsum: review+
1.67 KB, patch
For the past three releases (Firefox 2.0.0.14 beta, Firefox 2.0.0.14, and now Thunderbird 2.0.0.14 beta), we've had an issue where update snippets don't appear immediately after being pushed. oremj and justdave looked at the servers during the last two releases and found that some of the servers behind the AUS URL didn't have the update snippets at all. We confirmed that throttling was not on. From a "real world" perspective, this means that some users will receive updates while others, even when they force the check, won't receive an update at all. This is a blocker for releasing Thunderbird 2.0.0.14 next week.
Comment 1•16 years ago
|
||
This is some sort of NFS cache expiration issue. If I do an ls on the file it will say it is not there, but if I ls the directory above the file and then ls the file directly again it will be there.
Comment 2•16 years ago
|
||
What changed within the last month? The first time this was an issue was during the beta of Firefox 2.0.0.14. We haven't had a problem before that release.
Comment 3•16 years ago
|
||
Kernel upgrade. The previous kernel had an issue with not caching anything at all.
Comment 4•16 years ago
|
||
Filed a ticket with redhat support.
Updated•16 years ago
|
OS: Mac OS X → All
Hardware: PC → All
Updated•16 years ago
|
Severity: major → blocker
Updated•16 years ago
|
Severity: normal → major
Reporter | ||
Comment 6•16 years ago
|
||
We had this issue again with 3.0.1 and Jeremy is out. With no directions on how to refresh the files, Aravind tried to refresh them but isn't seeing files on the AUS servers. For the future, we need instructions. I'm sure these will update on their own as the cache expires anyway...
Assignee | ||
Comment 7•16 years ago
|
||
From IRC conversations, I think Jeremy is just remounting the NFS mount for the snippets, on each "webhead" serving AUS.
Comment 8•16 years ago
|
||
It's just mount -o remount /mnt/aus2 on all the webheads to refresh the files.
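A minimal sketch of that remount step, assuming ssh access to each webhead. The host names are passed in by the caller (none are given in this bug), the function name and the DRY_RUN switch are hypothetical, and only the mount command itself comes from the comment above.

```shell
# Sketch only: remount the snippet store on each AUS webhead.
# Hosts are supplied by the caller; DRY_RUN (hypothetical) defaults to 1,
# so the commands are printed rather than executed.
remount_snippets() {
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "ssh $host mount -o remount /mnt/aus2"
        else
            ssh "$host" mount -o remount /mnt/aus2 ||
                echo "remount failed on $host" >&2
        fi
    done
}
```

Running it with DRY_RUN=0 would execute the remounts for real; with the default it only lists what would be run.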
Updated•16 years ago
|
QA Contact: justin → mrz
Assignee | ||
Comment 9•16 years ago
|
||
Did I hear on IRC that a new kernel was deployed but it's not helping? Are there any other updates that could be noted?
Updated•16 years ago
|
Whiteboard: Waiting for redhat fix.
Comment 10•16 years ago
|
||
Reply from redhat below:
---------------------------------------------------------------------
I have been working with engineering on this case. As noted earlier (May 19, 2008), even when you get file not found, the NFS debug logs do not have information regarding a negative lookup.

In the packet capture I can only see one GETATTR and one ACCESS call for the "Firefox" directory (frames 16 to 19), both of which return an mtime of "May 30, 2008 16:26:52.000000000". This is different from the ctime of the directory, "May 30, 2008 17:14:01.214956000", which is unusual, since directories that have subdirectories added or removed from them will also have a change in the inode (number of links). Also notice how the ctime timestamp includes microseconds, while the mtime is a full second without microseconds. This is highly unlikely and looks like the mtime was artificially set by a tool such as tar.

Can you please confirm whether you are using any tools like tar? If you are using tar to add data to that volume, have them use the "-m" or "--touch" option to tar so that it does not set mtimes.
---------------------------------------------------------------------
Is it possible something like this is being done to put the snippets in place?
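The mtime/ctime mismatch described in that reply can be checked directly on the exporting host. This is a rough sketch, assuming GNU coreutils stat (the -c format option); the function name is hypothetical. An mtime older than the ctime is only a hint: an explicit mtime reset (tar, rsync -t, touch -t) produces it, but so does a plain chmod or chown.

```shell
# Sketch only: print a directory's mtime and ctime and flag the case
# where mtime is older than ctime. Assumes GNU stat (-c).
# Returns 0 when mtime < ctime, 1 when they agree to the second.
check_dir_times() {
    dir=$1
    mtime=$(stat -c %Y "$dir") || return 2   # mtime, seconds since epoch
    ctime=$(stat -c %Z "$dir") || return 2   # ctime, seconds since epoch
    echo "mtime: $(stat -c %y "$dir")"
    echo "ctime: $(stat -c %z "$dir")"
    if [ "$mtime" -lt "$ctime" ]; then
        echo "mtime is older than ctime: possibly set explicitly"
        return 0
    fi
    echo "mtime and ctime agree to the second"
    return 1
}
```

For example, check_dir_times /opt/aus2/incoming/3/Firefox would show the two timestamps the packet capture was compared against.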
Comment 11•16 years ago
|
||
Here's the script we use to push snippets live:
http://mxr.mozilla.org/mozilla/source/tools/release/bin/pushsnip

Specifically, line 27:
  $RSYNC -Pa $STAGING_DIR/$1/ $LIVE_SNIPPET_DIR

which ends up being something like:
  /usr/bin/rsync -Pa /opt/aus2/snippets/staging/20080709-Thunderbird-2.0.0.16-test/ /opt/aus2/incoming/3

The -a switch is equivalent to -rlptgoD. Of interest is -t, which is 'preserve times'.
Comment 12•16 years ago
|
||
Reply from redhat:
----------------------------------------------------------------------------
Use of rsync with "preserve times" is certainly the cause. It will add information to a directory (that they seem to add incrementally) but set the mtime back so that NFS clients cannot perceive it really changed.

If it were a local filesystem, the kernel would know for sure the directory changed and would have revalidated that information. Since it's NFS, the kernel by specification has to trust the mtime value to detect changes, as it is talking to a server and other clients can affect those directories and files without this client knowing about it. If you are changing a directory but setting its mtime back to what it was, the NFS client cannot detect such a change.

I would suggest that you add a "touch /opt/aus2/incoming/3" after doing the rsync, so that the mtime of the directory is updated and the NFS clients can correctly revalidate their cached information.
----------------------------------------------------------------------------
Can you make that change, so I can close this bug?
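The suggestion in that reply could be sketched roughly as follows. This is not the actual pushsnip change; the function name and arguments are hypothetical, and only the rsync -a and touch steps come from the comments above. Note that only the top-level destination directory is touched here, so any subdirectories still carry the mtimes rsync copied from the source.

```shell
# Sketch only: push a staging directory live, then bump the live
# directory's mtime so it no longer matches the preserved source time.
push_and_touch() {
    src=${1%/}
    dest=${2%/}
    # -a implies -t, so rsync sets dest's mtime to match src's.
    rsync -a "$src"/ "$dest"/ || return 1
    # Update the top-level directory's mtime after the copy.
    touch "$dest"
}
```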
Comment 13•16 years ago
|
||
Sure, we can give that a go.
Comment 14•16 years ago
|
||
Attachment #333244 -
Flags: review?(nthomas)
Assignee | ||
Updated•16 years ago
|
Attachment #333244 -
Flags: review?(nthomas) → review+
Comment 15•16 years ago
|
||
Comment on attachment 333244
[checked in] touch LIVE_SNIPPET_DIR to try and fix the propagation problem

Checking in pushsnip;
/cvsroot/mozilla/tools/release/bin/pushsnip,v <-- pushsnip
new revision: 1.2; previous revision: 1.1
done

Okay, next time we do a release we'll see how this goes.
Attachment #333244 -
Attachment description: touch LIVE_SNIIPET_DIR to try and fix the propagation problem → [checked in] touch LIVE_SNIPPET_DIR to try and fix the propagation problem
Updated•16 years ago
|
Whiteboard: Waiting for redhat fix. → Waiting on a release to test.
Assignee | ||
Comment 16•16 years ago
|
||
I updated pushsnip for bug 394046, but I'm not sure rolling out the major update test verifies a fix here. IIRC we've had no problem with new snippets, just with replacing old ones.
Comment 17•16 years ago
|
||
Should we keep it open until the next release, or assume it's fixed, close it, and reopen or verify on the next release?
Comment 18•16 years ago
|
||
May as well resolve fixed - it seems pretty likely this will fix the issue.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 19•16 years ago
|
||
Alas, it didn't. The Fx2.0.0.17 and Fx3.0.2 snippets that got pushed are not showing up consistently, indicating that some of the machines behind AUS haven't picked up the new snippets. The clock is right on aus2-staging (bug 455965), and the top level is getting touched correctly:

$ ls -la /opt/aus2/incoming/3
total 24
drwxrwxr-x 5 cltbld cltbld 4096 Sep 23 16:32 .
drwxr-xr-x 6 cltbld cltbld 4096 May 23 2007 ..
drwxrwxr-x 55 cltbld cltbld 8192 Aug 29 18:18 Firefox

To resolve the blocker part of this, please remount the snippet store. Then we can debug this again.
Severity: major → blocker
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 20•16 years ago
|
||
fyi: in case it helps with debugging, the FF3.0.2 snippets were pushed at 4.29pm, and the FF2.0.0.17 snippets were pushed at 4.33pm.
Comment 21•16 years ago
|
||
From a quick chat with oremj, the theory is that the shared file system behind AUS is confused by the older timestamps of the files generated in the release process. For the next release, do rsync/pushsnip as usual, and if we hit this problem:
- go into the directories and "touch" each file.
- wait 4 minutes to allow for the 2-minute timeout scattered on machines with slightly different clocks.
- see if the newly touched snippets are now visible.
- if the snippets are visible, this proves the theory. :-)
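The "touch each file" step above might look like this. It is a sketch under assumptions: the function name is hypothetical, and it touches regular files only, leaving directory mtimes as they were.

```shell
# Sketch only: update the mtime of every regular file under a
# snippet directory. Directory mtimes are not modified.
touch_snippet_files() {
    find "$1" -type f -exec touch {} +
}
```

Following the steps above, this would be run against the affected snippet directory, followed by a wait of about 4 minutes (for example, sleep 240) before checking whether the snippets are visible.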
Assignee | ||
Comment 22•16 years ago
|
||
Alright, I'll try that when we ship tb2.0.0.17 on thursday.
Severity: blocker → major
Assignee | ||
Comment 23•16 years ago
|
||
We did hit this with Tb2.0.0.17, so I touched
  /opt/aus2/incoming/3/Thunderbird/2.0.0.16/WINNT_x86-msvc/2008070808/en-US/release/*
and waited 5 minutes, but still got intermittent empty updates at
  https://aus2.mozilla.org/update/1/Thunderbird/2.0.0.16/2008070808/WINNT_x86-msvc/en-US/release/update.xml

oremj reports that touching every intermediate dir from /opt/aus2/incoming/3 down to the snippets gets all the AUS heads working. I'll investigate whether we can avoid preserving times on the push.
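Touching every intermediate directory between the snippet root and a snippet directory could be sketched as below. The exact commands oremj ran are not recorded in this bug, so the function name and its two arguments are hypothetical.

```shell
# Sketch only: touch a leaf directory and each of its parents,
# up to and including the given root.
touch_dir_chain() {
    root=${1%/}
    dir=${2%/}
    # Refuse to walk upward from a path that is not under root.
    case "$dir/" in
        "$root"/*) ;;
        *) echo "error: $dir is not under $root" >&2; return 1 ;;
    esac
    while :; do
        touch "$dir" || return 1
        [ "$dir" = "$root" ] && break
        dir=$(dirname "$dir")
    done
}
```

For the case above, the root would be /opt/aus2/incoming/3 and the leaf the en-US/release directory that was touched by hand. Sibling directories outside that chain are left alone.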
Assignee: oremj → nthomas
Comment 24•16 years ago
|
||
Happened again, while testing snippets pushed to beta channel for FF3.0.3. I touched the updates which were reported by QA to not appear, and then QA was able to see them.
Assignee | ||
Updated•16 years ago
|
Priority: -- → P2
Assignee | ||
Comment 25•16 years ago
|
||
From comment #12 and comment #23, we should let the OS update the modification timestamps on the directories, so that NFS knows when to invalidate its cache. This patch uses rsync's -O option while keeping -t for the files (via -a). From the man page on aus2-staging:
  -O, --omit-dir-times omit directories when preserving times
If this doesn't work, then I've got a patch that rsyncs a copy of the snippets without preserving times, then publishes the copy.
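A rough illustration of the -O approach, not the attached patch itself. The function name and arguments are hypothetical; -a and -O are the rsync options named in the comment above.

```shell
# Sketch only: copy snippets while preserving file mtimes (-t via -a)
# but leaving directory mtimes to the OS (-O, --omit-dir-times).
push_omit_dir_times() {
    src=${1%/}
    dest=${2%/}
    rsync -aO "$src"/ "$dest"/
}
```

With -O, a directory's mtime is whatever the filesystem sets when entries are created or replaced in it, rather than the value copied from staging. A directory in which nothing changes keeps its existing mtime.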
Attachment #345046 -
Flags: review?(bhearsum)
Assignee | ||
Comment 26•16 years ago
|
||
Comment on attachment 345046
[checked in] Exclude directories from rsync's preserve times

Oh, this also removes the pushd & popd, which have no effect when the rsync and touch use absolute paths.
Assignee | ||
Comment 27•16 years ago
|
||
(In reply to comment #26) Better phrasing would be ".. have no effect _because_ the rsync and touch use absolute paths".
Whiteboard: Waiting on a release to test.
Updated•16 years ago
|
Attachment #345046 -
Flags: review?(bhearsum) → review+
Assignee | ||
Comment 28•16 years ago
|
||
Comment on attachment 345046
[checked in] Exclude directories from rsync's preserve times

Checking in pushsnip;
/cvsroot/mozilla/tools/release/bin/pushsnip,v <-- pushsnip
new revision: 1.3; previous revision: 1.2
done

And cvs up'd on aus2-staging.
Attachment #345046 -
Attachment description: Exclude directories from rsync's preserve times → [checked in] Exclude directories from rsync's preserve times
Assignee | ||
Comment 29•16 years ago
|
||
Might as well bring this over to RelEng.
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Whiteboard: Waiting on a release to test
Assignee | ||
Comment 30•16 years ago
|
||
Assuming this is fixed until I hear otherwise from QA.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 31•16 years ago
|
||
This is the brute force solution I mentioned in comment #25. It's had some testing but could do with some more.
Updated•11 years ago
|
Product: mozilla.org → Release Engineering