Closed Bug 430737 Opened 13 years ago Closed 13 years ago
Update snippets sometimes don't propagate to the entire cluster
For the past three releases (Firefox 126.96.36.199 beta, Firefox 188.8.131.52, and now Thunderbird 184.108.40.206 beta), we've had an issue where update snippets don't appear immediately after being pushed. oremj and justdave looked at the servers during the last two releases and found that some of the servers behind the AUS URL didn't have the update snippets at all. We confirmed that throttling was not on. From a "real world" perspective, this means that some users will receive updates while others won't receive an update at all, even when they force the check. This is a blocker for releasing Thunderbird 220.127.116.11 next week.
This is some sort of NFS cache expiration issue. If I do an ls on the file it will say it is not there, but if I ls the directory above the file and then ls the file directly again it will be there.
What changed within the last month? The first time this was an issue was during the beta of Firefox 18.104.22.168. We haven't had a problem before that release.
Kernel upgrade. The previous kernel had an issue with not caching anything at all.
Filed a ticket with redhat support.
Severity: blocker → normal
We had this issue again with 3.0.1 and Jeremy is out. With no directions on how to refresh the files, Aravind tried to refresh them but isn't seeing files on the AUS servers. For the future, we need instructions. I'm sure these will update on their own as the cache expires anyway...
From IRC conversations, I think Jeremy is just remounting the NFS mount for the snippets, on each "webhead" serving AUS.
It's just mount -o remount /mnt/aus2 on all the webheads to refresh the files.
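A minimal sketch of running that remount across all webheads in one go. The host names below are placeholders (the real webhead names aren't given in this bug), and it assumes ssh access with enough privilege to remount. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical host list: the real webhead names are not given in this bug.
WEBHEADS="${WEBHEADS:-webhead1.example.com webhead2.example.com}"
# Mount point taken from the comment above.
MOUNT_POINT="${MOUNT_POINT:-/mnt/aus2}"

# Remount the snippet store on every webhead.
# With DRY_RUN=1 the commands are printed instead of executed.
remount_all() {
    for host in $WEBHEADS; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh $host mount -o remount $MOUNT_POINT"
        else
            ssh "$host" "mount -o remount $MOUNT_POINT" \
                || echo "remount failed on $host" >&2
        fi
    done
}
```

Something like `DRY_RUN=1; remount_all` is a cheap way to check the host list before running it for real.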
Did I hear on IRC that a new kernel was deployed but it's not helping? Are there any other updates that could be noted?
Reply from redhat below:
---------------------------------------------------------------------
I have been working with engineering on this case. As noted earlier (May 19, 2008), even when you get file not found, the NFS debug logs do not have information regarding negative lookup.
In the packet capture I can only see one GETATTR and one ACCESS call for the "Firefox" directory (frames 16 to 19), both of which return an mtime of "May 30, 2008 16:26:52.000000000". This is different from the ctime of the directory, "May 30, 2008 17:14:01.214956000", which is unusual, since directories that have subdirectories added or removed from them will also have a change in the inode (number of links).
Also notice how the ctime timestamp includes microseconds, while the mtime is a full second without microseconds. This is highly unlikely and looks like the mtime was artificially set by a tool such as tar.
Can you please confirm whether you are using any tools like tar. If you are using tar to add data to that volume, have them use the "-m" or "--touch" option to tar so that it does not set mtimes.
---------------------------------------------------------------------
Is it possible something like this is being done to put the snippets in place?
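The mtime/ctime mismatch described in that reply can be checked locally with something like the sketch below. It assumes GNU stat (the -c option), and the function names are made up for illustration. An mtime older than the ctime is only a hint that the mtime was set back explicitly; chmod, chown or a rename also bump ctime without touching mtime.

```shell
#!/bin/sh
# Print a directory's mtime and ctime as epoch seconds (GNU stat assumed).
dir_times() {
    stat -c 'mtime=%Y ctime=%Z' "$1"
}

# Succeed when the mtime is older than the ctime, which is what you see
# after a tool (tar, rsync -t, touch -d, ...) sets the mtime back.
mtime_was_reset() {
    m=$(stat -c %Y "$1")
    c=$(stat -c %Z "$1")
    [ "$m" -lt "$c" ]
}
```

For example, `dir_times /opt/aus2/incoming/3/Firefox` on the server side would show whether the two timestamps have drifted apart.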
Here's the script we use to push snippets live: http://mxr.mozilla.org/mozilla/source/tools/release/bin/pushsnip
Specifically, line 27:
$RSYNC -Pa $STAGING_DIR/$1/ $LIVE_SNIPPET_DIR
which ends up being something like:
/usr/bin/rsync -Pa /opt/aus2/snippets/staging/20080709-Thunderbird-22.214.171.124-test/ /opt/aus2/incoming/3
The -a switch is equivalent to -rlptgoD. Of interest is -t, which is 'preserve times'.
Reply from redhat:
----------------------------------------------------------------------------
Use of rsync with "preserve times" is certainly the cause. It will add information to a directory (that they seem to add incrementally) but set the mtime back so that NFS clients cannot perceive it really changed.
If it were a local filesystem, the kernel would know for sure the directory changed and would have revalidated that information. Since it's NFS, the kernel by specification has to trust the mtime value to detect changes, as it is talking to a server and other clients can affect those directories and files without this client knowing about it. If you are changing a directory but setting its mtime back to what it was, the NFS client cannot detect such a change.
I would suggest that you add a "touch /opt/aus2/incoming/3" after doing the rsync, so that the mtime of the directory is updated and the NFS clients can correctly revalidate their cached information.
----------------------------------------------------------------------------
Can you make that change, so I can close this bug?
Sure, we can give that a go.
Attachment #333244 - Flags: review?(nthomas) → review+
Comment on attachment 333244 [details] [diff] [review]
[checked in] touch LIVE_SNIPPET_DIR to try and fix the propagation problem
Checking in pushsnip;
/cvsroot/mozilla/tools/release/bin/pushsnip,v <-- pushsnip
new revision: 1.2; previous revision: 1.1
done
Okay, next time we do a release we'll see how this goes.
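For reference, a rough sketch of the rsync-then-touch idea suggested in the redhat reply. This is not the actual pushsnip patch (the diff isn't reproduced in this bug); the function name and the RSYNC default are assumptions, and -P is dropped only to keep the output quiet.

```shell
#!/bin/sh
# Assumed default, modelled on the rsync path quoted earlier in this bug.
RSYNC="${RSYNC:-/usr/bin/rsync}"

# Copy a staged snippet directory into the live directory, then bump the
# live directory's mtime so NFS clients can notice that it changed.
# Note: rsync -a (via -t) sets the destination directory's mtime to the
# source's, which is why the touch has to come after the rsync.
push_and_touch() {
    staging_dir=$1
    live_dir=$2
    "$RSYNC" -a "$staging_dir/" "$live_dir" || return 1
    touch "$live_dir"
}
```

Only the top-level live directory gets a fresh mtime here; any directories below it still carry whatever times rsync preserved.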
Attachment #333244 - Attachment description: touch LIVE_SNIIPET_DIR to try and fix the propagation problem → [checked in] touch LIVE_SNIPPET_DIR to try and fix the propagation problem
Whiteboard: Waiting for redhat fix. → Waiting on a release to test.
I updated pushsnip for bug 394046, but I'm not sure rolling out the major update test verifies a fix here. IIRC we've had no problem with new snippets, just with replacing old ones.
Should we keep this open until the next release, or assume it's fixed, close it, and reopen (or verify) after the next release?
May as well resolve fixed - it seems pretty likely this will fix the issue.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Alas, it didn't. The Fx126.96.36.199 and Fx3.0.2 snippets that got pushed are not showing up consistently, indicating that some of the machines behind AUS haven't picked up the new snippets. The clock is right on aus2-staging (bug 455965), and the top level is getting touched correctly:
$ ls -la /opt/aus2/incoming/3
total 24
drwxrwxr-x 5 cltbld cltbld 4096 Sep 23 16:32 .
drwxr-xr-x 6 cltbld cltbld 4096 May 23 2007 ..
drwxrwxr-x 55 cltbld cltbld 8192 Aug 29 18:18 Firefox
To resolve the blocker part of this, please remount the snippet store. Then we can debug this again.
Severity: major → blocker
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
fyi: in case it helps with debugging, the FF3.0.2 snippets were pushed at 4.29pm, and the FF188.8.131.52 snippets were pushed at 4.33pm.
From quick chat with oremj, the theory is that the shared file system behind AUS is confused by the older timestamps of the files generated in the release process. For next release, do rsync/pushsnip as usual, and if we hit this problem:
- go into the directories and "touch" each file.
- wait 4mins to allow for 2min timeout scattered on machines with slightly different clocks.
- see if newly touched snippets are now visible.
- if snippets visible, this proves the theory. :-)
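The "touch each file" step above could be done with a small helper like this sketch. The function name is made up, and the 240-second wait in the usage note simply mirrors the "4mins" in the list.

```shell
#!/bin/sh
# Give every regular file under a snippet directory a fresh mtime.
# Directories themselves are left alone; only files are touched.
touch_snippet_files() {
    find "$1" -type f -exec touch {} +
}
```

Usage would be along the lines of `touch_snippet_files <snippet dir>; sleep 240`, followed by re-checking the update URL.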
Alright, I'll try that when we ship tb184.108.40.206 on thursday.
Severity: blocker → major
We did hit this with Tb220.127.116.11, so I touched
/opt/aus2/incoming/3/Thunderbird/18.104.22.168/WINNT_x86-msvc/2008070808/en-US/release/*
and waited 5 minutes, but still got intermittent empty updates at
https://aus2.mozilla.org/update/1/Thunderbird/22.214.171.124/2008070808/WINNT_x86-msvc/en-US/release/update.xml
oremj reports that touching every intermediate dir from /opt/aus2/incoming/3 down to the snippets gets all the AUS heads working. I'll investigate whether we can avoid preserving times on the push.
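A sketch of the "touch every intermediate dir" workaround, in case it has to be repeated by hand. The function name is hypothetical; it walks from the leaf directory back up to the root and touches each directory on the way, including both ends.

```shell
#!/bin/sh
# Touch every directory on the path from a leaf directory up to (and
# including) a root directory. The leaf must live under the root.
touch_dir_chain() {
    root=${1%/}
    dir=${2%/}
    case "$dir/" in
        "$root"/*) ;;
        *) echo "$dir is not under $root" >&2; return 1 ;;
    esac
    while :; do
        touch "$dir" || return 1
        [ "$dir" = "$root" ] && break
        dir=$(dirname "$dir")
    done
}
```

It would be called as `touch_dir_chain /opt/aus2/incoming/3 <release dir under it>`, once per snippet directory that was pushed.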
Assignee: oremj → nthomas
Happened again, while testing snippets pushed to beta channel for FF3.0.3. I touched the updates which were reported by QA to not appear, and then QA was able to see them.
From comment #12 and comment #23, we should let the OS update the modification timestamps on the directories, so that NFS knows when to invalidate its cache. This patch uses rsync's -O option while keeping the -t for the files (via -a). From the man page on aus2-staging:
-O, --omit-dir-times omit directories when preserving times
If this doesn't work, then I've got a patch that rsyncs a copy of the snippets without preserving time, then publishes the copy.
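Roughly what the -O variant looks like, as a sketch rather than the patch itself (the function name and RSYNC default are assumptions). Files keep their original mtimes via -t, while directory mtimes are left to whatever the OS sets when entries are created in them.

```shell
#!/bin/sh
# Assumed default, modelled on the rsync path quoted earlier in this bug.
RSYNC="${RSYNC:-/usr/bin/rsync}"

# Like rsync -a, but -O (--omit-dir-times) skips directories when
# preserving times, so a directory's mtime reflects the actual push.
push_omit_dir_times() {
    "$RSYNC" -a -O "$1/" "$2"
}
```

A quick local check is to push a tree with old timestamps into an empty directory and compare `ls -ld` on a destination subdirectory with `ls -l` on a file inside it.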
Attachment #345046 - Flags: review?(bhearsum)
Comment on attachment 345046 [details] [diff] [review]
[checked in] Exclude directories from rsync's preserve times
Oh, this also removes the pushd & popd, which have no effect when the rsync and touch use absolute paths.
(In reply to comment #26) Better phrasing would be ".. have no effect _because_ the rsync and touch use absolute paths".
Whiteboard: Waiting on a release to test.
Attachment #345046 - Flags: review?(bhearsum) → review+
Comment on attachment 345046 [details] [diff] [review]
[checked in] Exclude directories from rsync's preserve times
Checking in pushsnip;
/cvsroot/mozilla/tools/release/bin/pushsnip,v <-- pushsnip
new revision: 1.3; previous revision: 1.2
done
And cvs up'd on aus2-staging.
Attachment #345046 - Attachment description: Exclude directories from rsync's preserve times → [checked in] Exclude directories from rsync's preserve times
Might as well bring this over to RelEng.
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Whiteboard: Waiting on a release to test
Assuming this is fixed until I hear otherwise from QA.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
This is the brute force solution I mentioned in comment #25. It's had some testing but could do with some more.
Product: mozilla.org → Release Engineering