
Update snippets sometimes don't propagate to the entire cluster

RESOLVED FIXED

Status

Release Engineering
General
P2
major
RESOLVED FIXED
10 years ago
4 years ago

People

(Reporter: Samuel Sidler (old account; do not CC), Assigned: nthomas)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: Waiting on a release to test, URL)

Attachments

(3 attachments)

For the past three releases (Firefox 2.0.0.14 beta, Firefox 2.0.0.14, and now Thunderbird 2.0.0.14 beta), we've had an issue where update snippets don't appear immediately after they are pushed.

oremj and justdave looked at the servers during the last two releases and found that some of the servers behind the AUS URL didn't have the update snippets at all. We confirmed that throttling was not on.

From a "real world" perspective, this means that some users will receive updates while others, even when they force the check, won't receive an update at all.

This is a blocker for releasing Thunderbird 2.0.0.14 next week.

Comment 1

10 years ago
This is some sort of NFS cache expiration issue.  If I do an ls on the file it will say it is not there, but if I ls the directory above the file and then ls the file directly again it will be there.

What changed within the last month? The first time this was an issue was during the beta of Firefox 2.0.0.14. We haven't had a problem before that release.

Comment 3

10 years ago
Kernel upgrade.  The previous kernel had an issue with not caching anything at all.

Comment 4

10 years ago
Filed a ticket with redhat support.
OS: Mac OS X → All
Hardware: PC → All

Updated

10 years ago
Assignee: server-ops → oremj
Severity: major → blocker

Comment 5

9 years ago
Files refreshed.
Severity: blocker → normal
Severity: normal → major
We had this issue again with 3.0.1 and Jeremy is out. With no directions on how to refresh the files, Aravind tried to refresh them but isn't seeing files on the AUS servers. For the future, we need instructions. I'm sure these will update on their own as the cache expires anyway...
(Assignee)

Comment 7

9 years ago
From IRC conversations, I think Jeremy is just remounting the NFS mount for the snippets, on each "webhead" serving AUS.

Comment 8

9 years ago
It's just mount -o remount /mnt/aus2 on all the webheads to refresh the files.
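
For the record, a rough sketch of what "on all the webheads" could look like as a loop. The host names below are placeholders (the real AUS host list isn't in this bug), it assumes ssh access with enough privilege to remount, and it defaults to a dry run that only prints the commands.

```shell
#!/bin/bash
# Hypothetical webhead names; substitute the real AUS hosts.
WEBHEADS="webhead1.example.com webhead2.example.com"
MOUNT_POINT=/mnt/aus2

# Print (DRY_RUN=1, the default) or run the remount on every webhead.
remount_all() {
    local host
    for host in $WEBHEADS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: show what would be executed.
            echo "ssh $host mount -o remount $MOUNT_POINT"
        else
            ssh "$host" mount -o remount "$MOUNT_POINT"
        fi
    done
}

remount_all
```

Run it with DRY_RUN=0 once the printed commands look right.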

Updated

9 years ago
QA Contact: justin → mrz
(Assignee)

Comment 9

9 years ago
Did I hear on IRC that a new kernel was deployed but it's not helping? Are there any other updates that could be noted?

Updated

9 years ago
Whiteboard: Waiting for redhat fix.
Reply from redhat below:

---------------------------------------------------------------------
I have been working with engineering on this case.

As noted earlier (May 19, 2008), even when you get file not found, the NFS debug logs do not have information regarding negative lookup.

In the packet capture I can only see one GETATTR and one ACCESS calls for the "Firefox" directory (frames 16 to 19), both of which return an mtime of "May 30, 2008 16:26:52.000000000".

This is different from the ctime of the directory, "May 30, 2008 17:14:01.214956000", which is unusual, since directories that have subdirectories added or removed from it will also have a change in the inode (number of links). Also notice how the ctime timestamp includes microseconds, while the mtime is a full second without microseconds. This is highly unlikely and looks like the mtime was artificially set by a tool such as tar.

Can you please confirm whether you are using any tools like tar. 

If you are using tar to add data to that volume, have them use the "-m" or "--touch" option to tar so that it does not set mtimes.
---------------------------------------------------------------------

Is it possible something like this is being done to put the snippets in place?
Here's the script we use to push snippets live:
http://mxr.mozilla.org/mozilla/source/tools/release/bin/pushsnip

Specifically, line 27:
$RSYNC -Pa $STAGING_DIR/$1/ $LIVE_SNIPPET_DIR

which ends up being something like...:
/usr/bin/rsync -Pa /opt/aus2/snippets/staging/20080709-Thunderbird-2.0.0.16-test/ /opt/aus2/incoming/3

The -a switch is equivalent to -rlptgoD.
Of interest is -t, which is 'preserve times'.
Reply from redhat:

----------------------------------------------------------------------------
Use of rsync with "preserve times" is certainly the cause. It will add information to a directory (that they seem to add incrementally) but set the mtime back so that NFS clients cannot perceive it really changed.

If it would be a local filesystem, the kernel would know for sure the directory changed and would have revalidated that information. Since it's NFS, the kernel by specification has to trust the mtime value to detect changes, as it is talking to a server and other clients can affect those directories and files without this client knowing about it.

If you are changing a directory but setting its mtime back to what it was, the NFS client cannot detect such change.

I would suggest that you add a "touch /opt/aus2/incoming/3" after doing the rsync, so that the mtime of the directory is updated and the NFS clients can correctly revalidate their cached information.

----------------------------------------------------------------------------
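
To make the redhat explanation concrete, here is a small local sketch (no NFS involved) of the effect described: a directory gains a file, its mtime is put back to the old value, and an mtime comparison then reports no change until the directory is touched. The temp paths are stand-ins for /opt/aus2/incoming/3, and `touch -r` only imitates what preserve-times does to the directory.

```shell
#!/bin/bash
# Local demonstration of "change hidden by a restored mtime".
work=$(mktemp -d)
live="$work/incoming"                 # stand-in for /opt/aus2/incoming/3
mkdir "$live"
touch -t 200805301626.52 "$live"      # give the directory an old mtime
ref="$work/ref"
touch -r "$live" "$ref"               # the mtime a client would have cached

echo snippet > "$live/complete.txt"   # directory really changes here
touch -r "$ref" "$live"               # imitate preserve-times: old mtime restored

# Compare by mtime only, which is all an NFS client has to go on.
if [ "$live" -nt "$ref" ]; then before=changed; else before=unchanged; fi

touch "$live"                         # the suggested fix
if [ "$live" -nt "$ref" ]; then after=changed; else after=unchanged; fi

echo "before touch: $before, after touch: $after"
rm -rf "$work"
```

This only shows the mtime bookkeeping; whether the NFS clients revalidate as expected still has to be checked on the webheads.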

Can you make that change, so I can close this bug?
Sure, we can give that a go.
Created attachment 333244 [details] [diff] [review]
[checked in] touch LIVE_SNIPPET_DIR to try and fix the propagation problem
Attachment #333244 - Flags: review?(nthomas)
(Assignee)

Updated

9 years ago
Attachment #333244 - Flags: review?(nthomas) → review+
Comment on attachment 333244 [details] [diff] [review]
[checked in] touch LIVE_SNIPPET_DIR to try and fix the propagation problem

Checking in pushsnip;
/cvsroot/mozilla/tools/release/bin/pushsnip,v  <--  pushsnip
new revision: 1.2; previous revision: 1.1
done



Okay, next time we do a release we'll see how this goes.
Attachment #333244 - Attachment description: touch LIVE_SNIIPET_DIR to try and fix the propagation problem → [checked in] touch LIVE_SNIPPET_DIR to try and fix the propagation problem

Updated

9 years ago
Whiteboard: Waiting for redhat fix. → Waiting on a release to test.
(Assignee)

Comment 16

9 years ago
I updated pushsnip for bug 394046, but I'm not sure rolling out the major update test verifies a fix here. IIRC we've had no problem with new snippets, just with replacing old ones. 
Should we keep this open until the next release, or assume it's fixed, close it, and then reopen or verify on the next release?
May as well resolve fixed - it seems pretty likely this will fix the issue.
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
(Assignee)

Updated

9 years ago
Depends on: 455965
(Assignee)

Comment 19

9 years ago
Alas, it didn't. The Fx2.0.0.17 and Fx3.0.2 snippets that got pushed are not showing up consistently, indicating that some of the machines behind AUS haven't picked up the new snippets.

Clock is right on aus2-staging (bug 455965), and the top level is getting touched correctly:
  $ ls -la /opt/aus2/incoming/3
  total 24
  drwxrwxr-x  5 cltbld cltbld 4096 Sep 23 16:32 .
  drwxr-xr-x  6 cltbld cltbld 4096 May 23  2007 ..
  drwxrwxr-x 55 cltbld cltbld 8192 Aug 29 18:18 Firefox

To resolve the blocker part of this, please remount the snippet store. Then we can debug this again.
Severity: major → blocker
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
fyi: in case it helps with debugging, the FF3.0.2 snippets were pushed at 4.29pm, and the FF2.0.0.17 snippets were pushed at 4.33pm.
From quick chat with oremj, the theory is that the shared file system behind AUS is confused by the older timestamps of the files generated in the release process.

For next release, do rsync/pushsnip as usual, and if we hit this problem:
- go into the directories and "touch" each file. 
- wait 4 minutes to allow for the 2-minute timeout scattered across machines with slightly different clocks.
- see if newly touched snippets are now visible.
- if snippets visible, this proves the theory. :-)
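
The "touch each file" step above could be sketched like this. The function name is made up for illustration, and it only handles the touch part; the wait and the visibility check stay manual.

```shell
#!/bin/bash
# Give every regular file under a snippet directory a fresh mtime.
# touch_snippet_files is a hypothetical helper, not part of pushsnip.
touch_snippet_files() {
    find "$1" -type f -exec touch {} +
}
```

Usage would be something like `touch_snippet_files <snippet dir>` followed by `sleep 240` before re-checking the update URL.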
(Assignee)

Comment 22

9 years ago
Alright, I'll try that when we ship tb2.0.0.17 on thursday.
Severity: blocker → major
(Assignee)

Comment 23

9 years ago
We did hit this with Tb2.0.0.17, so I touched 
  /opt/aus2/incoming/3/Thunderbird/2.0.0.16/WINNT_x86-msvc/2008070808/en-US/release/*
waited 5 minutes, but still got intermittent empty-updates at 
  https://aus2.mozilla.org/update/1/Thunderbird/2.0.0.16/2008070808/WINNT_x86-msvc/en-US/release/update.xml

oremj reports that touching every intermediate dir from /opt/aus2/incoming/3 down to the snippets gets all the AUS heads working. I'll investigate whether we can avoid preserving times on the push.
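
A possible way to script the "touch every intermediate dir" workaround, walking from the snippet directory back up to the top-level directory. The function name is hypothetical and this is a sketch of the idea, not what oremj actually ran.

```shell
#!/bin/bash
# Touch every directory on the path from a leaf directory up to and
# including a given root. touch_dir_chain is a hypothetical helper.
touch_dir_chain() {
    local root=${1%/} dir=${2%/}
    # Refuse to walk upwards if the leaf is not inside the root.
    case "$dir/" in
        "$root"/*) ;;
        *) echo "error: $dir is not under $root" >&2; return 1 ;;
    esac
    while :; do
        touch "$dir"
        [ "$dir" = "$root" ] && break
        dir=$(dirname "$dir")
    done
}
```

Called as `touch_dir_chain /opt/aus2/incoming/3 <snippet dir>`, it touches the snippet directory first and the top-level directory last.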
Assignee: oremj → nthomas
Happened again, while testing snippets pushed to beta channel for FF3.0.3. I touched the updates which were reported by QA to not appear, and then QA was able to see them.
(Assignee)

Updated

9 years ago
Priority: -- → P2
(Assignee)

Comment 25

9 years ago
Created attachment 345046 [details] [diff] [review]
[checked in] Exclude directories from rsync's preserve times

From comment #12 and comment #23, we should let the OS update the modification timestamps on the directories, so that NFS knows when to invalidate its cache. This patch uses rsync's -O option while keeping -t for the files (via -a). From the man page on aus2-staging:
          -O, --omit-dir-times        omit directories when preserving times

If this doesn't work, then I've got a patch that rsyncs a copy of the snippets without preserving time, then publishes the copy.
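
Roughly, the changed push looks like the sketch below. The RSYNC variable and the source/destination arguments mirror the pushsnip line quoted earlier in this bug; the function wrapper is only for illustration and is not the literal patch.

```shell
#!/bin/bash
# Sketch of the push with directory times omitted (not the actual pushsnip).
RSYNC=${RSYNC:-/usr/bin/rsync}

# push_snippets is a hypothetical wrapper: $1 = staging dir, $2 = live dir.
push_snippets() {
    local src=$1 dest=$2
    # -a still implies -t for files; -O (--omit-dir-times) leaves directory
    # mtimes to the OS, so a directory that gains entries gets a new mtime.
    "$RSYNC" -Pa -O "$src/" "$dest"
}
```

Files keep their original timestamps this way; only directories stop having theirs reset.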
Attachment #345046 - Flags: review?(bhearsum)
(Assignee)

Comment 26

9 years ago
Comment on attachment 345046 [details] [diff] [review]
[checked in] Exclude directories from rsync's preserve times

Oh, this also removes the pushd & popd, which have no effect when the rsync and touch use absolute paths.
(Assignee)

Comment 27

9 years ago
(In reply to comment #26)
Better phrasing would be ".. have no effect _because_ the rsync and touch use absolute paths".
Whiteboard: Waiting on a release to test.
Attachment #345046 - Flags: review?(bhearsum) → review+
(Assignee)

Comment 28

9 years ago
Comment on attachment 345046 [details] [diff] [review]
[checked in] Exclude directories from rsync's preserve times

Checking in pushsnip;
/cvsroot/mozilla/tools/release/bin/pushsnip,v  <--  pushsnip
new revision: 1.3; previous revision: 1.2
done

And cvs up'd on aus2-staging.
Attachment #345046 - Attachment description: Exclude directories from rsync's preserve times → [checked in] Exclude directories from rsync's preserve times
(Assignee)

Comment 29

9 years ago
Might as well bring this over to RelEng.
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Whiteboard: Waiting on a release to test
(Assignee)

Comment 30

9 years ago
Assuming this is fixed until I hear otherwise from QA.
Status: REOPENED → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
(Assignee)

Comment 31

9 years ago
Created attachment 347908 [details] [diff] [review]
Update timestamps for everything before pushing

This is the brute force solution I mentioned in comment #25. It's had some testing but could do with some more.
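
For anyone reading along without the attachment, the general shape of the brute-force idea is sketched below, under the assumption that it works as described in comment #25 (copy without preserving times, then publish the copy). The function name and the scratch directory are illustrative; this is not the attached patch.

```shell
#!/bin/bash
# Sketch only: refresh all timestamps via a scratch copy, then publish.
RSYNC=${RSYNC:-/usr/bin/rsync}

# push_with_fresh_times is a hypothetical helper: $1 = staging, $2 = live.
push_with_fresh_times() {
    local src=$1 dest=$2 scratch status=0
    scratch=$(mktemp -d) || return 1
    # -r without -t: files and directories in the copy get the current time.
    "$RSYNC" -r "$src/" "$scratch/" || status=$?
    if [ "$status" -eq 0 ]; then
        # -a preserves the scratch copy's (fresh) times on the live directory.
        "$RSYNC" -a "$scratch/" "$dest" || status=$?
    fi
    rm -rf "$scratch"
    return "$status"
}
```

Note that the first copy uses plain -r, so symlinks, ownership and exact permissions are not carried over; that seemed acceptable for a sketch but would need checking for real snippets.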
(Assignee)

Updated

9 years ago
Blocks: 465497
Product: mozilla.org → Release Engineering