Closed
Bug 561442
Opened 14 years ago
Closed 13 years ago
rev3 minis running linux lose hardware/efi clock setting
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jabba, Assigned: dustin)
References
Details
(Whiteboard: [buildslaves][slaveduty])
Attachments
(4 files, 1 obsolete file)
1.51 KB, patch (armenzg: review+)
858 bytes, patch (armenzg: review+)
4.47 KB, patch (armenzg: review+)
2.25 KB, patch (bhearsum: review+)
I've seen this happen on a couple of the rev3 minis already, and it looks like more and more are showing the same issue. The hardware clock is somehow getting reset to January 1st, 2001. When the machine then tries to boot, mounting the root file system fails with an error that the last mount time is too far in the future. The fix is to boot the mini with rEFIt and use the EFI shell to set the date and time. However, as this is starting to show up quite a bit, we should investigate the root cause.
This could be happening on OS X and Windows machines as well, but we wouldn't know, since those OSs don't fail to boot because of a wrong date, and I'm guessing that as soon as they are up there is a time server configured.
Can you investigate to see if there is something in the image that is attempting to synchronize the hardware clock on shutdown but is perhaps broken such that it defaults to 2001? It could potentially also be hardware related, so punt the bug back to server ops if nothing wrong can be found with the image.
Also, we should track which minis this is happening to in this bug, so that if we do end up having to contact Apple about it, we'll be able to provide serial numbers, etc.
Comment 1•14 years ago
Affected hostnames:
talos-r3-fedt64-014
talos-r3-fedt64-017
talos-r3-fedt64-018
Will post more if I encounter more.
Comment 2•14 years ago
I reset the clock on 014 using "ntpdate ntp1.build.mozilla.org".
Comment 3•14 years ago
Comments only mention rev3 fedora64 minis. Is this happening on rev3 fedora32 minis also? (raising priority as this is causing machines to fail to boot)
Severity: normal → major
OS: Mac OS X → Linux
Reporter
Comment 4•14 years ago
To date, all I can remember is it is happening on fed64 rev3 minis. That is why I think there might be something flaky in the image. Theoretically, it could be hardware related and affecting that entire batch. I think if that were the case it would be showing up in the 32bit minis as well. Don't some distros of Linux synchronize the hardware clock during the shutdown process? Could it be that for some reason these aren't doing that? Just my thoughts.
Comment 5•14 years ago
fed64-019
Also, not sure if this is related, but I have encountered this more than once (and I am unsure whether it is strictly within the fed32 machines):
fed-033: Drive failed to mount, no reason specified
Comment 6•14 years ago
See also bug 557692. Do we need both?
Comment 7•14 years ago
talos-r3-fed-037
Can't mount root filesystem
mount: you must specify the filesystem type
I will make sure to list any other rev3 Fedora 32-bit machines that encounter this same problem.
Reporter
Comment 8•14 years ago
Justin Lazaro, I'm wondering if this is the same error as the time issue. Perhaps you can boot up a liveCD and mount the volume to see if something is corrupt on the FS or if /etc/fstab has become unreadable or corrupt?
Comment 9•14 years ago
I'll give it a shot next time. I re-imaged these machines as a short-term fix, but will debug the next occurrence.
Updated•14 years ago
Whiteboard: [buildslaves]
Comment 10•14 years ago
fed64-015 fed64-016
Reporter
Comment 12•14 years ago
I think we can safely say at this point, that the issue only happens on fed64 minis. Perhaps RelEng can figure out a way to force fedora64 to sync the clock before it reboots and then add whatever patch works to the ref image? I think that is kind of the direction this bug is supposed to go now that we've pinned it down to 64-bit images only.
Comment 13•14 years ago
There are now another 30 new Fedora 64-bit machines (bug 564217) in the pool, and some of them might have the same problem or be losing the right time (bug 566333).
Comment 14•14 years ago
talos-r3-fed-027
Comment 15•14 years ago
talos-r3-fed-040
Comment 16•14 years ago
hwclock --set --date="mm/dd/yy hh:mm:ss"
This is to set the hardware clock manually.
Could this be added into the automation process so that the clock syncs before the reboot, to avoid the mounting issue?
Comment 17•14 years ago
Is there a long-term fix in the works? The reboots for bug 585200 are mostly fed/fed64 machines with these issues
Severity: major → critical
Comment 18•14 years ago
Do these machines use rEFIt all the time, or only to correct this issue?
Reporter
Comment 19•14 years ago
Only to correct this issue. If releng can add the line from comment 16 to the reboot script via puppet, I think this would be resolved.
Comment 20•14 years ago
This patch will run that command before rebooting the machine. I haven't tested this patch; it is just an idea.
Reporter
Comment 21•14 years ago
Can you test the patch on some minis and, if it has no ill effects, roll it out? With minis now in Santa Clara, it is more time consuming to have to constantly re-image or fix the Fedora machines every time they lose their clock. Almost all of the talos reboots lately are due to this exact issue, and it is random and unpredictable. A fix sooner rather than later would help both IT and RelEng keep wait times low if too many Fedora machines experience this at the same time.
Comment 22•14 years ago
This is going to require the following line to be added to sudoers. Not sure if we manage sudoers through puppet.
cltbld ALL=NOPASSWD: /usr/sbin/hwclock
and also change the patch to use something like
command=['bash', '-c', 'sudo /usr/sbin/hwclock --set --date="$(date +%m/%d/%y\ %H:%M:%S)"'],
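For illustration, the command proposed in this comment could be assembled as below. This is only a sketch: the helper name `hwclock_step_command` and its `use_sudo` parameter are hypothetical and are not part of buildbotcustom. The date expression is kept inside the `bash -c` string so that it is expanded on the slave when the step runs, rather than on the master when the configuration is loaded.

```python
def hwclock_step_command(use_sudo=True):
    """Build the argv for a 'set time' step (hypothetical helper).

    The $(date ...) expression is left for bash to expand on the slave,
    so the hardware clock receives the slave's current system time.
    """
    # Raw string: the backslash-space must reach bash unchanged.
    date_expr = r'$(date +%m/%d/%y\ %H:%M:%S)'
    prefix = 'sudo ' if use_sudo else ''
    shell_line = '%s/usr/sbin/hwclock --set --date="%s"' % (prefix, date_expr)
    return ['bash', '-c', shell_line]


if __name__ == '__main__':
    print(hwclock_step_command())
```

With `use_sudo=True` the result matches the argv proposed above, and it still relies on the sudoers entry for cltbld being in place.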
Comment 23•14 years ago
(In reply to comment #16)
> hwclock --set --date="mm/dd/yy hh:mm:ss"
>
> This is to set the hardware clock manually
>
> Could this be added into the automation process so that the clock syncs before
> the reboot, to avoid the mounting issue?
Just to confirm, this is run in the Fedora OS and not the rEFIt command prompt?
Comment 24•14 years ago
Yes, should be run in Fedora
Comment 25•14 years ago
Comment on attachment 466415
buildbotcustom patch (idea)
>@@ -6560,6 +6568,14 @@ class TalosFactory(BuildFactory):
> )
>
> def addRebootStep(self):
>+ if 'linux' in self.platform:
should probably be 'fed' in self.OS
>+ self.addStep(ShellCommand(
>+ name='set time',
>+ description='set time on linux slaves before reboot',
>+ alwaysRun=True,
>+ command=['bash', '-c',
>+ 'hwclock --set --date="$(date +%m/%d/%y\ %H:%M:%S)"'],
>+ ))
> def do_disconnect(cmd):
> try:
> if 'SCHEDULED REBOOT' in cmd.logs['stdio'].getText():
Comment 26•14 years ago
Found during triage. zandr/jabba: would this fix the problem with fedora minis getting messed up when their clocks get out of sync? jhford: is this patch ready for review?
Reporter
Comment 27•14 years ago
Yes, that is correct, at least in theory. Alternatively, if the mount command can be convinced not to care about dates in the future, that would be another solution.
Comment 28•14 years ago
(In reply to comment #26)
> Found during triage.
>
> zandr/jabba: would this fix the problem with fedora minis getting messed up
> when their clocks get out of sync?
>
> jhford: is this patch ready for review?
This patch hasn't been run through testing, but should be quick to stage. As per comment 16, this should fix the problem with the date/time going wonky, which should in turn fix the problem with the machines dying. I will unbitrot this patch and run it through staging.
Comment 29•14 years ago
Tested this on a unit test run. Will try it on talos as well shortly, before asking for review.
Comment 30•14 years ago
Buildbot stdio log:
bash -c sudo hwclock --set --date="$(date +%m/%d/%y\ %H:%M:%S)"
 in dir /home/cltbld/talos-slave/test/build (timeout 1200 secs)
 watching logfiles {}
 argv: ['bash', '-c', 'sudo hwclock --set --date="$(date +%m/%d/%y\\ %H:%M:%S)"']
 environment: <snip>
 closing stdin
 using PTY: True
program finished with exit code 0
elapsedTime=0.536451
Comment 31•14 years ago
Also run on talos:
bash -c sudo hwclock --set --date="$(date +%m/%d/%y\ %H:%M:%S)"
 in dir /home/cltbld/talos-slave/test/build (timeout 1200 secs)
 watching logfiles {}
 argv: ['bash', '-c', 'sudo hwclock --set --date="$(date +%m/%d/%y\\ %H:%M:%S)"']
 environment: <snip>
 closing stdin
 using PTY: True
program finished with exit code 0
elapsedTime=1.517787
Updated•14 years ago
Attachment #500829 -
Attachment description: tested for unit tests → tested patch
Comment 32•14 years ago
Comment on attachment 500829
tested patch
There is a slight issue with this approach. If the clock is already set to January 1, 2001 when this command is run, it will not change anything and will not prevent issues rebooting. This is not a major issue, because it is asserted that if the clock is set to Jan 1, 2001 the machine will refuse to boot and will be unable to reset this incorrect time. In the worst-case scenario, where the hwclock is set manually to Jan 1, 2001, nothing will have substantively changed. This patch is ready for review.
Attachment #500829 -
Flags: review?
Comment 33•14 years ago
(In reply to comment #32)
> there is a slight issue with this approach. If the time of the clock is set to
> January 1, 2001 and this command is run, it will not change anything and will
> not prevent issues rebooting.
Are the machines running NTP normally, or setting the clock using ntpdate at boot? If so, I don't see this as a problem.
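The concern in comment 32 is that copying an already wrong system time into the hardware clock achieves nothing. One way to make that case explicit is a plausibility check before the write. The sketch below is an illustration only and is not in the patch: `SANITY_FLOOR` and `system_time_is_plausible` are hypothetical names, and the floor date is an arbitrary assumption.

```python
import datetime

# Assumption: any date before this is treated as a reset clock.
# The value is arbitrary; it only needs to be later than the
# January 1, 2001 default and earlier than "now" at deploy time.
SANITY_FLOOR = datetime.datetime(2010, 1, 1)


def system_time_is_plausible(now=None, floor=SANITY_FLOOR):
    """Return True if the system time looks sane enough to copy
    into the hardware clock (hypothetical helper)."""
    if now is None:
        now = datetime.datetime.now()
    return now >= floor


if __name__ == '__main__':
    print(system_time_is_plausible())
```

If the machines do keep system time correct through NTP, as comment 33 asks, the check would pass in normal operation and would only skip the write on a slave whose system time has itself fallen back to 2001.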
Updated•14 years ago
Attachment #500829 -
Flags: review? → review?(armenzg)
Comment 34•14 years ago
Comment on attachment 500829
tested patch
The patch looks good.
jhford is going to test the command on a fedora64 machine to make sure it works there as well.
Attachment #500829 -
Flags: review?(armenzg) → review+
Comment 35•14 years ago
(In reply to comment #34)
> Comment on attachment 500829
> tested patch
>
> The patch looks good.
> jhford is going to test the command for a fedora64 machine to make sure it
> works there as well.
Works on Fedora64
Flags: needs-reconfig?
Comment 36•14 years ago
bash -c sudo hwclock --set --date="$(date +%m/%d/%y\ %H:%M:%S)"
 in dir /home/cltbld/talos-slave/test/build (timeout 1200 secs)
 watching logfiles {}
 argv: ['bash', '-c', 'sudo hwclock --set --date="$(date +%m/%d/%y\\ %H:%M:%S)"']
 environment: <snip>
 HOSTNAME=talos-r3-fed-045.build.mozilla.org
 <snip>
 closing stdin
 using PTY: True
program finished with exit code 0
elapsedTime=0.525749
http://hg.mozilla.org/build/buildbotcustom/rev/2f9f1a47b238
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 37•14 years ago
Did you consider adding the hwclock call to count_and_reboot.py? Only one place to maintain it there.
Updated•14 years ago
Flags: needs-reconfig?
Comment 38•14 years ago
(In reply to comment #37)
> Did you consider adding the hwclock call to count_and_reboot.py ? Only one
> place to maintain it there.
No, I didn't think about putting it there. That script is run on all of our systems and, beyond looking at a file on the filesystem, the script wouldn't know if it was on a Fedora or CentOS machine based on os.name. If we had a consolidated class hierarchy for all test runs, we would only have one place to maintain this code.
Comment 39•14 years ago
Causing burning on 1.9.2 at least:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1294696296.1294696982.17907.gz
Did the sudoers file get updated on build and test boxen?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 40•14 years ago
(In reply to comment #38) Turns out we need to handle build machines because tests run there on 1.9.1 and 1.9.2 anyway.
Comment 41•14 years ago
I added a message pointing to this bug in the status message for the Firefox3.6 and Firefox3.5 trees, since someone actually asked before just pushing into the permared. Wonders never cease. Please remove it when this is fixed, since I probably won't remember.
Comment 42•14 years ago
(In reply to comment #40)
I misspoke slightly here. We want to run hwclock on talos-r3-fed* but not on the compilation slaves that do tests on 1.9.1 and 1.9.2. How about a hostname check in the reboot script?
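A hostname check of the kind suggested here might look like the following. It is a sketch under assumptions: `should_reset_hwclock` is a hypothetical function, not something that exists in count_and_reboot.py, and the only pattern used is the `talos-r3-fed*` glob mentioned in this comment.

```python
import fnmatch
import socket


def should_reset_hwclock(hostname=None, patterns=('talos-r3-fed*',)):
    """Return True if this host is one of the Fedora minis that should
    have its hardware clock set before reboot (hypothetical helper)."""
    if hostname is None:
        hostname = socket.gethostname()
    # Compare only the short name, so both 'talos-r3-fed-045' and
    # 'talos-r3-fed-045.build.mozilla.org' are handled the same way.
    short_name = hostname.split('.', 1)[0]
    return any(fnmatch.fnmatchcase(short_name, p) for p in patterns)


if __name__ == '__main__':
    print(should_reset_hwclock())
```

Because `talos-r3-fed*` also matches `talos-r3-fed64-*`, a single pattern covers both the 32-bit and 64-bit Fedora minis while leaving other slaves untouched.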
Comment 43•14 years ago
I am testing a patch that will allow us to selectively enable the hwclock reset from buildbot-configs.
Comment 44•14 years ago
add required configuration entries
Comment 45•14 years ago
buildbotcustom patch to allow enabling and disabling of the hwclock steps. This patch assumes that no build master is driving minis, so on build masters the default is not to set the hwclock.
Updated•14 years ago
Attachment #503201 -
Flags: review?(armenzg)
Updated•14 years ago
Attachment #503202 -
Flags: review?(armenzg)
Comment 46•14 years ago
Comment on attachment 503202
buildbotcustom
This patch is well done.
I would like you to address a couple of nits and/or answer the questions.
r+ with the nits/questions.
>diff --git a/misc.py b/misc.py
>--- a/misc.py
>+++ b/misc.py
>@@ -1032,7 +1034,7 @@ def generateBranchObjects(config, name):
> config, name, platform, "%s debug test" % base_name,
> "%s-%s-unittest" % (name, platform),
> suites_name, suites, mochitestLeakThreshold,
>- crashtestLeakThreshold))
>+ crashtestLeakThreshold, resetHwClock=False))
> continue
>
> if config['enable_nightly']:
Do you need to add this if the default in the factory is already False?
I ran it without it through ./test-master.sh -8 and it seemed to be fine.
>@@ -2414,13 +2416,14 @@ def generateTalosBranchObjects(branch, b
> triggeredUnittestBuilders.append(('tests-%s-%s-%s-unittest' % (branch, slave_platform, test_type), test_builders, merge_tests))
>
> for suites_name, suites in branch_config['platforms'][platform][slave_platform][unittest_suites]:
>+ resetHwClock = branch_config['platforms'][platform][slave_platform].get('reset_hw_clock', False)
> # create the builders
> branchObjects['builders'].extend(generateTestBuilder(
> branch_config, branch, platform, "%s %s %s test" % (platform_name, branch, test_type),
> "%s_%s_test" % (branch, slave_platform_name),
> suites_name, suites, branch_config.get('mochitest_leak_threshold', None),
> branch_config.get('crashtest_leak_threshold', None),
>- platform_config[slave_platform]['slaves']))
>+ platform_config[slave_platform]['slaves'], resetHwClock=resetHwClock))
>
> for scheduler_name, test_builders, merge in triggeredUnittestBuilders:
> for test in test_builders:
If you are not reusing resetHwClock, do it like this:
>- platform_config[slave_platform]['slaves']))
>+ platform_config[slave_platform]['slaves'],
>+ resetHwClock = branch_config['platforms'][platform][slave_platform].get('reset_hw_clock', False))
Attachment #503202 -
Flags: review?(armenzg) → review+
Updated•14 years ago
Attachment #503201 -
Flags: review?(armenzg) → review+
Comment 47•14 years ago
Comment on attachment 503202
buildbotcustom
>diff --git a/misc.py b/misc.py
>--- a/misc.py
>+++ b/misc.py
>@@ -6853,7 +6854,7 @@ class MozillaTestFactory(MozillaBuildFac
> if self.buildsBeforeReboot and self.buildsBeforeReboot > 0:
> #This step is to deal with minis running linux that don't reboot properly
> #see bug561442
>- if 'linux' in self.platform:
>+ if self.resetHwClock and 'linux' in self.platform:
> self.addStep(ShellCommand(
> name='set_time',
> description=['set', 'time'],
Can you add a comment that we set self.resetHwClock to True only for the Fedora/minis?
In fact, would we need this "and 'linux' in self.platform" anymore?
Comment 48•14 years ago
(In reply to comment #46)
> Comment on attachment 503202
> buildbotcustom
>
> This patch is well done.
> I would like you to address a couple of nits and/or answer the questions.
>
> r+ with the nits/questions.
> >+ crashtestLeakThreshold, resetHwClock=False))
> Do you need to add this if the default in the factory is already False?
> I run it without it through ./test-master.sh -8 and it seemed to be fine.
I don't need to have it here, but I include it to be explicit. I don't have a strong preference and am ok to remove it.
> > for suites_name, suites in branch_config['platforms'][platform][slave_platform][unittest_suites]:
> >+ resetHwClock = branch_config['platforms'][platform][slave_platform].get('reset_hw_clock', False)
> >- platform_config[slave_platform]['slaves']))
> >+ platform_config[slave_platform]['slaves'], resetHwClock=resetHwClock))
> >
> > for scheduler_name, test_builders, merge in triggeredUnittestBuilders:
> > for test in test_builders:
>
> If you are not reusing resetHwClock do it like this:
> >- platform_config[slave_platform]['slaves']))
> >+ platform_config[slave_platform]['slaves'],
> >+ resetHwClock = branch_config['platforms'][platform][slave_platform].get('reset_hw_clock', False))
While I am not reusing the variable, I intentionally had it on a second line to make the call more legible. I can certainly move the dictionary lookup to the call.
> >- if 'linux' in self.platform:
> >+ if self.resetHwClock and 'linux' in self.platform:
> Can you add a comment that we set self.resetHwClock to True only for the
> Fedora/minis?
> In fact, would we need this "and 'linux' in self.platform" anymore?
It will not break if we remove the check for the linux platform, but those steps are only valid for linux machines, so the extra check is safety padding.
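The opt-in behaviour discussed in this review can be summarised in a small standalone sketch. The `reset_hw_clock` key, the `.get(..., False)` default and the `'linux' in platform` guard come from the quoted patch; the function names and the example configuration below are hypothetical and exist only to show how the pieces combine.

```python
def reset_hw_clock_enabled(branch_config, platform, slave_platform):
    """Look up the opt-in flag; slave platforms without the key
    default to False, as in the quoted patch."""
    slave_config = branch_config['platforms'][platform][slave_platform]
    return slave_config.get('reset_hw_clock', False)


def should_add_set_time_step(reset_hw_clock, platform):
    """Mirror the guard from the patch: the flag must be set AND the
    platform must be a linux one (the 'safety padding')."""
    return bool(reset_hw_clock) and 'linux' in platform


# Hypothetical config fragment, shaped like the lookups in the patch.
EXAMPLE_BRANCH_CONFIG = {
    'platforms': {
        'linux64': {'fedora64': {'reset_hw_clock': True}},
        'macosx': {'leopard': {}},
    },
}

if __name__ == '__main__':
    flag = reset_hw_clock_enabled(EXAMPLE_BRANCH_CONFIG, 'linux64', 'fedora64')
    print(should_add_set_time_step(flag, 'linux64'))
```

Keeping both conditions means a flag that is enabled by mistake on a non-linux platform still does not add the hwclock step, which is the "safety padding" described in the reply.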
Comment 50•13 years ago
Verified that this is fixed.
Status: REOPENED → RESOLVED
Closed: 14 years ago → 13 years ago
Resolution: --- → FIXED
Comment 51•13 years ago
FIXED might be a bit premature. We've still had lots of fed and fed64 boxes fall over since this was first deployed (Jan 6th). Let's see what state zandr finds when he deals with the latest batch of machines that fell over.
Reporter
Comment 52•13 years ago
Yeah, if this fix didn't work, then the problem is likely more severe and the next thing to try is seeing if there is a mount option that ignores dates, and push that out.
Comment 53•13 years ago
I'd rather treat the underlying disease than the symptom, but that may be lost down in the interaction between Linux and the Mac hardware.
Updated•13 years ago
Flags: needs-reconfig?
Comment 54•13 years ago
http://forums.debian.net/viewtopic.php?f=10&t=45797 describes the options available to us. We're running e2fsprogs 1.41.9 on the talos slaves. 1.41.10 has a config option that will let us fix this, though it's challenging to deploy (the config file needs to be on the initrd). I think that the ugly config options described in http://forums.debian.net/viewtopic.php?f=10&t=45797#p261964 will work with 1.41.9, but someone will have to try it to find out.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 55•13 years ago
Throwing this back into the pool as this is new work
Assignee: jhford → nobody
Severity: critical → major
Priority: P5 → --
Assignee
Comment 56•13 years ago
Moving to Server Ops: RelEng in hopes of implementing comment 54.
Assignee: nobody → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Whiteboard: [buildslaves] → [buildslaves][slaveduty]
Comment 57•13 years ago
This package was updated in Fedora 13, and we should be able to rebuild http://mirror02.ipgn.com.au/fedora/linux/releases/13/Everything/source/SRPMS/e2fsprogs-1.41.10-6.fc13.src.rpm on a Fedora 12 box.
Assignee
Comment 58•13 years ago
I'm going to try to build this using the R2 mini I have at home, and a basic (non-Mozilla) install of F12 on it. If I get a working, updated initrd, I'll hand this back to someone onsite to try out on a Mozilla system.
Assignee: server-ops-releng → dustin
Assignee
Comment 59•13 years ago
Hm, I downloaded the i386 F12 netinst CD, and it doesn't boot on this r2 mini. Tips on the quickest way from zero to running F12 on an r2 mini? Clonezilla? Different install disk? Ship one to me? They are small, after all.
Comment 60•13 years ago
You need to do a DVD install, followed by a rEFIt partition sync, followed by a grub fix from the Fedora sysrescue image.
Assignee
Comment 61•13 years ago
I installed the F12 image with clonezilla, and I can't get it to boot. I can't tell how much of that is due to the bad USB ports, and how much is due to installing an image designed for r3 minis on an r2 mini. I'm in conversation with zandr as to how best to work around this.
Assignee
Comment 62•13 years ago
Well, we put a 1U remote KVM on it, but that doesn't include power control, so it's not particularly practical. Zandr's going to mail me an r3 mini to work on, and then I'll mail it back when that's done.
Status: REOPENED → ASSIGNED
Comment 63•13 years ago
Shipped today for Tuesday delivery.
Assignee
Comment 64•13 years ago
Received. I'll get to work on this ASAP.
Assignee
Comment 65•13 years ago
This was amazingly easy :)
Attachment #535171 -
Flags: review?(bhearsum)
Comment 66•13 years ago
Comment on attachment 535171
m561442-puppet-manfiests-p1-r1.patch
Hmm, you include the boot class in the masters, CentOS, and Fedora, but inside of it you only do anything for Fedora. Is that intentional?
Assignee
Comment 67•13 years ago
Yes, they all need to boot! It just happens that everything but CentOS does it without assistance from puppet :) I put it everywhere 'include network' appeared. It occurs to me that should appear for mac hosts, too, but I'm not going to fix that here.
Updated•13 years ago
Attachment #535171 -
Flags: review?(bhearsum) → review+
Assignee
Comment 68•13 years ago
Committed and deployed. I'm calling this done until I hear otherwise during a colo trip.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Comment 69•13 years ago
I'll let you know if I see any date problems on machines not already on the reboots bug.
Comment 70•13 years ago
(In reply to comment #65)
> Created attachment 535171
> m561442-puppet-manfiests-p1-r1.patch
>
> This was amazingly easy :)
:-D
(In reply to comment #69)
> I'll let you know if I see any date problems on machines not already on the
> reboots bug.
Please do. Hopefully this means fewer linux test machine headaches. Fingers crossed.
Comment 71•13 years ago
It's only been a week, but so far, so good.
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations