Closed Bug 561442 Opened 14 years ago Closed 13 years ago

rev3 minis running linux lose hardware/efi clock setting

Categories: Infrastructure & Operations :: RelOps: General, task
Hardware: x86
OS: Linux
Type: task
Priority: Not set
Severity: major

Tracking: Not tracked
Status: RESOLVED FIXED

People: Reporter: jabba, Assigned: dustin
Whiteboard: [buildslaves][slaveduty]
Attachments: 4 files, 1 obsolete file

I've seen this happen on a couple of the rev3 minis already and it looks like more and more are showing the same issues. The hardware clock is somehow getting reset to January 1st, 2001. Once the machine tries to boot, upon mounting the root file system there is an error that the last mount time is too far in the future.

The fix is to boot the mini with rEFIt and use the EFI shell to set the date and time.

However as this is starting to show up quite a bit, we should investigate the root cause.

This could be happening on OSX and Windows machines as well, but we wouldn't know, since those OSs don't fail to boot because of a wrong date and I'm guessing that as soon as they are up there is a time server configured.

Can you investigate to see if there is something in the image that is attempting to synchronize the hardware clock on shutdown but is perhaps broken such that it defaults to 2001? It could potentially also be hardware related, so punt the bug back to server ops if nothing wrong can be found with the image. Also, we should track which minis this is happening to in this bug, so that if we do end up having to contact Apple about it, we'll be able to provide serial numbers, etc.
Affected HostNames:

talos-r3-fedt64-014
talos-r3-fedt64-017
talos-r3-fedt64-018

Will post more if I encounter more.
I reset the clock on 014 using "ntpdate ntp1.build.mozilla.org".
Comments only mention rev3 fedora64 minis. Is this happening on rev3 fedora32 minis also?

(raising priority as this is causing machines to fail to boot)
Severity: normal → major
OS: Mac OS X → Linux
To date, all I can remember is it is happening on fed64 rev3 minis. That is why I think there might be something flaky in the image. Theoretically, it could be hardware related and affecting that entire batch. I think if that were the case it would be showing up in the 32bit minis as well. Don't some distros of Linux synchronize the hardware clock during the shutdown process? Could it be that for some reason these aren't doing that? Just my thoughts.
fed64-019

Also, not sure if this is related but I have encountered this more than once... also unsure if it is strictly within the fed32 machines but

fed-033

Drive failed to mount, no reason specified
See also bug 557692. Do we need both ?
talos-r3-fed-037

Can't mount root filesystem
mount: you must specify the filesystem type

I will make sure to list any other rev 3 fedora 32-bit machines that encounter this same problem
Justin Lazaro, I'm wondering if this is the same error as the time issue. Perhaps you can boot up a liveCD and mount the volume to see if something is corrupt on the FS or if /etc/fstab has become unreadable or corrupt?
I'll give it a shot next time. I re-imaged these machines as a short-term fix but will debug next time.
Whiteboard: [buildslaves]
fed64-015
fed64-016
This doesn't happen often AFAIK, lowering priority.
Priority: -- → P5
I think we can safely say at this point, that the issue only happens on fed64 minis. Perhaps RelEng can figure out a way to force fedora64 to sync the clock before it reboots and then add whatever patch works to the ref image? I think that is kind of the direction this bug is supposed to go now that we've pinned it down to 64-bit images only.
There are now another 30 new Fedora 64-bit machines (bug 564217) in the pool, and some of them might have the same problem or be losing the right time (bug 566333).
talos-r3-fed-027
talos-r3-fed-040
hwclock --set --date="mm/dd/yy hh:mm:ss"

This is to set the hardware clock manually

Could this be added into the automation process so that the clock syncs before the reboot, to avoid the mounting issue?
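A minimal sketch of how the command above could be assembled from the current system time, assuming Python is available on the slave. The helper name `hwclock_set_argv` is hypothetical and not part of any existing script; it only formats the `mm/dd/yy hh:mm:ss` string and does not run anything.

```python
from datetime import datetime


def hwclock_set_argv(now):
    """Return the argv that would set the hardware clock to `now`.

    `now` is a datetime; the date is formatted as mm/dd/yy hh:mm:ss,
    matching the form given in the comment above.
    """
    stamp = now.strftime("%m/%d/%y %H:%M:%S")
    return ["hwclock", "--set", "--date=" + stamp]


if __name__ == "__main__":
    # Only prints the argv; actually running hwclock needs root privileges.
    print(hwclock_set_argv(datetime.now()))
```

Passing the argv as a list avoids the shell quoting that a `bash -c` one-liner would need for the space between the date and the time.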
Is there a long-term fix in the works?  The reboots for bug 585200 are mostly fed/fed64 machines with these issues
Severity: major → critical
do these machines use refit all the time or only to correct this issue?
Only to correct this issue. If releng can add the line from comment 16 to the reboot script via puppet, I think this would be resolved.
Attached patch: buildbotcustom patch (idea) (obsolete)
this patch will run that command before rebooting the machine.

I haven't tested this patch, just an idea.
Can you test the patch on some minis and if it has no ill effects, then roll it out? 

With minis now in Santa Clara, it is more time consuming to have to constantly re-image or fix the fedora machines every time they lose their clock. 

Almost all of the talos reboots lately are due to this exact issue, and it is random and unpredictable, so a fix sooner rather than later would help both IT and RelEng keep wait times low if too many Fedora machines experience this at the same time.
This is going to require the following line to be added to sudoers.  Not sure if we manage sudoers through puppet.

cltbld ALL=NOPASSWD: /usr/sbin/hwclock

and also change the patch to use something like

  command=['bash', '-c',
    'sudo /usr/sbin/hwclock --set --date="$(date +%m/%d/%y\ %H:%M:%S)"'],
(In reply to comment #16)
> hwclock --set --date="mm/dd/yy hh:mm:ss"
> 
> This is to set the hardware clock manually
> 
> Could this be added into the automation process so that the clock syncs before
> the reboot, to avoid the mounting issue?

Just to confirm, this is run in the Fedora OS and not at the rEFIt command prompt?
Yes, should be run in Fedora
Comment on attachment 466415 [details] [diff] [review]
buildbotcustom patch (idea)

>@@ -6560,6 +6568,14 @@ class TalosFactory(BuildFactory):
>         )
> 
>     def addRebootStep(self):
>+        if 'linux' in self.platform:

should probably be 'fed' in self.OS

>+            self.addStep(ShellCommand(
>+                name='set time',
>+                description='set time on linux slaves before reboot',
>+                alwaysRun=True,
>+                command=['bash', '-c',
>+                         'hwclock --set --date="$(date +%m/%d/%y\ %H:%M:%S)"'],
>+            ))
>         def do_disconnect(cmd):
>             try:
>                 if 'SCHEDULED REBOOT' in cmd.logs['stdio'].getText():
Found during triage.

zandr/jabba: would this fix the problem with fedora minis getting messed up when their clocks get out of sync?

jhford: is this patch ready for review?
yes, that is correct. At least in theory. Alternatively if the mount command can be convinced not to care about dates in the future, then that would be another solution.
(In reply to comment #26)
> Found during triage.
> 
> zandr/jabba: would this fix the problem with fedora minis getting messed up
> when their clocks get out of sync?
> 
> jhford: is this patch ready for review?

This patch hasn't been run through testing, but should be quick to stage. As per comment 16, this should fix the problem with the date/time going wonky, which should in turn fix the problem with the machines dying. I will unbitrot this patch and run it through staging.
Attached patch: tested patch
tested this on a unit test run.  Will try on talos as well shortly before asking for review
Assignee: nobody → jhford
Attachment #466415 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Buildbot stdio log:

bash -c sudo hwclock --set --date="$(date +%m/%d/%y\ %H:%M:%S)"
 in dir /home/cltbld/talos-slave/test/build (timeout 1200 secs)
 watching logfiles {}
 argv: ['bash', '-c', 'sudo hwclock --set --date="$(date +%m/%d/%y\\ %H:%M:%S)"']
 environment:
<snip>
 closing stdin
 using PTY: True
program finished with exit code 0
elapsedTime=0.536451
Also run on talos

bash -c sudo hwclock --set --date="$(date +%m/%d/%y\ %H:%M:%S)"
 in dir /home/cltbld/talos-slave/test/build (timeout 1200 secs)
 watching logfiles {}
 argv: ['bash', '-c', 'sudo hwclock --set --date="$(date +%m/%d/%y\\ %H:%M:%S)"']
 environment:
<snip>
 closing stdin
 using PTY: True
program finished with exit code 0
elapsedTime=1.517787
Attachment #500829 - Attachment description: tested for unit tests → tested patch
Comment on attachment 500829 [details] [diff] [review]
tested patch

There is a slight issue with this approach. If the clock is already set to January 1, 2001 when this command is run, it will not change anything and will not prevent issues rebooting. This is not a major issue, because it is asserted that a machine whose clock is set to Jan 1, 2001 will refuse to boot and will be unable to reset this incorrect time. In the worst case, where the hwclock is set manually to Jan 1, 2001, the command will have effected no substantive change.

This patch is ready for review.
Attachment #500829 - Flags: review?
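One way to picture the concern above is a guard that refuses to copy an obviously reset system time into the hardware clock. This is only a sketch and not part of the attached patch; the cutoff date and the name `safe_to_copy_system_time` are assumptions chosen for illustration.

```python
from datetime import datetime

# Hypothetical cutoff: any system time before this is assumed to be a reset clock.
EARLIEST_PLAUSIBLE = datetime(2010, 1, 1)


def safe_to_copy_system_time(system_now, earliest=EARLIEST_PLAUSIBLE):
    """Return True when the system time looks sane enough to write to the hardware clock."""
    return system_now >= earliest


if __name__ == "__main__":
    # A clock reset to January 1, 2001 fails the check; the current time should pass.
    print(safe_to_copy_system_time(datetime(2001, 1, 1)))
    print(safe_to_copy_system_time(datetime.now()))
```

A step built this way would skip the hwclock write rather than persist a bad time, leaving NTP or a manual fix to correct the clock first.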
(In reply to comment #32)
> there is a slight issue with this approach.  If the time of the clock is set to
> January 1, 2001 and this command is run, it will not change anything and will
> not prevent issues rebooting.  

Are the machines running NTP normally, or setting the clock using ntpdate at boot?

If so, I don't see this as a problem.
Attachment #500829 - Flags: review? → review?(armenzg)
Comment on attachment 500829 [details] [diff] [review]
tested patch

The patch looks good.
jhford is going to test the command for a fedora64 machine to make sure it works there as well.
Attachment #500829 - Flags: review?(armenzg) → review+
(In reply to comment #34)
> Comment on attachment 500829 [details] [diff] [review]
> tested patch
> 
> The patch looks good.
> jhford is going to test the command for a fedora64 machine to make sure it
> works there as well.

Works on Fedora64
Flags: needs-reconfig?
bash -c sudo hwclock --set --date="$(date +%m/%d/%y\ %H:%M:%S)"
 in dir /home/cltbld/talos-slave/test/build (timeout 1200 secs)
 watching logfiles {}
 argv: ['bash', '-c', 'sudo hwclock --set --date="$(date +%m/%d/%y\\ %H:%M:%S)"']
 environment:
<snip>
  HOSTNAME=talos-r3-fed-045.build.mozilla.org
<snip>
 closing stdin
 using PTY: True
program finished with exit code 0
elapsedTime=0.525749



http://hg.mozilla.org/build/buildbotcustom/rev/2f9f1a47b238
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Did you consider adding the hwclock call to count_and_reboot.py ? Only one place to maintain it there.
Flags: needs-reconfig?
(In reply to comment #37)
> Did you consider adding the hwclock call to count_and_reboot.py ? Only one
> place to maintain it there.

No, I didn't think about putting it there.  That script is run on all of our systems and beyond looking at a file on the filesystem, the script wouldn't know if it was on a fedora or centos machine based on os.name.  If we had a consolidated class hierarchy for all test runs, we would only have one place to maintain this code.
Causing burning on 1.9.2 at least: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1294696296.1294696982.17907.gz

Did the sudoers file get updated on build and test boxen?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #38)

Turns out we need to handle build machines because tests run there on 1.9.1 and 1.9.2 anyway.
I added a message pointing to this bug in the status message for the Firefox3.6 and Firefox3.5 trees, since someone actually asked before just pushing into the permared. Wonders never cease.

Please remove it when this is fixed, since I probably won't remember.
(In reply to comment #40)
I mis-spoke slightly here. We want to run hwclock on talos-r3-fed* but not the compilation slaves that do tests on 1.9.1 and 1.9.2. How about a hostname check in the reboot script ?
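A hostname check of the kind suggested above might look like the following sketch. The function name `wants_hwclock_reset` is hypothetical, and the prefix test assumes that every Fedora mini follows the `talos-r3-fed*` naming seen in this bug (which also covers the `talos-r3-fed64-*` hosts).

```python
import socket


def wants_hwclock_reset(hostname):
    """Return True for talos-r3-fed* hosts (32-bit and 64-bit Fedora minis)."""
    # Drop any domain part so fully qualified names are handled too.
    short_name = hostname.split(".", 1)[0]
    return short_name.startswith("talos-r3-fed")


if __name__ == "__main__":
    # Check the machine this script is running on.
    print(wants_hwclock_reset(socket.gethostname()))
```

Compilation slaves that run tests on 1.9.1 and 1.9.2 would fall through this check and skip the hwclock step.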
I am testing a patch that will allow us to selectively enable the hwclock reset from buildbot-configs.
Attached patch: buildbot-configs
add required configuration entries
Attached patch: buildbotcustom
buildbotcustom patch to allow enabling and disabling the hwclock steps. This patch assumes that all builder masters are running on non-minis and accordingly defaults build masters to not setting the hwclock.
Attachment #503201 - Flags: review?(armenzg)
Attachment #503202 - Flags: review?(armenzg)
Comment on attachment 503202 [details] [diff] [review]
buildbotcustom

This patch is well done.
I would like you to address a couple of nits and/or answer the questions.

r+ with the nits/questions.

>diff --git a/misc.py b/misc.py
>--- a/misc.py
>+++ b/misc.py
>@@ -1032,7 +1034,7 @@ def generateBranchObjects(config, name):
>                         config, name, platform, "%s debug test" % base_name,
>                         "%s-%s-unittest" % (name, platform),
>                         suites_name, suites, mochitestLeakThreshold,
>-                        crashtestLeakThreshold))
>+                        crashtestLeakThreshold, resetHwClock=False))
>             continue
> 
>         if config['enable_nightly']:
Do you need to add this if the default in the factory is already False?
I ran it without it through ./test-master.sh -8 and it seemed to be fine.

>@@ -2414,13 +2416,14 @@ def generateTalosBranchObjects(branch, b
>                     triggeredUnittestBuilders.append(('tests-%s-%s-%s-unittest' % (branch, slave_platform, test_type), test_builders, merge_tests))
> 
>                     for suites_name, suites in branch_config['platforms'][platform][slave_platform][unittest_suites]:
>+                        resetHwClock = branch_config['platforms'][platform][slave_platform].get('reset_hw_clock', False)
>                         # create the builders
>                         branchObjects['builders'].extend(generateTestBuilder(
>                                 branch_config, branch, platform, "%s %s %s test" % (platform_name, branch, test_type),
>                                 "%s_%s_test" % (branch, slave_platform_name),
>                                 suites_name, suites, branch_config.get('mochitest_leak_threshold', None),
>                                 branch_config.get('crashtest_leak_threshold', None),
>-                                platform_config[slave_platform]['slaves']))
>+                                platform_config[slave_platform]['slaves'], resetHwClock=resetHwClock))
> 
>                     for scheduler_name, test_builders, merge in triggeredUnittestBuilders:
>                         for test in test_builders:

If you are not reusing resetHwClock do it like this:
>-                                platform_config[slave_platform]['slaves']))
>+                                platform_config[slave_platform]['slaves'],
>+                                resetHwClock = branch_config['platforms'][platform][slave_platform].get('reset_hw_clock', False))
Attachment #503202 - Flags: review?(armenzg) → review+
Attachment #503201 - Flags: review?(armenzg) → review+
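The opt-in lookup discussed in this review can be sketched in isolation as follows. The `reset_hw_clock` key and the `platforms` / platform / slave platform nesting follow the quoted diff; the helper name and the example configuration values are hypothetical and do not reflect the real buildbot-configs contents.

```python
def reset_hw_clock_enabled(branch_config, platform, slave_platform):
    """Look up the per-slave-platform flag, defaulting to False when it is absent."""
    slave_config = branch_config["platforms"][platform][slave_platform]
    return slave_config.get("reset_hw_clock", False)


# Hypothetical configuration for illustration only.
EXAMPLE_CONFIG = {
    "platforms": {
        "linux": {
            "fedora": {"reset_hw_clock": True},
            "other-slave-platform": {},
        },
    },
}

if __name__ == "__main__":
    print(reset_hw_clock_enabled(EXAMPLE_CONFIG, "linux", "fedora"))
    print(reset_hw_clock_enabled(EXAMPLE_CONFIG, "linux", "other-slave-platform"))
```

Defaulting to `False` means a slave platform has to opt in explicitly, so machines that were never given sudo access to hwclock are not asked to run it.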
Comment on attachment 503202 [details] [diff] [review]
buildbotcustom

>diff --git a/misc.py b/misc.py
>--- a/misc.py
>+++ b/misc.py
>@@ -6853,7 +6854,7 @@ class MozillaTestFactory(MozillaBuildFac
>         if self.buildsBeforeReboot and self.buildsBeforeReboot > 0:
>             #This step is to deal with minis running linux that don't reboot properly
>             #see bug561442
>-            if 'linux' in self.platform:
>+            if self.resetHwClock and 'linux' in self.platform:
>                 self.addStep(ShellCommand(
>                     name='set_time',
>                     description=['set', 'time'],
Can you add a comment that we set self.resetHwClock to True only for the Fedora/minis?
In fact, would we need this "and 'linux' in self.platform" anymore?
(In reply to comment #46)
> Comment on attachment 503202 [details] [diff] [review]
> buildbotcustom
> 
> This patch is well done.
> I would like you to address a couple of nits and/or answer the questions.
> 
> r+ with the nits/questions.
> >+                        crashtestLeakThreshold, resetHwClock=False))
> Do you need to add this if the default in the factory is already False?
> I run it without it through ./test-master.sh -8 and it seemed to be fine.

I don't need to have it here, but I include it to be explicit.  I don't have a strong preference and am ok to remove it.

> >                     for suites_name, suites in branch_config['platforms'][platform][slave_platform][unittest_suites]:
> >+                        resetHwClock = branch_config['platforms'][platform][slave_platform].get('reset_hw_clock', False)
> >-                                platform_config[slave_platform]['slaves']))
> >+                                platform_config[slave_platform]['slaves'], resetHwClock=resetHwClock))
> > 
> >                     for scheduler_name, test_builders, merge in triggeredUnittestBuilders:
> >                         for test in test_builders:
> 
> If you are not reusing resetHwClock do it like this:
> >-                                platform_config[slave_platform]['slaves']))
> >+                                platform_config[slave_platform]['slaves'],
> >+                                resetHwClock = branch_config['platforms'][platform][slave_platform].get('reset_hw_clock', False))

While I am not reusing the variable, I intentionally had it on a second line to make the call more legible.  I can certainly move the dictionary lookup to the call.

> >-            if 'linux' in self.platform:
> >+            if self.resetHwClock and 'linux' in self.platform:
> Can you add a comment that we set self.resetHwClock to True only for the
> Fedora/minis?
> In fact, would we need this "and 'linux' in self.platform" anymore?

It will not break if we remove the check for linux platform, but those steps are only valid for linux machines so the extra check is safety padding.
passed on preproduction-master
Flags: needs-reconfig?
verified that this is fixed
Status: REOPENED → RESOLVED
Closed: 14 years ago → 13 years ago
Resolution: --- → FIXED
FIXED might be a bit premature. We've still had lots of fed and fed64 boxes fall over since this was first deployed (Jan 6th). Let's see what states zandr finds when he deals with the latest batch of machines that fell over.
Yeah, if this fix didn't work, then the problem is likely more severe and the next thing to try is seeing if there is a mount option that ignores dates, and push that out.
I'd rather treat the underlying disease than the symptom, but that may be lost down in the interaction between Linux and the Mac hardware.
Flags: needs-reconfig?
http://forums.debian.net/viewtopic.php?f=10&t=45797 describes the options available to us. 

We're running e2fsprogs 1.41.9 on the talos slaves. 1.41.10 has a config option that will let us fix this, though it's challenging to deploy. (the config file needs to be on the initrd)

I think that the ugly config options described in http://forums.debian.net/viewtopic.php?f=10&t=45797#p261964 will work with 1.41.9, but someone will have to try it to find out.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Throwing this back into the pool as this is new work
Assignee: jhford → nobody
Severity: critical → major
Priority: P5 → --
Moving to Server Ops: RelEng in hopes of implementing comment 54.
Assignee: nobody → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Whiteboard: [buildslaves] → [buildslaves][slaveduty]
This package was updated in Fedora 13 and we should be able to rebuild http://mirror02.ipgn.com.au/fedora/linux/releases/13/Everything/source/SRPMS/e2fsprogs-1.41.10-6.fc13.src.rpm on a fedora 12 box
I'm going to try to build this using the R2 mini I have at home, and a basic (non-Mozilla) install of F12 on it.  If I get a working, updated initrd, I'll hand this back to someone onsite to try out on a Mozilla system.
Assignee: server-ops-releng → dustin
Hm, I downloaded the i386 F12 netinst CD, and it doesn't boot on this r2 mini.  Tips on the quickest way from zero to running F12 on an r2 mini?  Clonezilla?  Different install disk?  Ship one to me?  They are small, after all.
You need to do a DVD install, followed by a rEFIt partition sync, followed by a grub fix from the Fedora sysrescue image.
I installed the F12 image with clonezilla, and I can't get it to boot.  I can't tell how much of that is due to the bad USB ports, and how much is due to installing an image designed for r3 minis on an r2 mini.

I'm in conversation with zandr as to how best to work around this.
Well, we put a 1U remote KVM on it, but that doesn't include power control, so it's not particularly practical.  Zandr's going to mail me an r3 mini to work on, and then I'll mail it back when that's done.
Status: REOPENED → ASSIGNED
Shipped today for Tuesday delivery.
Received.  I'll get to work on this ASAP.
This was amazingly easy :)
Attachment #535171 - Flags: review?(bhearsum)
Comment on attachment 535171 [details] [diff] [review]
m561442-puppet-manfiests-p1-r1.patch

Hmm, you include the boot class in the masters, CentOS, and Fedora but inside of it you only do anything for Fedora. Is that intentional?
Yes, they all need to boot!  It just happens that everything but CentOS does it without assistance from puppet :)

I put it everywhere 'include network' appeared.  It occurs to me that should appear for mac hosts, too, but I'm not going to fix that here.
Attachment #535171 - Flags: review?(bhearsum) → review+
Committed and deployed.  I'm calling this done until I hear otherwise during a colo trip.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
I'll let you know if I see any date problems on machines not already on the reboots bug.
(In reply to comment #65)
> Created attachment 535171 [details] [diff] [review] [review]
> m561442-puppet-manfiests-p1-r1.patch
> 
> This was amazingly easy :)

:-D

(In reply to comment #69)
> I'll let you know if I see any date problems on machines not already on the
> reboots bug.
Please do. Hopefully this means fewer linux test machine headaches. Fingers crossed.
It's only been a week, but so far, so good.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations