Closed Bug 607928 Opened 14 years ago Closed 13 years ago

linux-ix-slave08 has /builds mounted ro

Categories

(Release Engineering :: General, defect, P3)

x86
All
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: paul.biggar, Unassigned)

References

Details

(Whiteboard: [slaveduty][badslave?][hardware])

I guess there are two parts here: first, that my build failed because count_and_reboot.py failed, and second, that count_and_reboot.py can fail at all.

Here's the log:

http://stage.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/pbiggar@mozilla.com-3cb9e5913d42/tryserver-linux/tryserver-linux-build4157.txt.gz

Here's the relevant snippet (I think):

python: can't open file 'tools/buildfarm/maintenance/count_and_reboot.py': [Errno 2] No such file or directory


Finally, the commit was messing around with the Python settings in configure.in, which is unlikely to be related, but possible, so I thought I'd mention it.



Can we fix my build or should I push again?
(In reply to comment #0)
> I guess there's two parts here, firstly that my build did fail due to
> count_and_reboot.py failing, and secondly that count_and_reboot.py can fail.
> 
> Here's the log:
> 
> http://stage.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/pbiggar@mozilla.com-3cb9e5913d42/tryserver-linux/tryserver-linux-build4157.txt.gz
> 
> Here's the relevant snippet (I think).
> 
> python: can't open file 'tools/buildfarm/maintenance/count_and_reboot.py':
> [Errno 2] No such file or directory

It looks like the filesystem has been remounted read-only.

> Finally, the commit was messing around with the Python settings in
> configure.in, which is unlikely to be related, but possible, so I thought I'd
> mention it.

Each block of
========= Started maybe rebooting failed (results: 2, elapsed: 0 secs) ==========
and
======== Finished maybe rebooting failed (results: 2, elapsed: 0 secs) ========

gets its own environment and runs as a distinct, unrelated process forked from the original build slave process.  A change to the Python settings in the mozilla-central makefiles wouldn't change the system installation of Python, which is what count_and_reboot.py would use.
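
To illustrate the point about separate environments, here is a minimal sketch. It is not buildbot code; the helper run_step and the variable DEMO_PYTHON_SETTING are made up for the example. It only shows that a setting handed to one child process is not visible to a sibling started from the same parent.

```python
import os
import subprocess
import sys


def run_step(code, extra_env=None):
    """Run one 'step' as a separate child process with its own copy of the environment.

    run_step is a hypothetical helper for illustration, not part of buildbot.
    """
    env = dict(os.environ)          # each child gets a private copy
    env.pop("DEMO_PYTHON_SETTING", None)
    if extra_env:
        env.update(extra_env)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Child code that reports what it sees for the (made-up) variable.
PRINT_VAR = "import os; print(os.environ.get('DEMO_PYTHON_SETTING', 'unset'))"

# The first step has the setting changed; the second step is started afterwards
# from the same parent and does not inherit that change.
first = run_step(PRINT_VAR, {"DEMO_PYTHON_SETTING": "changed-by-build"})
second = run_step(PRINT_VAR)
print(first, second)
```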

> Can we fix my build or should I push again?

Please repush.
Assignee: nobody → jhford
Dmesg output:

[cltbld@linux-ix-slave08 slave]$ dmesg | grep EXT3-fs
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
EXT3-fs: mounted filesystem with ordered data mode.
EXT3-fs: error loading journal.
EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
EXT3-fs error (device sda4): ext3_free_blocks_sb: bit already cleared for block 37390634
EXT3-fs error (device sda4) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda4) in ext3_truncate: Journal has aborted
EXT3-fs error (device sda4) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda4) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device sda4) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda4) in ext3_delete_inode: Journal has aborted
EXT3-fs error (device sda4): ext3_journal_start_sb: Detected aborted journal
[cltbld@linux-ix-slave08 slave]$ dmesg | grep sda4
 sda: sda1 sda2 sda3 sda4
EXT3 FS on sda4, internal journal
EXT3-fs error (device sda4): ext3_free_blocks_sb: bit already cleared for block 37390634
Aborting journal on device sda4.
EXT3-fs error (device sda4) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda4) in ext3_truncate: Journal has aborted
EXT3-fs error (device sda4) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda4) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device sda4) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda4) in ext3_delete_inode: Journal has aborted
EXT3-fs error (device sda4): ext3_journal_start_sb: Detected aborted journal

I am going to try to fsck this.
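
For checking other slaves for the same symptom, something like the following could summarize the dmesg output. This is only a sketch: the function name summarize_ext3_errors is made up, and the regular expression is based solely on the message format seen in the output above, so other kernels may word these lines differently.

```python
import re
from collections import Counter

# Matches lines such as:
#   EXT3-fs error (device sda4) in ext3_truncate: Journal has aborted
EXT3_ERROR = re.compile(r"EXT3-fs error \(device (?P<dev>\w+)\)")


def summarize_ext3_errors(dmesg_text):
    """Return (error count per device, set of devices with an aborted journal)."""
    errors = Counter()
    aborted = set()
    for line in dmesg_text.splitlines():
        match = EXT3_ERROR.search(line)
        if not match:
            continue
        dev = match.group("dev")
        errors[dev] += 1
        lowered = line.lower()
        if "journal has aborted" in lowered or "aborted journal" in lowered:
            aborted.add(dev)
    return errors, aborted


# A few lines copied from the dmesg output in this bug.
SAMPLE = """\
EXT3-fs: mounted filesystem with ordered data mode.
EXT3-fs error (device sda4): ext3_free_blocks_sb: bit already cleared for block 37390634
EXT3-fs error (device sda4) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda4): ext3_journal_start_sb: Detected aborted journal
"""

errors, aborted = summarize_ext3_errors(SAMPLE)
print(dict(errors), sorted(aborted))
```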
Summary: Build failure due to count_and_reboot.py failure → linux-ix-slave08 has /builds mounted ro
Looking at the buildbot logs, at about 7:10 this slave died during an hg update:

/tools/python/bin/hg update --clean --repository /builds/slave/tryserver-linux/build --rev ea98e24fa41b30fc4b9e8154f03c5bb71beddf4e

with 

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]

This build was unable to reboot the machine because the connection was lost, and the reboot step errored out with:

remoteFailed: [Failure instance: Traceback: <type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'callRemote'
...<snip>...
]

Judging by the end time of this build and the start of the next build, this machine did not reboot.  Subsequent commands that tried to modify any data all failed with messages like:

rm: cannot remove `<snip>': Read-only file system
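
A step could detect this state up front instead of failing on the first write. The sketch below parses text in the /proc/mounts format and reports whether a mount point carries the "ro" option. The function names and the sample mount table are made up for illustration; the sample is not taken from linux-ix-slave08.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text, or None."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[1] == mount_point:
            found = fields[3].split(",")  # keep the last match; later mounts shadow earlier ones
    return found


def is_read_only(mounts_text, mount_point):
    """True if mount_point is listed and mounted with the 'ro' option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "ro" in options


# Hypothetical mount table, for illustration only.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw,errors=remount-ro 0 0
/dev/sda4 /builds ext3 ro,data=ordered 0 0
"""

print(is_read_only(SAMPLE_MOUNTS, "/builds"))
```

On a real Linux host the text would come from reading /proc/mounts, for example with open("/proc/mounts").read().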

The perfect storm was completed by a failed clone of the tools repository, which left the slave without count_and_reboot.py.

Action points to prevent this in future:
-make sure that count_and_reboot.py will work on an r/o filesystem.  I am not sure how it behaves if it can't write to reboot_count.txt.
-investigate using the fstab option errors=panic to ensure that the system reboots when a filesystem failure occurs.  This will kill jobs, but they are jobs that are expected to fail anyway because of the r/o fs.
-stop using haltOnFailure=False unless a step is 100% safe to let fail.
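
For the first action point, one possible behaviour is to treat an unwritable counter file as a reason to reboot rather than as a fatal error. The sketch below is not the real count_and_reboot.py, whose logic I have not checked; bump_count and its max_count parameter are hypothetical.

```python
import os
import tempfile


def bump_count(count_file, max_count):
    """Increment the counter stored in count_file.

    Returns (count, should_reboot). If the file cannot be read or written
    (for example on a read-only filesystem), a reboot is requested instead
    of raising. Hypothetical helper, not the real script.
    """
    try:
        with open(count_file) as f:
            count = int(f.read().strip() or 0)
    except FileNotFoundError:
        count = 0                      # first run: no counter yet
    except (OSError, ValueError):
        return None, True              # unreadable or corrupt: ask for a reboot
    count += 1
    try:
        with open(count_file, "w") as f:
            f.write(str(count))
    except OSError:
        return count, True             # e.g. EROFS: cannot persist, ask for a reboot
    return count, count >= max_count


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "reboot_count.txt")
    first = bump_count(path, max_count=2)
    second = bump_count(path, max_count=2)
    # A path in a directory that does not exist stands in for an unwritable location.
    unwritable = bump_count(
        os.path.join(workdir, "missing-dir", "reboot_count.txt"), max_count=2
    )

print(first, second, unwritable)
```

Requesting a reboot on a write failure is a design choice, not the only option; it matches the idea above that a slave with a broken filesystem should stop taking jobs.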
I just tried to reach this machine but was unable to do so.  We should probably look into whether or not this machine is up.
Assignee: jhford → nobody
Whiteboard: [buildduty]
Priority: -- → P3
Whiteboard: [buildduty] → [buildduty][badslave?][hardware]
Are we trying to debug this, or should we reimage?
Whiteboard: [buildduty][badslave?][hardware] → [slaveduty][badslave?][hardware]
Recent build history indicates that this slave is building fine.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Release Engineering