Closed Bug 525937 Opened 15 years ago Closed 14 years ago

Create an SD image version 5 with /builds on new filesystem

Categories

(Release Engineering :: General, defect)

ARM
Maemo
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mozilla, Assigned: mozilla)

Attachments

(4 files, 3 obsolete files)

Much like bug 525037.

We already have /media/mmc2 mounted as another filesystem, so we can nuke and/or re-create swap and the .mozilla profile at boot.

Nuking the unittest directories sometimes gives the same filesystem corruption.  We probably can't get around this entirely, since buildbot will probably need to live on / rather than this new filesystem (unless we re-create that dir at boot too).

So:

- enlarge /media/mmc2
- softlink /builds to /media/mmc2/builds
- move buildbot to /tools/buildbot, update all relevant scripts
Also, use /standalone.txt instead of /builds/standalone.txt.
I will take a look at this before the N810s arrive
Assignee: nobody → jford
+xrestop (see bug 528642)
Blocks: 528642
No longer blocks: 528642
Depends on: 528642
Attached file hearbeat in Py (obsolete)
A replacement for daily_reboots.sh that runs as a daemon in the background.
Attachment #414554 - Flags: review?(aki)
Attachment #414554 - Attachment is patch: true
Attachment #414554 - Attachment mime type: application/octet-stream → text/plain
Attachment #414554 - Attachment is patch: false
Comment on attachment 414554 [details]
hearbeat in Py

I definitely disagree with the settings, but that's pretty easy to change.  I currently have 1hr w/out buildbot and 24hrs with, and I think the latter is high.  Here it's 4hrs w/out and 40 with.

(I think the max uptime thing is faulty, too, but not sure how to fix it. It's mainly to try to catch hung buildbot processes that are running but aren't connected to a master, often arising from a buildbot master reconfig/restart or other hiccup.)

This will remove the ability to be notified when devices are having issues, but that's not necessarily a part of this script.

So r+ as a script. We should test in staging and compare reliability.  We can disable this via the rc script (launch or don't launch) and can leave the daily_reboot.sh in the image until this + the server side of monitoring are vetted as an improvement.
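The thresholds under discussion can be pictured as a small decision function. This is only a sketch, not the attached script: the function name, the dictionary names, and the way uptime is passed in are assumptions made for illustration. Only the hour values come from this comment.

```python
# Hypothetical sketch of the reboot policy discussed above; not the attached script.
HOUR = 3600

# Hour values quoted in the review: the reviewer's current settings
# versus the values used in the attachment.
CURRENT_LIMITS = {"without_buildbot": 1 * HOUR, "with_buildbot": 24 * HOUR}
PROPOSED_LIMITS = {"without_buildbot": 4 * HOUR, "with_buildbot": 40 * HOUR}


def should_reboot(uptime_seconds, buildbot_running, limits):
    """Return True when uptime exceeds the limit for the current state."""
    key = "with_buildbot" if buildbot_running else "without_buildbot"
    return uptime_seconds > limits[key]
```

With these numbers, a device that has been up for two hours without buildbot would be rebooted under the current settings but not under the proposed ones.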
Attachment #414554 - Flags: review?(aki) → review+
(In reply to comment #5)
> hung buildbot processes that are running but aren't
> connected to a master

This should probably read 'hung buildbot processes that are running and *think* they're connected to a master, but aren't taking further steps/jobs from the master' or something.
Attached file updated script
Slight modification to the algorithm: the maximum uptime with buildbot running is now a distinct value instead of a factor.

Are you saying that the uptime max with buildbot running is flawed in concept or my implementation?  I agree that it is flawed in concept, but a proper fix would likely need something on the buildbot side, or else log parsing.  We could check when the last build step completed and reboot if that was too long ago.  I don't know whether we'd want to do this as part of this script, or in any perpetually running script.
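The "reboot if the last build step finished too long ago" idea could look roughly like the following. It is a sketch only: how the completion time would be obtained (something buildbot side, or log parsing) is left open in this bug, so the function simply takes timestamps as arguments. The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def build_looks_stalled(last_step_finished, now, max_idle):
    """Return True if no build step has completed within max_idle.

    last_step_finished is the datetime of the last completed step,
    or None when that information is not available.
    """
    if last_step_finished is None:
        # No information: do not call it stalled; the uptime limits still apply.
        return False
    return now - last_step_finished > max_idle
```

A caller would pass the current time and a `timedelta` threshold of its choosing. Returning False when the timestamp is unknown is a design choice that avoids reboot loops when the log cannot be read.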

For finding issues with the n810s from the master side, we can use the buildslaves scraper.  If a device is showing up as disconnected on that it is safe to say that it is not actively working.  I just found a Python web-scraping toolkit (BeautifulSoup) that I am going to look at to see if I can make the scraper more robust.

I will be including this script in the upcoming patch for the daemon update.
Attachment #414554 - Attachment is obsolete: true
> If a device is showing up as disconnected on that it is
> safe to say that it is not actively working

should read

If a device is showing up as disconnected on that it is safe to say that it isn't working on a build.  The scraper may be falsely showing a slave as disconnected when it is really just idle.
(In reply to comment #7)
> Are you saying that the uptime max with buildbot running is flawed in concept
> or my implementation?

Flawed in concept; my implementation as well.
> 
> For finding issues with the n810s from the master side, we can use the
> buildslaves scrapper.  If a device is showing up as disconnected on that it is
> safe to say that it is not actively working.

I'd like it to be a bit more knowledgeable (and tested!) before we switch fully over.
If we had the log server that kept track of

a) what buildbot master the slave thinks it should be connecting to
b) whether it's standalone or not
c) whether its hostname is correct <-- (a little hard to do client-side, but you can report IP and determine if you're maemo-n810-ref at least)
d) a history of when it went down and came back up

etc., then that would probably help here.  The n810-status page is helpful but frequently not completely correct, and I've seen it go blank enough times that we might end up getting 80 false-alarm emails on a semi-regular basis.
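As a rough illustration of what such a log server might store per device, items (a) through (d) map onto a small record type. Nothing like this exists in the bug: the class name, field names, and methods are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class SlaveReport:
    """Hypothetical per-device record covering items (a) to (d)."""
    master: str       # (a) master the slave thinks it should connect to
    standalone: bool  # (b) whether the device is in standalone mode
    ip_address: str   # (c) reported IP, so the hostname can be checked server-side
    history: List[Tuple[datetime, str]] = field(default_factory=list)  # (d)

    def record(self, when, state):
        """Append an 'up' or 'down' transition to the history."""
        if state not in ("up", "down"):
            raise ValueError("state must be 'up' or 'down'")
        self.history.append((when, state))

    def last_down(self) -> Optional[datetime]:
        """Return the time of the most recent 'down' event, or None."""
        for when, state in reversed(self.history):
            if state == "down":
                return when
        return None
```

Keeping the up/down history on the server side would let an alerting script compare a device's last known state against a blank status page before sending mail.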

That's probably beyond the scope of this bug, and closer to the scope of the WONTFIXed or otherwise resolved monitoring bug.
I'll let this sit in staging for a while.
Attachment #416650 - Attachment is obsolete: true
Comment on attachment 416666 [details] [diff] [review]
add call to 'false' afterwards, to fail out of the step

Saving us from a lot of tree burning; I'd like to land this so it's official.
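The effect of a trailing `false` can be checked in isolation: a `;`-separated shell command list exits with the status of its last command, so appending `false` makes the whole step report failure even when the command before it succeeded. The snippet below demonstrates only that shell behaviour. The `echo` command is a stand-in, not the command from the patch, and a POSIX `sh` is assumed to be available.

```python
import subprocess

# "echo cleanup" is a placeholder for whatever the step runs before `false`.
without_false = subprocess.run(
    ["sh", "-c", "echo cleanup"], stdout=subprocess.DEVNULL
)
with_false = subprocess.run(
    ["sh", "-c", "echo cleanup; false"], stdout=subprocess.DEVNULL
)

# The first exit status is zero; the second is non-zero because `false` ran last.
statuses = (without_false.returncode, with_false.returncode)
```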
Attachment #416666 - Flags: review?(jford)
Attached file list of devices needing help currently (obsolete)
I ran

  all_maemo_ssh.sh "uptime; ls -l /builds/standalone.txt && cat /builds/standalone.txt" 2>&1 | tee foo

on p-m-m and edited foo.  Going to image these with sd version4.
Attachment #418291 - Attachment mime type: application/octet-stream → text/plain
Blocks: maemo4
Assignee: jhford → aki
Patch to get build/tools to look like what's actually running on maemo-flashing.
Attachment #422851 - Flags: review?(jhford)
Attachment #422851 - Flags: review?(jhford) → review+
Attachment #416666 - Flags: review?(jhford) → review+
Attachment #422851 - Flags: checked-in+
Comment on attachment 416666 [details] [diff] [review]
add call to 'false' afterwards, to fail out of the step

http://hg.mozilla.org/build/buildbotcustom/rev/026bd101db68
Attachment #416666 - Flags: checked-in+
I:

* flashed sd4 onto an sd card
* booted, installed xres debs, cleaned up old logs
* rsync'ed that sd card into /flashing/production-sd/moz-rev-sd-v5/
* flashed sd5 onto an sd card (edited bulk-image.sh)
* booted, verified that it runs w/out theme corruption
* edited the scripts to create a larger /media/mmc2, copy /tools/buildbot to /builds/buildbot, log in /var/log
* moved /builds/buildbot to /tools, softlinked /builds to /media/mmc2.

Now the entirety of /builds (/media/mmc2) is reformatted on boot, which will hopefully get around this filesystem corruption.
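The directory shuffle in the last step can be tried safely on a scratch tree. The sketch below is not one of the imaging scripts: the function name is hypothetical, it works under a temporary root rather than on a device, and it does not partition or reformat anything. It assumes a Unix-like system where symlinks are available.

```python
import os
import shutil
import tempfile


def relocate_builds(root):
    """Move buildbot under tools/ and make builds a symlink to media/mmc2."""
    builds = os.path.join(root, "builds")
    tools = os.path.join(root, "tools")
    mmc2 = os.path.join(root, "media", "mmc2")

    os.makedirs(tools, exist_ok=True)
    os.makedirs(mmc2, exist_ok=True)
    # Keep buildbot outside the area that gets reformatted on boot.
    shutil.move(os.path.join(builds, "buildbot"), os.path.join(tools, "buildbot"))
    # Replace the real directory with a link into the reformatted partition.
    shutil.rmtree(builds)
    os.symlink(mmc2, builds)


# Try it on a throwaway directory tree rather than a real device.
scratch = tempfile.mkdtemp()
os.makedirs(os.path.join(scratch, "builds", "buildbot"))
relocate_builds(scratch)
```

After this runs, anything written under `builds` in the scratch tree actually lands in `media/mmc2`, which mirrors why wiping that partition at boot clears the whole of /builds.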

maemo-n810-58 is the guinea pig; running on staging for a while.
Attachment #418291 - Attachment is obsolete: true
Comment on attachment 422890 [details] [diff] [review]
create sd ref image version 5

maemo-n810-58 looks like it's chugging along pretty well.
Attachment #422890 - Flags: review?(jhford)
Comment on attachment 422890 [details] [diff] [review]
create sd ref image version 5

looks good!
Attachment #422890 - Flags: review?(jhford) → review+
Comment on attachment 422890 [details] [diff] [review]
create sd ref image version 5

http://hg.mozilla.org/build/tools/rev/66c9b3773570
Attachment #422890 - Flags: checked-in+
Since monitoring is tracked in bug 510952, this bug is *done*.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering