Closed Bug 305131 Opened 19 years ago Closed 19 years ago

Mac l10n tinderbox builds are failing due to a stale mountpoint (mount-temp) that is used in the packaging process

Categories

(Firefox Build System :: General, defect)

PowerPC
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: mark)

Details

(Keywords: fixed1.8)

Attachments

(4 files, 1 obsolete file)

ref: http://lxr.mozilla.org/mozilla/source/toolkit/mozapps/installer/packager.mk#103

The directory mount-temp is created as a mountpoint for the Mac disk image.
However, something seems to be going wrong with the initial mount, because even
on the first attempt after a reboot, the mountpoint is returning I/O errors.
Subsequent mount and umount attempts also fail.

Note: after a reboot, I have tried running the same series of hdiutil steps from
UNMAKE_PACKAGE in packager.mk by hand, and they have succeeded.
please don't use webtools:tinderbox for issues w/ specific tinderboxes, this
component is for the tinderbox software itself. if your problem is w/ the core
build process, file a bug there, if your problem is w/ specific tinderboxes, use
mozilla.org:tinderbox...
Assignee: mcafee → nobody
Component: Tinderbox → Build Config
Product: Webtools → Core
QA Contact: timeless → build-config
when did this stop working? There doesn't seem to be anything all that
suspicious in the packager.mk changes except for the new DMG-builder... could
that have caused this?
Using the fr builds as my guide post
(http://tinderbox.mozilla.org/showbuilds.cgi?tree=Mozilla-l10n-fr), it looks
like this repackaging error started appearing on 2005/08/10 19:01:00

http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-l10n-fr/1123725660.2096.gz&fulltext=1

This may or may not be a coincidence, but isn't that almost exactly the same time
that we branched for 1.8, or at least when the new tinderbox builds for
Mozilla1.8 were set up? None of the Mac Mozilla1.8-l10n-* builds have ever
worked from what I can see.
This seems to have happened on 0810.

http://tinderbox.mozilla.org/showbuilds.cgi?tree=Mozilla-l10n-fr&hours=24&maxdate=1123750800&legend=0

As you can see, there was trouble on all of the builds.  When it cleared up, the
Mac recovered for one cycle before losing its mind.

Note that all of the green cycles before its last one had disk images on disk1,
during the last green it was on disk2.  If this happens again, let's start with
"diskutil list" and take it from there.  The DiskImages and DiskArbitration
frameworks can be really finicky, although I've never seen the system hold such
a death grip on mount points that they're not released after rebooting.

In my experience, it's better to not specify your own mount points (hdiutil
attach -mountpoint) when working with hdiutil.  I've seen synchronicity problems
doing that, and as we're seeing here, it's rough to get mount points back when
the system decides to take its ball and go home.  It's better to let the system
mount the disk where it wants, in a fresh mount point in /Volumes.

I initially wanted to do private non-/Volumes mounts in the new dmg packager
too, but gave up because it's really not workable.  Maybe UNMAKE_PACKAGE should
follow suit.
I've rebooted maya again, and we'll see how this next build cycle turns out. If
we're still seeing mountpoint issues, we can try out this patch.
The proposed patch will fall down whenever there's a mount point with a space in
it.  Even if we don't support volume names that are supposed to have spaces in
them (I think we should), this will break as soon as the desired mount point is
in use and the system decides to start appending numbers (/Volumes/Firefox 1). 
Since we're trying to avoid trouble when mount points go stale, we should keep
this in mind and account for it.

Also, this will most likely stall the build now that we've potentially got EULAs
in disk images.  hdiutil dumps the license into $PAGER's input pipe, using less
if there is no $PAGER, asks whether you agree on stdout, and waits for a yes or
no answer on stdin.  So, we've got to get really cute with our attach invocation:

echo Y | PAGER=true hdiutil attach -readonly -private -noautoopen $(UNPACKAGE)

When it comes time to extract $(DEV_NAME) we've got to be wary of the prompt
that hdiutil may have also put on stdout.  We've got to match the /dev/disk
pattern instead of awking for the first field.  So instead of the sed-awk pipe,
something like this:

sed -e 's/^.*\(\/dev\/disk[^ ]*\).*$/\1/;1q'
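To illustrate the point about spaces in mount points, here is a minimal shell sketch that pulls the mount point out of saved hdiutil attach output without splitting on whitespace. It assumes each relevant output line carries the device node, a content hint, and then the mount point, which is the same layout the sed expression above relies on. The function names and the sample text are made up for illustration and are not part of any patch on this bug.

```shell
#!/bin/sh
# Hypothetical helper: read saved "hdiutil attach" output on stdin and print
# the first mount point, keeping embedded spaces ("/Volumes/Firefox 1").
# Assumed line layout: /dev/diskNsM, a content hint, then the mount point.
mount_point_from_attach() {
    # Skip to the device node, skip everything up to the next "/",
    # then keep the rest of the line as the mount point.
    sed -n 's|^.*/dev/disk[^[:space:]]*[^/]*\(/.*\)$|\1|p' | head -n 1
}

# Illustrative input only (not captured from a real run).
sample_attach_output() {
    printf '/dev/disk2          \tApple_partition_scheme         \t\n'
    printf '/dev/disk2s2        \tApple_HFS                      \t/Volumes/Firefox 1\n'
}

MOUNTPOINT=$(sample_attach_output | mount_point_from_attach)
# Always quote the result when using it, for example:
#   rsync -a "${MOUNTPOINT}/Firefox.app" firefox
printf 'mount point: "%s"\n' "$MOUNTPOINT"
```

Lines that name a device but no mount point are ignored, so the partition-scheme line in the sample produces no output.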
I would *really* like to continue using explicit mountpoints unless we can
document that it doesn't work: it seems like using the "echo Y | hdiutil" is
what we need to do here (yuck... there should be a hdiutil -y flag or something).
*** Bug 305470 has been marked as a duplicate of this bug. ***
This implements the suggestions I made above, except it retains explicit mount
points.  This won't keep them from going stale.  I don't have a documented
procedure to reproduce the stale mount points, other than that I have noticed
synchronicity problems (volumes not unmounted or devices not released until
after detach returns).  I think that we should take it seriously, since this
started happening well before the packaging changes landed.

So, if you don't like mounting in /Volumes, how would you feel about using
-mountroot instead of -mountpoint?  That treats the argument as a replacement
for /Volumes and creates a mount point within, rather than treating the
argument as the direct mount point.  It still might not solve the staleness
problem, but it might be worth a shot to see if it does work.  We can use
-mountpoint as long as 10.3 minimum is in use.
Assignee: nobody → mark
Status: NEW → ASSIGNED
Comment on attachment 193439 [details] [diff] [review]
More reliable unpackaging

Let's try it on trunk.
Attachment #193439 - Flags: review+
Checked in to trunk.  Marking FIXED, reopen/refile if stale mount points
reappear; requesting approval1.8b4 for trunk fun.
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Comment on attachment 193439 [details] [diff] [review]
More reliable unpackaging

branch fun, I mean
Attachment #193439 - Flags: approval1.8b4?
Reopening because this caused bustage.  When hdiutil attaches disk images that
it hasn't verified yet, it verifies them, producing output on stdout.  The
change to the code to extract DEV_NAME on disk images with licenses assumed that
the first line would contain the device name, but this is not the case when
dealing with unverified images.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
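For reference, a small shell sketch of the line-independent approach: scan every line of the saved attach output for the first /dev/disk token instead of trusting line 1. The function names are hypothetical, and the two leading sample lines are placeholders for whatever verification text precedes the device table, not real hdiutil output.

```shell
#!/bin/sh
# Hypothetical helper: print the first /dev/disk* token found anywhere in
# saved "hdiutil attach" output, regardless of which line it appears on.
first_disk_device() {
    grep -o '/dev/disk[^[:space:]]*' | head -n 1
}

# Illustrative input only: the first two lines stand in for verification
# chatter that may come before the device table.
sample_unverified_output() {
    printf 'Checksumming whole disk image...\n'
    printf 'verified\n'
    printf '/dev/disk2          \tApple_partition_scheme         \t\n'
    printf '/dev/disk2s2        \tApple_HFS                      \t/Volumes/Firefox\n'
}

DEV_NAME=$(sample_unverified_output | first_disk_device)
echo "device: $DEV_NAME"
```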
Attached patch l10n bustage fix - Splinter Review
The stale mount points will need to be reclaimed on maya once this lands.
Attachment #193532 - Flags: review?(benjamin)
Comment on attachment 193532 [details] [diff] [review]
l10n bustage fix

are the "echo" statements for debugging (will be removed before checkin)?
yes, echos are debug-only and will be removed
Attachment #193532 - Flags: review?(benjamin) → review+
Fixed on trunk.  Someone needs to unmount the temp mounts on maya.
Status: REOPENED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
I've rebooted maya to fix the hung mountpoints.
This needs to be done on the branch in order to get Mac l10n builds.  Right now,
l10n branch builds are failing because the disk images now include EULAs but
don't have a 24-hour operator standing by to click "Accept."

hdiutil: attach failed - user canceled operation
(because stdin's not hooked up to anything at all.)
Flags: blocking1.8b4?
Severity: normal → major
Flags: blocking1.8b5+
Flags: blocking1.8b4?
Flags: blocking1.8b4+
Target Milestone: --- → mozilla1.8beta4
Severity: major → normal
Flags: blocking1.8b5+
Target Milestone: mozilla1.8beta4 → ---
Attachment #193439 - Flags: approval1.8b4? → approval1.8b4+
Keywords: fixed1.8
Have the mountpoints on Maya been removed yet since the branch checkin?
(In reply to comment #20)
> Have the mountpoints on Maya been removed yet since the branch checkin?

maya never got a chance to mount a branch dmg since its most recent boot, so
there shouldn't be any stuck mount points.
So, this worked well on the trunk but it's not flying on the branch.  I suggest
rebooting maya first, since the branch and trunk are now running exactly the
same code, and it's working on the trunk on the same machine - maybe those
"cancelled" mount attempts during previous branch builds got the system into a
weird state?

If that doesn't work, it's time to seriously consider letting the system mount
in /Volumes or an out-of-tree mountroot and (or?) possibly making the detach
operation nonfatal, along the lines of detach || (sleep 5 && detach -force).
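The detach-then-force idea could be sketched in shell roughly as follows. HDIUTIL and RETRY_DELAY are hypothetical knobs added only so the function can be tried without a real disk image; the only hdiutil usage assumed is "detach <device>" and "detach <device> -force".

```shell
#!/bin/sh
# HDIUTIL and RETRY_DELAY are hypothetical knobs added for this sketch so the
# function can be exercised without a real disk image.
HDIUTIL=${HDIUTIL:-hdiutil}
RETRY_DELAY=${RETRY_DELAY:-5}

# Try a normal detach; if it fails, wait and retry once with -force.
# Returns 0 if either attempt succeeds, otherwise the forced attempt's status.
detach_with_retry() {
    dwr_dev=$1
    if "$HDIUTIL" detach "$dwr_dev"; then
        return 0
    fi
    echo "detach of $dwr_dev failed; forcing in ${RETRY_DELAY}s" >&2
    sleep "$RETRY_DELAY"
    "$HDIUTIL" detach "$dwr_dev" -force
}
```

A caller that wants the failure to be truly nonfatal can append its own fallback, for example: detach_with_retry "$DEV_NAME" || echo "warning: still mounted" >&2
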

The failure:

set -e; unset NEXT_ROOT; export PAGER=true; mkdir mount-temp; echo Y | hdiutil
attach -readonly -mountpoint mount-temp -private -noautoopen
/builds/tinderbox/Fx-Mozilla1.8-l10n/Darwin_7.9.0_Clobber/firefox.dmg >
hdi.output; DEV_NAME=`perl -n -e 'if($_=~/(\/dev\/disk[^ ]*)/) {print
$1."\n";exit;}'< hdi.output`; MOUNTPOINT=`perl -n -e 'split(/\/dev\/disk[^
]*/,$_,2);if($_[1]=~/(\/.*)/) {print $1."\n";exit;}'< hdi.output`; rsync -a
${MOUNTPOINT}/DeerPark.app firefox; hdiutil detach ${DEV_NAME}; 
"disk1" failed to unmount (0x0000C001)

That's the error you get when a mount is busy.  (Poking it with lsof might be
educational.)

Failed on the first attempted locale for Firefox:

http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8-l10n-bg/1124946420.18733.gz&fulltext=1

Thunderbird:

http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8-l10n-ca/1124947140.30685.gz&fulltext=1

Subsequent locales (and subsequent attempts for these builds until reboot) die
on the stuck mount:

http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8-l10n-fr/1124947140.31737.gz&fulltext=1
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It seemed to work on my tree last night, at least for Thunderbird:

---------------
cd ../../dist/l10n-stage && \
  set -e; unset NEXT_ROOT; export PAGER=true; mkdir mount-temp; echo Y | hdiutil
attach -readonly -mountpoint mount-temp -private -noautoopen
/Volumes/6GB_IBM_TEMP/SRC_ROOT_MZ18/mozilla/thunderbird-objdir/dist/thunderbird-1.0+.en-US.mac.dmg
> hdi.output; DEV_NAME=`perl -n -e 'if($_=~/(\/dev\/disk[^ ]*)/) {print
$1."\n";exit;}'< hdi.output`; MOUNTPOINT=`perl -n -e 'split(/\/dev\/disk[^
]*/,$_,2);if($_[1]=~/(\/.*)/) {print $1."\n";exit;}'< hdi.output`; rsync -a
${MOUNTPOINT}/Thunderbird.app thunderbird; hdiutil detach ${DEV_NAME}; 
"disk3" unmounted.
"disk3" ejected.
--------------------

Strange that it doesn't work on maya.
Rebooted maya and restarted scripts.
Now getting this on the trunk, although nothing significant has changed since
the recent successful cycles:

hdiutil: attach failed - codec overrun

http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-l10n-fr/1125001140.8160.gz&fulltext=1

So is it not picking up good dmgs from atlantia?

This makes me want to cry.
The codec overrun is probably just a bad dmg.  It only affected Thunderbird/trunk.

Firefox/trunk went red because of bug 305686 - backed out until I've got
something better.

I still expect the branch to go green.
But it didn't.  It's still getting

"disk1" failed to unmount (0x0000C001)

probably for reasons similar to those causing it to be broken before the new
packager landed.

I have to do more work on the unpackager for bug 305686 anyway, I'll integrate
some of my suggestions from comment 22.
the error I am seeing on branch thunderbird is different, though - more like the
one described in comment #25, but with an additional warning:
-----------------------------------
make[1]: *** Warning: File
`/builds/tinderbox/Tb-Mozilla1.8-l10n/Darwin_7.9.0_Clobber/thunderbird.dmg' has
modification time in the future (2005-08-25 13:54:18 > 2005-08-25 13:54:02)
-----------------------------------
weird...
On the -de tinderbox (and other "early" ones), you mean?  I noticed that too. 
It's because the file is downloaded (with wget) from staging and brings the
file's original timestamp with it.  The clocks aren't in sync.  You'll see that
the messages are gone after some time has elapsed and it gets to -fr.  Odd that
the times worked out this way with both branch and trunk, but it seems plausible
given the end times of triton's recent builds.

So now there are some bogus mounts again because of the rsyncs that failed.  set
-e is too simple-minded for our purposes here.
Trying to unmount the mounted disk array is interesting.

$ umount /dev/disk1s2
/builds/tinderbox/Fx-Mozilla1.8-l10n/Darwin_7.9.0_Clobber/mozilla/dist/l10n-stage/mount-t:
No such file or directory

ls on .../mount-t shows it doesn't exist (of course), there's .../mount-temp/. 
Gah, buffers.
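The truncated name is consistent with the 89-character limit mentioned later in this bug: cutting the full mount-temp path at 89 characters gives exactly the path umount complained about. A quick shell check (the function name is made up for this sketch, and it says nothing about where the limit is enforced):

```shell
#!/bin/sh
# Keep only the first 89 characters of a path. The limit comes from this
# bug's discussion; this sketch only reproduces the arithmetic.
truncate_to_89() {
    printf '%.89s\n' "$1"
}

MOUNT_DIR=/builds/tinderbox/Fx-Mozilla1.8-l10n/Darwin_7.9.0_Clobber/mozilla/dist/l10n-stage/mount-temp
truncate_to_89 "$MOUNT_DIR"
# prints the same path ending in .../dist/l10n-stage/mount-t
```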
(In reply to comment #25)
> Now getting this on the trunk, although nothing significant has changed since
> the recent successful cycles:
> 
> hdiutil: attach failed - codec overrun
> 
> http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-l10n-fr/1125001140.8160.gz&fulltext=1
> 
> So is it not picking up good dmgs from atlantia?

Probably maya was downloading the .dmg file from stage.m.o at the same time
triton was uploading a new .dmg file.  This seems possible given the available
data.  triton started building 13:38, finished 13:57.  maya's wget started
13:53:31, finished 13:53:34.

Since all of the build systems use scp to upload, they are streaming bytes
directly to these files.  This is the first time I recall something like this
happening in our build farm.  Ideally we should use rsync to upload builds.  (rsync
writes to '.filename' and moves it to 'filename' after the transfer is complete.)
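The write-then-rename pattern can be shown locally with plain cp and mv. This is only a sketch of the idea, not rsync itself and not the build scripts' actual upload step; publish_atomically and the .upload.XXXXXX temporary name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy SRC next to DEST under a hidden temporary name,
# then rename it into place. A rename within one directory is atomic, so a
# reader sees either the old file or the complete new one, never a partial.
publish_atomically() {
    pa_src=$1
    pa_dest=$2
    pa_dir=$(dirname "$pa_dest")
    pa_tmp=$(mktemp "$pa_dir/.upload.XXXXXX") || return 1
    # mktemp creates the file with mode 0600; widening it to 0644 is an
    # assumption about what a download area would want.
    if cp "$pa_src" "$pa_tmp" && chmod 644 "$pa_tmp" && mv -f "$pa_tmp" "$pa_dest"; then
        return 0
    fi
    rm -f "$pa_tmp"
    return 1
}
```

With this shape, a download that starts in the middle of an upload picks up the previous complete file rather than a truncated one.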

Even more ideally, we should download the nightly builds, not the hourly builds,
generally for different reasons but also because they are uploaded to a dated
directory on stage.m.o and then copied locally (lessening the window in which a
build file could be grabbed).

> This makes me want to cry.

I second your emotion.
Attached patch v3, revamp to stop the tears (obsolete) — Splinter Review
With Chase's help on maya, we learned some interesting things that should help
fix this for good.  Here's what I'm up to this time:

Mounts are in /tmp now.  It's done with -mountroot, so the system will pick a
good name that's not in use.  This helps in case something gets wedged. 
Keeping the mounts out of the tree means rm -rf l10n-stage won't croak if the
unmount fails.  Most importantly, it avoids the asinine 89-character fixed
buffer limitation.

This is much better about unmounting, too.  It unmounts whenever an interim
operation (like an rsync) fails, then it throws the build process.  And if an
unmount attempt fails, it backs off for a few seconds and then attempts to
forcibly unmount.
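The "unmount on any interim failure, then fail the build" behavior can be sketched as a small shell wrapper. This is an illustration of the pattern under assumed names (guarded, release_image), not the code in the attached patch.

```shell
#!/bin/sh
# guarded CLEANUP CMD [ARGS...]
# Run CMD; if it fails, call CLEANUP and hand back CMD's exit status so the
# caller can stop the build after the image has been released.
guarded() {
    g_cleanup=$1
    shift
    "$@" && return 0
    g_status=$?
    "$g_cleanup"
    return "$g_status"
}

# Example shape (release_image is a hypothetical function that detaches the
# disk image):
#   guarded release_image rsync -a "${MOUNTPOINT}/Firefox.app" firefox || exit 1
#   release_image
```

Unlike a bare set -e, the failing step still gets its cleanup before the build stops.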

The other changes are in support of bug 305686.  Benjamin, I've made the
assumption that dist/branding exists in the l10n tree at unpackaging time and
stays around through repackaging.  If the assumption is incorrect, we'll need
to mkdir -p it in (browser|mail)/locales/Makefile.in - will cover if necessary
in 305686, please advise.

I'm leaving set -x in intentionally this time, because UNMOUNT_PACKAGE is hairy
and it'll be nice to be clued in if something else goes wrong, now or ever.

I thought that this would come out much more disgustingly than it did.  I've
kind of got a soft spot for it now, in that ugly-baby sort of way.

I've got to reboot to clean out some hung mount points.
Attachment #193892 - Flags: review?(benjamin)
Comment on attachment 193892 [details] [diff] [review]
v3, revamp to stop the tears

evil version of the patch, hold on
Attachment #193892 - Attachment is obsolete: true
Attachment #193892 - Flags: review?(benjamin)
Always quote optional arguments in the section that rips the resources out of
the disk image to avoid shell vomit.
Attachment #193894 - Flags: review?(benjamin)
Comment on attachment 193894 [details] [diff] [review]
v4, stop the tears from the other eye

+    return $$1 || $$?; \
should be
+    return $$1 && $$?; \
Comment on attachment 193894 [details] [diff] [review]
v4, stop the tears from the other eye

Also need to quote all uses of ${MOUNTPOINT}:

+  rsync -a "$${MOUNTPOINT}/$(_APPNAME)" $(MOZ_PKG_APPNAME) || cleanup 1; \

The only weird one is this:

+    (rsync -a "$${MOUNTPOINT}/.background/`basename
"$(MOZ_PKG_MAC_BACKGROUND)"`" "$(MOZ_PKG_MAC_BACKGROUND)" || cleanup 1); \

because the quote nesting looks wrong, but I've tested it and it's right.
Attachment #193894 - Flags: review?(benjamin) → review+
The latest batch of fixes is on the trunk.
Status: REOPENED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Keywords: fixed1.8
Attachment #193894 - Flags: approval1.8b4?
Attachment #193894 - Flags: approval1.8b4? → approval1.8b4+
Keywords: fixed1.8
"codec overrun" again on trunk Thunderbird - timestamp on staging is 1306, build
attempt completed 1309, so it fits again.  Not funny.
Finally, the first successful branch Mac l10n, complete with EULA repackaging:

http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8-l10n-fr/1125087480.12036.gz&fulltext=1
(In reply to comment #38)
> "codec overrun" again on trunk Thunderbird - timestamp on staging is 1306, build
> attempt completed 1309, so it fits again.  Not funny.

Teh suck.  I've done emergency surgery on triton's build scripts to use rsync. 
I wish we hadn't needed to do this under pressure but it looked like something
that must happen asap.

Respins soon to be happening on all of triton's trees.  Keep your fingers
crossed.  If it works the changes can be replicated to all other build systems.
(In reply to comment #40)
> Respins soon to be happening on all of triton's trees.  Keep your fingers
> crossed.  If it works the changes can be replicated to all other build systems.

Looks like it worked based on the output of the latest trunk build process. 
I'll commit my changes.
Product: Core → Firefox Build System