Closed Bug 1244446 Opened 4 years ago Closed 3 years ago

Intermittent failure to running rmtree on Linux, "WARNING: Command failed. See logs for output.", "# umount -n /builds/mock_mozilla/mozilla-centos6-x86_64/root/tmp/ccache", etc.

Categories

(Firefox Build System :: General, defect)

Unspecified
Linux
defect
Not set

Tracking

(firefox50 fixed)

RESOLVED FIXED
mozilla50
Tracking Status
firefox50 --- fixed

People

(Reporter: aryx, Assigned: glandium)

References

Details

(Keywords: intermittent-failure)

Attachments

(2 files)

https://treeherder.mozilla.org/logviewer.html#?job_id=20771996&repo=mozilla-inbound

Also seen similar behavior yesterday, can't find a bug for it.

01:43:12    ERROR -  WARNING: Command failed. See logs for output.
01:43:12    ERROR -   # umount -n /builds/mock_mozilla/mozilla-centos6-x86_64/root/sys
01:43:12    ERROR -  WARNING: Command failed. See logs for output.
01:43:12    ERROR -   # umount -n /builds/mock_mozilla/mozilla-centos6-x86_64/root/proc
01:43:12    ERROR -  Traceback (most recent call last):
01:43:12    ERROR -    File "/usr/sbin/mock_mozilla", line 862, in <module>
01:43:12    ERROR -      main(retParams)
01:43:12    ERROR -    File "/usr/sbin/mock_mozilla", line 740, in main
01:43:12    ERROR -      sys.exit(chroot.shell(options, cmd))
01:43:12    ERROR -    File "/usr/lib/python2.6/site-packages/mock_mozilla/backend.py", line 659, in shell
01:43:12    ERROR -      self._setupDev()
01:43:12    ERROR -    File "<peak.util.decorators.rewrap wrapping mock_mozilla.backend._setupDev at 0x013F3230>", line 3, in _setupDev
01:43:12    ERROR -    File "/usr/lib/python2.6/site-packages/mock_mozilla/trace_decorator.py", line 70, in trace
01:43:12    ERROR -      result = func(*args, **kw)
01:43:12    ERROR -    File "/usr/lib/python2.6/site-packages/mock_mozilla/backend.py", line 386, in _setupDev
01:43:12    ERROR -      mock_mozilla.util.rmtree(self.makeChrootPath("dev"), selinux=self.selinux)
01:43:12    ERROR -    File "<peak.util.decorators.rewrap wrapping mock_mozilla.util.rmtree at 0x013E0938>", line 3, in rmtree
01:43:12    ERROR -    File "/usr/lib/python2.6/site-packages/mock_mozilla/trace_decorator.py", line 70, in trace
01:43:12    ERROR -      result = func(*args, **kw)
01:43:12    ERROR -    File "/usr/lib/python2.6/site-packages/mock_mozilla/util.py", line 87, in rmtree
01:43:12    ERROR -      shutil.rmtree(path, *args, **kargs)
01:43:12    ERROR -    File "/usr/lib64/python2.6/shutil.py", line 212, in rmtree
01:43:12    ERROR -      rmtree(fullname, ignore_errors, onerror)
01:43:12    ERROR -    File "/usr/lib64/python2.6/shutil.py", line 221, in rmtree
01:43:12    ERROR -      onerror(os.rmdir, path, sys.exc_info())
01:43:12    ERROR -    File "/usr/lib64/python2.6/shutil.py", line 219, in rmtree
01:43:12    ERROR -      os.rmdir(path)
01:43:12    ERROR -  OSError: [Errno 16] Device or resource busy: '/builds/mock_mozilla/mozilla-centos6-x86_64/root/dev/shm'
01:43:12    ERROR - Return code: 1
01:43:12  WARNING - setting return code to 2
01:43:12    FATAL - Halting on failure while running ['mock_mozilla', '-r', 'mozilla-centos6-x86_64', '-q', '--cwd', '/builds/slave/m-in-l64-000000000000000000000/build/src/obj-firefox', '--unpriv', '--shell', '/usr/bin/env MOZ_AUTOMATION=1 "LESSOPEN=|/usr/bin/lesspipe.sh %s" TMOUT=86400 LOGNAME=cltbld USER=cltbld PATH=/tools/buildbot/bin:/usr/local/bin:/usr/lib64/ccache:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/tools/git/bin:/tools/python27/bin:/tools/python27-mercurial/bin:/home/cltbld/bin SYMBOL_SERVER_USER=ffxbld DISPLAY=:2 CCACHE_UMASK=002 LANG=en_US.UTF-8 CCACHE_HASHDIR= TERM=linux SHELL=/bin/bash MOZ_SIGNING_SERVERS=gpg:sha2signcode:osslsigncode:signcode:mar:jar:b2gmar:emevoucher:signing4.srv.releng.scl3.mozilla.com:9110,gpg:sha2signcode:osslsigncode:signcode:mar:jar:b2gmar:emevoucher:signing5.srv.releng.scl3.mozilla.com:9110,gpg:sha2signcode:osslsigncode:signcode:mar:jar:b2gmar:emevoucher:signing6.srv.releng.scl3.mozilla.com:9110,dmgv2:mac-v2-signing1.srv.releng.scl3.mozilla.com:9110,dmgv2:mac-v2-signing2.srv.releng.scl3.mozilla.com:9110,dmgv2:mac-v2-signing3.srv.releng.scl3.mozilla.com:9110,dmgv2:mac-v2-signing4.srv.releng.scl3.mozilla.com:9110,dmgv2:mac-v2-signing6.srv.releng.scl3.mozilla.com:9110,dmgv2:mac-v2-signing7.srv.releng.scl3.mozilla.com:9110 SHLVL=1 G_BROKEN_FILENAMES=1 HISTSIZE=1000 SYMBOL_SERVER_PATH=/mnt/netapp/breakpad/symbols_ffx/ CCACHE_DIR=/builds/ccache HG_SHARE_BASE_DIR=/builds/hg-shared SYMBOL_SERVER_HOST=symbolpush.mozilla.org MAIL=/var/spool/mail/cltbld MOZ_BUILD_DATE=20160130013032 TINDERBOX_OUTPUT=1 "MOZ_SIGN_CMD=python /builds/slave/m-in-l64-000000000000000000000/build/tools/release/signing/signtool.py --cachedir /builds/slave/m-in-l64-000000000000000000000/build/signing_cache -t /builds/slave/m-in-l64-000000000000000000000/token -n /builds/slave/m-in-l64-000000000000000000000/nonce -c /builds/slave/m-in-l64-000000000000000000000/build/tools/release/signing/host.cert -H gpg:sha2signcode:osslsigncode:signcode:mar:jar:b2gmar:emevoucher:signing4.srv.releng.scl3.mozilla.com:9110 -H gpg:sha2signcode:osslsigncode:signcode:mar:jar:b2gmar:emevoucher:signing5.srv.releng.scl3.mozilla.com:9110 -H gpg:sha2signcode:osslsigncode:signcode:mar:jar:b2gmar:emevoucher:signing6.srv.releng.scl3.mozilla.com:9110 -H dmgv2:mac-v2-signing1.srv.releng.scl3.mozilla.com:9110 -H dmgv2:mac-v2-signing2.srv.releng.scl3.mozilla.com:9110 -H dmgv2:mac-v2-signing3.srv.releng.scl3.mozilla.com:9110 -H dmgv2:mac-v2-signing4.srv.releng.scl3.mozilla.com:9110 -H dmgv2:mac-v2-signing6.srv.releng.scl3.mozilla.com:9110 -H dmgv2:mac-v2-signing7.srv.releng.scl3.mozilla.com:9110" LC_ALL=C TOOLTOOL_HOME=/builds _=/tools/buildbot/bin/python LD_LIBRARY_PATH=/tools/gcc-4.3.3/installed/lib64 MOZ_SYMBOLS_EXTRA_BUILDID=linux64-mozilla-inbound HOSTNAME=bld-linux64-spot-1052.build.releng.usw2.mozilla.com SYMBOL_SERVER_SSH_KEY=/home/mock_mozilla/.ssh/ffxbld_rsa MOZBUILD_STATE_PATH=/builds/slave/m-in-l64-000000000000000000000/.mozbuild HISTCONTROL=ignoredups POST_SYMBOL_UPLOAD_CMD=/usr/local/bin/post-symbol-upload.py PWD=/builds/slave/m-in-l64-000000000000000000000 PROPERTIES_FILE=/builds/slave/m-in-l64-000000000000000000000/buildprops.json MOZ_CRASHREPORTER_NO_REPORT=1 CCACHE_COMPRESS=1 TOOLTOOL_CACHE=/builds/tooltool_cache python2.7 /builds/slave/m-in-l64-000000000000000000000/build/src/mach python /builds/slave/m-in-l64-000000000000000000000/build/src/config/printconfigsetting.py /builds/slave/m-in-l64-000000000000000000000/build/src/obj-firefox/dist/bin/application.ini App SourceStamp']
01:43:12    FATAL - Running post_fatal callback...
01:43:12    FATAL - Exiting 2
Summary: Intermittent failure to running rmtree on Linux → Intermittent failure to running rmtree on Linux, "WARNING: Command failed. See logs for output.", "# umount -n /builds/mock_mozilla/mozilla-centos6-x86_64/root/tmp/ccache", etc.
catlee, coop: this seems to be more frequent (had 4 instances also today so far), can we get someone from releng to look at this ?
Flags: needinfo?(coop)
Flags: needinfo?(catlee)
cc-ing the buildduty people so they can help.

Based on the graph, it seems like this bug started re-occurring between March 7 and 8:

https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1244446&startday=2016-01-01&tree=all

We should start by looking for infra changes around that time frame. e.g. is this related to the new clobberer system :garbas is deploying? I'm not sure on the timing there.

Also, the log in comment #0 is for a linux64 opt build, but are they all opt builds?
Flags: needinfo?(coop)
Flags: needinfo?(catlee)
(In reply to Chris Cooper [:coop] from comment #3)

> Also, the log in comment #0 is for a linux64 opt build, but are they all opt
> builds?

https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1244446&startday=2016-01-01&tree=all says opt and debug
If an infra failure doesn't have a bug which is suggested by treeherder, which this one didn't until the morning of the 8th, then you really don't know anything at all about its frequency. We star hundreds, perhaps thousands, of infra failures as just "infra" without using a bug number, every day.

I restarred fx-team back to March 1st, so https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1244446&entireHistory=true&tree=trunk is a little more accurate now.
Any updates here?
Flags: needinfo?(coop)
Did some investigation here. Here's what I've found up until now:

Faulty run: https://treeherder.mozilla.org/logviewer.html#?job_id=25983380&repo=mozilla-inbound
Successful run: https://treeherder.mozilla.org/logviewer.html#?job_id=25993102&repo=mozilla-inbound

Taking a look at the logs, it can be noticed that the failure from the first case is due to the fact that "umount -n" fails.
    15:23:55     INFO -  WARNING: Command failed. See logs for output.
    15:23:55     INFO -   # umount -n /builds/mock_mozilla/mozilla-centos6-x86_64/root/dev/shm
    15:23:55     INFO -  WARNING: Command failed. See logs for output.
    15:23:55     INFO -   # umount -n /builds/mock_mozilla/mozilla-centos6-x86_64/root/builds/slave

It then forcibly unmounts the files in question:
    15:23:55     INFO -  WARNING: Forcibly unmounting '/builds/mock_mozilla/mozilla-centos6-x86_64/root/dev/shm' from chroot.
    15:23:55     INFO -  WARNING: Forcibly unmounting '/builds/mock_mozilla/mozilla-centos6-x86_64/root/builds/slave' from chroot.
    15:23:55     INFO - Return code: 0

Even though it returns code 0, it seems that the issue is not solved at this point, as we have the following error below:
    15:23:58    ERROR -  OSError: [Errno 16] Device or resource busy: '/builds/mock_mozilla/mozilla-centos6-x86_64/root/dev/shm'

While doing some research in order to better understand mounting/unmounting filesytems, I was wondering if we could use a lazy unmount ("umount -l") instead of forcing it.
Flags: needinfo?(coop)
vlad, alin: you might want to look here https://wiki.mozilla.org/ReleaseEngineering/Applications/Mock and at the source code to see if there is a retry to backoff and umount again
Changing the component, as we had little luck investigating the issue. Maybe some other folks will have a better idea on how to solve this.
Component: Buildduty → General Automation
QA Contact: bugspam.Callek → catlee
Could we add some code to run lsof when we get into this state? Having the lsof output in the log will help us figure out what's holding the mount open.
I did a try push that adds some debugging info here: https://hg.mozilla.org/try/rev/d6b582c03c066f30356a8bc43b94d834ccd7507c

And according to the log[1], it looks like maybe the sccache server is still running. Maybe this is preventing unmounting the chroot dirs?

http://archive.mozilla.org/pub/firefox/try-builds/catlee@mozilla.com-d6b582c03c066f30356a8bc43b94d834ccd7507c/try-linux64/try-linux64-bm78-try1-build9554.txt.gz
Flags: needinfo?(ted)
Flags: needinfo?(mh+mozilla)
The sccache server is supposed to be killed by the postflight in build/sccache.mk, but that only happens after a successful build.

But yes, running processes can very much prevent unmounting things. That's why these days it's better to use container-like things than chroots, (or at the very least cgroups).
Flags: needinfo?(mh+mozilla)
can we have mach clean up here? we're not going to be able to get off of mock (chroot) in the short term
You're better off using cgroups (Or "simply" lsof for that matter). mach can't know for sure all the things that may or may not require being killed. Today it's sccache. Tomorrow it might be something else. If the glibc in the mock environment was more recent, until bug 1274497 it would have been adb too.
You probably already know, but mock can use systemd-nspawn containers since https://bugzilla.redhat.com/show_bug.cgi?id=1132762 - this probably doesn't help on Centos 6, though.
this is spiking and now a frequent failure :(
Component: General Automation → Build Config
Product: Release Engineering → Core
QA Contact: catlee
I'm pretty sure this is due to the sccache not being killed properly. I think mach is the best place to fix this, since by the time mozharness gets back control, the mock environment is already in an inconsistent state.
Considering actual logs and why this is happening, I'm mystified that this didn't start happening earlier. way earlier.
Assignee: nobody → mh+mozilla
At this point, the only remaining part of the postflight for OSX
universal builds dates back to bug 834228. Since then, many things have
changed, one of them being that automation build steps have dependencies
that can be expressed through make dependencies.

While this is not directly related to bug 1244446, fixing this bug gets
more complicated if postflight needs to happen before some of the
automation build steps.

Review commit: https://reviewboard.mozilla.org/r/56736/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/56736/
Attachment #8758482 - Flags: review?(gps)
Attachment #8758483 - Flags: review?(gps)
Killing the sccache background daemon is part of postflight_all, but in
the current setup, postflight_all happens at the end of a "normal" build,
but we run automation build steps after that.

What happens then is that more compilations happen (gtests), which start
sccache again, but there's nothing to kill sccache again once this is
all done.

Now that the OSX universal builds postflight is gone, it is not
necessary for postflight_all to happen before the automation build steps.
So ensure postflight_all scripts happen last.

The downside of this change is that this now prevents sccache.log from
being uploaded, but we should probably send processed data to the graph
server instead.

Review commit: https://reviewboard.mozilla.org/r/56738/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/56738/
Comment on attachment 8758482 [details]
MozReview Request: Bug 1244446 - Remove OSX universal builds postflight. r?gps

https://reviewboard.mozilla.org/r/56736/#review53450

I'm not convinced the deleted `rm -rf` and `ln -s` in `flight.mk` aren't actually needed. But I would think this would break pretty clearly if it is. So I'll r+ and let automation be the arbiter of correct.
Attachment #8758482 - Flags: review?(gps) → review+
Comment on attachment 8758483 [details]
MozReview Request: Bug 1244446 - Run automation build steps before postflight_all. r?gps

https://reviewboard.mozilla.org/r/56738/#review53460

You may want to check with ted to make sure he's ok losing sccache logs. Actually, you may want to run this series by him since he knows OS X universal builds a bit better than me. But otherwise LGTM.
Attachment #8758483 - Flags: review?(gps) → review+
Attachment #8758482 - Flags: review?(ted)
Attachment #8758483 - Flags: review?(ted)
(In reply to Gregory Szorc [:gps] from comment #37)
> Comment on attachment 8758482 [details]
> MozReview Request: Bug 1244446 - Remove OSX universal builds postflight.
> r?gps
> 
> https://reviewboard.mozilla.org/r/56736/#review53450
> 
> I'm not convinced the deleted `rm -rf` and `ln -s` in `flight.mk` aren't
> actually needed. But I would think this would break pretty clearly if it is.
> So I'll r+ and let automation be the arbiter of correct.

Try seems to agree they're not necessary, but I wouldn't be surprised if that breaks some random job that is not on try.
Attachment #8758482 - Flags: review?(ted) → review+
Comment on attachment 8758482 [details]
MozReview Request: Bug 1244446 - Remove OSX universal builds postflight. r?gps

https://reviewboard.mozilla.org/r/56736/#review54292

Nice! Maybe we can get rid of client.mk soon...
Comment on attachment 8758483 [details]
MozReview Request: Bug 1244446 - Run automation build steps before postflight_all. r?gps

https://reviewboard.mozilla.org/r/56738/#review54294

Sad to put more logic in client.mk, but that logic in the mach command wasn't exactly fantastic.
Attachment #8758483 - Flags: review?(ted) → review+
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e89afb1745a1
Remove OSX universal builds postflight. r=gps,r=ted
https://hg.mozilla.org/integration/mozilla-inbound/rev/1d2c46c8c7c7
Run automation build steps before postflight_all. r=gps,r=ted
sorry had to back this out since somehow win8x64 pgo failed after this push like https://treeherder.mozilla.org/logviewer.html#?job_id=29541401&repo=mozilla-inbound
Flags: needinfo?(mh+mozilla)
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7ab897482e06
Backed out changeset 1d2c46c8c7c7 
https://hg.mozilla.org/integration/mozilla-inbound/rev/ed9a7b145c3c
Backed out changeset e89afb1745a1 for windows pgo bustage
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e1bd07ec0b5b
Remove OSX universal builds postflight. r=gps,r=ted
https://hg.mozilla.org/integration/mozilla-inbound/rev/aa8cdba912d3
Run automation build steps before postflight_all. r=gps,r=ted
Flags: needinfo?(mh+mozilla)
https://hg.mozilla.org/mozilla-central/rev/e1bd07ec0b5b
https://hg.mozilla.org/mozilla-central/rev/aa8cdba912d3
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Flags: needinfo?(ted)
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.