Closed Bug 774844 Opened 11 years ago Closed 10 years ago

Intermittent OS X 10.7 leakstats | log file incomplete (after "obj-firefox/dist/bin/leakstats: no callsite for 'F' (70)! | Unknown event type 0xffffff90")

Categories

(Release Engineering :: General, defect)

x86_64
macOS
defect
Not set
major

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: emorley, Unassigned)

Details

(Keywords: intermittent-failure, Whiteboard: [red])

I'm not sure whether this belongs in Core::* or Release Engineering.

OS X 10.7 64-bit mozilla-inbound leak test build on 2012-07-17 08:28:49 PDT for push 0888a88ab4c7

slave: bld-lion-r5-080

https://tbpl.mozilla.org/php/getParsedLog.php?id=13605090&tree=Mozilla-Inbound

{
========= Started compare current leak logs failed (results: 2, elapsed: 0 secs) (at 2012-07-17 09:02:59.237050) =========
obj-firefox/dist/bin/leakstats ../malloc.log
 in dir /builds/slave/m-in-osx64-dbg/build (timeout 1200 secs)
 watching logfiles {}
 argv: ['obj-firefox/dist/bin/leakstats', '../malloc.log']
 environment:
  Apple_PubSub_Socket_Render=/tmp/launch-6NvL74/Render
  CCACHE_BASEDIR=/builds/slave/m-in-osx64-dbg
  CCACHE_COMPRESS=1
  CCACHE_DIR=/builds/ccache
  CCACHE_UMASK=002
  CVS_RSH=ssh
  DISPLAY=/tmp/launch-DIb19R/org.x:0
  HG_SHARE_BASE_DIR=/builds/hg-shared
  HOME=/Users/cltbld
  LC_ALL=C
  LOGNAME=cltbld
  MOZ_CRASHREPORTER_NO_REPORT=1
  MOZ_OBJDIR=obj-firefox
  MOZ_SIGN_CMD=python /builds/slave/m-in-osx64-dbg/tools/release/signing/signtool.py --cachedir /builds/slave/m-in-osx64-dbg/signing_cache -t /builds/slave/m-in-osx64-dbg/token -n /builds/slave/m-in-osx64-dbg/nonce -c /builds/slave/m-in-osx64-dbg/tools/release/signing/host.cert -H mac-signing3.build.scl1.mozilla.com:9100 -H mac-signing4.build.scl1.mozilla.com:9100
  PATH=/tools/python/bin:/tools/buildbot/bin:/opt/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
  PWD=/builds/slave/m-in-osx64-dbg/build
  PYTHONPATH=/tools/buildbot/lib/python2.7/site-packages
  SHELL=/bin/bash
  SSH_AUTH_SOCK=/tmp/launch-ckJ4IH/Listeners
  TMPDIR=/var/folders/30/yq_p3wk15yb9wdsv6sm_m0v00000gn/T/
  USER=cltbld
  VERSIONER_PYTHON_PREFER_32_BIT=no
  VERSIONER_PYTHON_VERSION=2.7
  XPCOM_DEBUG_BREAK=stack-and-abort
  __CF_USER_TEXT_ENCODING=0x1F5:0:0
 using PTY: False
obj-firefox/dist/bin/leakstats starting at Tue Jul 17 09:02:59 2012
obj-firefox/dist/bin/leakstats: no callsite for 'F' (70)!
Unknown event type 0xffffff90
obj-firefox/dist/bin/leakstats: log file incomplete
program finished with exit code 1
elapsedTime=0.245636
Unable to parse leakstats output
========= Finished compare current leak logs failed (results: 2, elapsed: 0 secs) (at 2012-07-17 09:02:59.506804) =========
}
Summary: OS X 10.7 build failure with: "obj-firefox/dist/bin/leakstats: no callsite for 'F' (70)! | Unknown event type 0xffffff90 | obj-firefox/dist/bin/leakstats: log file incomplete | program finished with exit code 1" → OS X 10.7 compare current leak logs failure with: "obj-firefox/dist/bin/leakstats: no callsite for 'F' (70)! | Unknown event type 0xffffff90 | obj-firefox/dist/bin/leakstats: log file incomplete | program finished with exit code 1"
https://tbpl.mozilla.org/php/getParsedLog.php?id=13761534&tree=Mozilla-Inbound

{
========= Started compare current leak logs failed (results: 2, elapsed: 0 secs) (at 2012-07-22 19:33:16.858487) =========
obj-firefox/dist/bin/leakstats ../malloc.log
 in dir /builds/slave/m-in-osx64-dbg/build (timeout 1200 secs)
 watching logfiles {}
 argv: ['obj-firefox/dist/bin/leakstats', '../malloc.log']
 environment:
  Apple_PubSub_Socket_Render=/tmp/launch-9hZd6s/Render
  CCACHE_BASEDIR=/builds/slave/m-in-osx64-dbg
  CCACHE_COMPRESS=1
  CCACHE_DIR=/builds/ccache
  CCACHE_UMASK=002
  CVS_RSH=ssh
  DISPLAY=/tmp/launch-ohFZM5/org.x:0
  HG_SHARE_BASE_DIR=/builds/hg-shared
  HOME=/Users/cltbld
  LC_ALL=C
  LOGNAME=cltbld
  MOZ_CRASHREPORTER_NO_REPORT=1
  MOZ_OBJDIR=obj-firefox
  MOZ_SIGN_CMD=python /builds/slave/m-in-osx64-dbg/tools/release/signing/signtool.py --cachedir /builds/slave/m-in-osx64-dbg/signing_cache -t /builds/slave/m-in-osx64-dbg/token -n /builds/slave/m-in-osx64-dbg/nonce -c /builds/slave/m-in-osx64-dbg/tools/release/signing/host.cert -H mac-signing3.build.scl1.mozilla.com:9100 -H mac-signing4.build.scl1.mozilla.com:9100
  PATH=/tools/python/bin:/tools/buildbot/bin:/opt/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
  PWD=/builds/slave/m-in-osx64-dbg/build
  PYTHONPATH=/tools/buildbot/lib/python2.7/site-packages
  SHELL=/bin/bash
  SSH_AUTH_SOCK=/tmp/launch-IMAG00/Listeners
  TMPDIR=/var/folders/30/yq_p3wk15yb9wdsv6sm_m0v00000gn/T/
  USER=cltbld
  VERSIONER_PYTHON_PREFER_32_BIT=no
  VERSIONER_PYTHON_VERSION=2.7
  XPCOM_DEBUG_BREAK=stack-and-abort
  __CF_USER_TEXT_ENCODING=0x1F5:0:0
 using PTY: False
obj-firefox/dist/bin/leakstats starting at Sun Jul 22 19:33:16 2012
Unknown event type 0xffffffe8
obj-firefox/dist/bin/leakstats: log file incomplete
program finished with exit code 1
elapsedTime=0.174396
Unable to parse leakstats output
========= Finished compare current leak logs failed (results: 2, elapsed: 0 secs) (at 2012-07-22 19:33:17.055132) =========
}
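For triage, the two failures above share a recognizable signature in the leakstats output. The following is a minimal sketch, assuming only the message formats visible in the quoted logs; the function name `classify_leakstats_output` and the shape of the returned dictionary are hypothetical and not part of leakstats or buildbot.

```python
import re

# Matches lines such as "Unknown event type 0xffffff90" seen in the logs above.
UNKNOWN_EVENT = re.compile(r"Unknown event type (0x[0-9a-fA-F]+)")


def classify_leakstats_output(output):
    """Summarize known failure markers found in leakstats output text."""
    info = {"incomplete": False, "unknown_events": [], "missing_callsite": False}
    for line in output.splitlines():
        if line.rstrip().endswith("log file incomplete"):
            info["incomplete"] = True
        if "no callsite for" in line:
            info["missing_callsite"] = True
        match = UNKNOWN_EVENT.search(line)
        if match:
            info["unknown_events"].append(match.group(1))
    return info
```

Run against the first log it would flag an incomplete log, a missing callsite, and one unknown event type; against the second, an incomplete log and one unknown event type only.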
Depends on: 539334
Whiteboard: [orange][red] → [red]
Depends on: 828946
Summary: OS X 10.7 compare current leak logs failure with: "obj-firefox/dist/bin/leakstats: no callsite for 'F' (70)! | Unknown event type 0xffffff90 | obj-firefox/dist/bin/leakstats: log file incomplete | program finished with exit code 1" → Intermittent OS X 10.7 leakstats | log file incomplete (after "obj-firefox/dist/bin/leakstats: no callsite for 'F' (70)! | Unknown event type 0xffffff90")
Note this doesn't appear on OrangeFactor due to bug 694170.
Could you please find someone to own this? (It is a top orange at the moment.) :-)
Flags: needinfo?(catlee)
dbaron, any ideas about this?
Flags: needinfo?(dbaron)
Flags: needinfo?(catlee)
Is there any chance these machines are running out of disk space while running the test?  Do we know how large the files typically are?
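One way to answer both questions would be to record the size of malloc.log and the free space on its volume just before leakstats runs. This is only a sketch under assumptions: it uses Python 3's `shutil.disk_usage` (the slaves in the logs run Python 2.7, where it is not available), and the helper name `disk_report` and the 1 GiB threshold are hypothetical.

```python
import os
import shutil


def disk_report(log_path, min_free_bytes=1 << 30):
    """Return (log_size, free_bytes, low_on_space) for the volume holding log_path.

    min_free_bytes is an arbitrary illustrative threshold (1 GiB by default).
    """
    log_size = os.path.getsize(log_path) if os.path.exists(log_path) else 0
    directory = os.path.dirname(os.path.abspath(log_path))
    free_bytes = shutil.disk_usage(directory).free
    return log_size, free_bytes, free_bytes < min_free_bytes
```

Calling `disk_report("../malloc.log")` from the build directory and printing the result into the step log would show whether truncated logs correlate with low free space.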

Given that the diffbloatdump output isn't functioning correctly, I suspect we should at least stop running the compare-logs part of this (though maybe it's working on other platforms). That said, the part that's failing here isn't actually that part.

(It used to show run-to-run differences of allocation stacks, so if a lot more memory were allocated in a particular stack than in the previous run, we'd show the stack for that allocation, in tree form.  In other words, the point of the log comparison was that, for leak regressions in this test, the log would often contain enough data to debug the problem without doing anything else.)

Maybe chat with the memshrink folks, though; they might know if anybody is still using this.  (Given that we've mostly stopped caring about not cleaning stuff up at shutdown, this has become a less useful test.  It also had UI that worked better with tinderbox and doesn't work well with TBPL, so people stopped paying attention to it after the switch to TBPL.)
Flags: needinfo?(dbaron) → needinfo?(khuey)
Except s/previous run/some other run/: it stopped being a comparison against anything other than "some random other run" the minute we got our second buildslave running on a platform, or at least within a few hours, the first time that one push did a clobber build while a later push did a dep build that finished before the earlier push's build.

I think last time we wanted to kill it, Standard8 and one other person, I forget who, spoke up on its behalf.
One cost of this failure that doesn't appear in the bug is how it is dealt with on Try: nearly everyone retriggers the build, and thus also triggers twice as many tests (on all three OS X versions).

An amusing, and very simple, way of dealing with it would be to just turn it into not-an-error, by taking out the TEST-UNEXPECTED-FAIL and returning 0. If we then completely break it without realizing it because it's silent, surely the people who rely on it working will notice, won't they?
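A rough sketch of that idea, assuming the step were wrapped in a small Python 3 script rather than changed in the buildbot configuration itself (where the real TEST-UNEXPECTED-FAIL handling lives, and which is not shown in this bug). The function name `run_nonfatal` and the warning text are hypothetical.

```python
import subprocess
import sys


def run_nonfatal(argv):
    """Run argv, echo its output, and always return 0.

    A non-zero exit is reported as a plain warning instead of a failure,
    so the step never turns the build red.
    """
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    sys.stdout.write(proc.stdout)
    if proc.returncode != 0:
        print("WARNING: %s exited with code %d (ignored)" % (argv[0], proc.returncode))
    return 0
```

The step would then call something like `sys.exit(run_nonfatal(["obj-firefox/dist/bin/leakstats", "../malloc.log"]))`. The trade-off is the one already noted: a genuinely broken leakstats would also pass silently.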
Depends on: 887234