Closed
Bug 948145
Opened 12 years ago
Closed 6 years ago
Valgrind jobs intermittently fail with "command timed out: 1200 seconds without output running ['mock_mozilla', '-r', 'mozilla-centos6-x86_6"
Categories
(Testing :: General, defect)
Tracking
(Not tracked)
RESOLVED
INACTIVE
People
(Reporter: n.nethercote, Assigned: gbrown)
References
(Blocks 1 open bug)
Details
(Keywords: intermittent-failure)
Valgrind-on-TBPL runs are now often timing out, with this message:
> command timed out: 1200 seconds without output, attempting to kill
https://tbpl.mozilla.org/php/getParsedLog.php?id=31676175&tree=Mozilla-Central is a recent example in which Valgrind didn't produce any output. Here's some of the output:
> + make pgo-profile-run 'EXTRA_TEST_ARGS=--debugger=valgrind --debugger-args='\''--error-exitcode=1 --smc-check=all-non-file --gen-suppressions=all --leak-check=full --num-callers=15 --show-possibly-lost=no --track-origins=yes --suppressions=/builds/slave/m-cen-l64-valgrind-00000000000/objdir/_valgrind/cross-architecture.sup --suppressions=/builds/slave/m-cen-l64-valgrind-00000000000/objdir/_valgrind/x86_64-redhat-linux-gnu.sup'\'''
> /builds/slave/m-cen-l64-valgrind-00000000000/objdir/_virtualenv/bin/python ../src/build/pgo/profileserver.py --debugger=valgrind --debugger-args='--error-exitcode=1 --smc-check=all-non-file --gen-suppressions=all --leak-check=full --num-callers=15 --show-possibly-lost=no --track-origins=yes --suppressions=/builds/slave/m-cen-l64-valgrind-00000000000/objdir/_valgrind/cross-architecture.sup --suppressions=/builds/slave/m-cen-l64-valgrind-00000000000/objdir/_valgrind/x86_64-redhat-linux-gnu.sup'
>
> command timed out: 1200 seconds without output, attempting to kill
> program finished with exit code 247
https://tbpl.mozilla.org/php/getParsedLog.php?id=31654180&tree=Mozilla-Central is the first example that I saw. Valgrind produced almost all of its output, with only a bit truncated at the end:
> ==1979== LEAK SUMMARY:
> ==1979== definitely lost: 9,032 bytes in 7 blocks
> ==1979== indirectly lost: 1,552 bytes in 6 blocks
> ==1979== possibly lost: 96,747 bytes in 1,749 blocks
> ==1979== still reachable: 3,452,056 bytes in 6,091 blocks
> ==1979== suppressed: 94,165 bytes in 1,775 blocks
> ==1979== Reachable blocks (those to which a pointer was found) ar
> command timed out: 1200 seconds without output, attempting to kill
> program finished with exit code 247
> elapsedTime=3493.368914
| Reporter | ||
Updated•12 years ago
|
Keywords: intermittent-failure
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Reporter | ||
Comment 15•12 years ago
|
||
I turned on extra logging (-d -v) in an attempt to understand this better:
https://hg.mozilla.org/build/tools/rev/2b49e8cb95c1
In many of the runs, Valgrind isn't even printing its start-up messages. Humph.
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Reporter | ||
Comment 26•12 years ago
|
||
I just retriggered some old m-c Valgrind runs from Dec 6, which was before the timeout showed up. They timed out. This suggests that it wasn't something that in m-c that changed to start causing the problems. Ideas:
- Changes to valgrind.sh in the tools/ repo? Or some other part of the testing infrastructure that is stored outside of mozilla-central?
- Changes to the build machines' configurations?
And it's really weird the way the valgrind output is being truncated. E.g. the final calls to VG_(debugLog)() are printing some but not all of their output, despite the fact that the actual writing of the output happens in a single call to the write() syscall, using stderr. It doesn't make sense that that operation could be failing/hanging/whatever... it really feels more like the harness that Valgrind is running inside is doing something wrong.
| Reporter | ||
Comment 27•12 years ago
|
||
Let's try running natively, instead of under Valgrind, in case that tells us anything:
https://hg.mozilla.org/build/tools/rev/7ad8a0bfc8cc
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Reporter | ||
Comment 47•12 years ago
|
||
> it really feels more like the harness that Valgrind is running inside is doing
> something wrong.
> Let's try running natively, instead of under Valgrind, in case that tells us
> anything:
The timeout still occurs! E.g.
https://tbpl.mozilla.org/php/getParsedLog.php?id=31726798&tree=Mozilla-Inbound
Well isn't that interesting.
| Reporter | ||
Comment 48•12 years ago
|
||
> Well isn't that interesting.
Just to clarify: this proves that it's not Valgrind that's hanging. So what other component could be responsible?
| Reporter | ||
Comment 49•12 years ago
|
||
> Let's try running natively, instead of under Valgrind, in case that tells us
> anything:
> https://hg.mozilla.org/build/tools/rev/7ad8a0bfc8cc
I reverted this change here:
https://hg.mozilla.org/build/tools/rev/8e17d3c1e37b
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Reporter | ||
Comment 124•12 years ago
|
||
I temporarily disabled the |make pgo-profile-run| step, to see how that affects things:
https://hg.mozilla.org/build/tools/rev/3db2a940cdc6
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Reporter | ||
Comment 127•12 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #124)
> I temporarily disabled the |make pgo-profile-run| step, to see how that
> affects things:
> https://hg.mozilla.org/build/tools/rev/3db2a940cdc6
They were green. Not terribly surprising. I reverted the change:
https://hg.mozilla.org/build/tools/rev/8a908482428e
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 130•12 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #26)
> I just retriggered some old m-c Valgrind runs from Dec 6, which was before
> the timeout showed up. They timed out. This suggests that it wasn't
> something that in m-c that changed to start causing the problems. Ideas:
>
> - Changes to valgrind.sh in the tools/ repo? Or some other part of the
> testing infrastructure that is stored outside of mozilla-central?
>
> - Changes to the build machines' configurations?
Ick. Maybe catlee can help us find out if anything changed on RelEng's side in that timeframe?
Flags: needinfo?(catlee)
Comment 131•12 years ago
|
||
I don't see any relevant changes happening in puppet in that time.
The primary change is that we switched from running nightly to running per-push. This has resulted in one change to the build environment: MOZ_UPDATE_CHANNEL is unset rather than being set to "nightly".
Flags: needinfo?(catlee)
Comment 132•12 years ago
|
||
This results in slightly different arguments to configure:
--enable-update-channel=nightly
vs
--enable-update-channel=
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 138•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=31789399&full=1&branch=fx-team has a smoking gun:
JavaScript error: http://localhost:8888/index.html, line 55: Quitter is not defined
This is:
http://mxr.mozilla.org/mozilla-central/source/build/pgo/index.html?force=1#55
If the Quitter extension isn't being installed and we hit that error then it makes sense that it would hang.
Comment 139•12 years ago
|
||
I'm wondering if there's something weird going on here where the quitter addon install fails silently and then we hang.
Comment 140•12 years ago
|
||
pretty sure http://hg.mozilla.org/build/tools/rev/c46f324a8b46 is the cause here. I'll remove --vex-iropt-register-updates=allregs-at-each-insn and try some new jobs.
Comment 141•12 years ago
|
||
Comment 142•12 years ago
|
||
(In reply to Chris AtLee [:catlee] from comment #140)
> here. I'll remove --vex-iropt-register-updates=allregs-at-each-insn and try
That strikes me as pretty unlikely to be the root cause, esp given that
it hangs when running natively (comment 47).
Some non-default setting of --vex-iropt-register-updates= now appears
to be pretty much mandatory; see eg bug 913876. I suspect
--vex-iropt-register-updates=allregs-at-mem-access is the required
minimum now.
Comment 143•12 years ago
|
||
> (In reply to Chris AtLee [:catlee] from comment #140)
> > here. I'll remove --vex-iropt-register-updates=allregs-at-each-insn and try
Just to be sure, I reran V's testsuite on amd64-linux with this flag set,
and didn't get any failures as a result.
Comment 144•12 years ago
|
||
I was able to reproduce the hang on the same build with that flag set, and when I removed it it exited successfully....Maybe there's something else at play here?
I see some green V's on inbound prior to me landing.
This failed:
https://tbpl.mozilla.org/?showall=1&jobname=valgrind&tree=Mozilla-Inbound&rev=316ab101c526
And this passed:
https://tbpl.mozilla.org/?showall=1&jobname=valgrind&tree=Mozilla-Inbound&rev=092c3bc22ae8
this code changed:
https://hg.mozilla.org/integration/mozilla-inbound/rev/10800c0df8b5
Comment 145•12 years ago
|
||
Oh, I guess that failure is a build failure not a valgrind failure...
| Reporter | ||
Comment 146•12 years ago
|
||
(In reply to Julian Seward from comment #142)
> (In reply to Chris AtLee [:catlee] from comment #140)
> > here. I'll remove --vex-iropt-register-updates=allregs-at-each-insn and try
>
> That strikes me as pretty unlikely to be the root cause, esp given that
> it hangs when running natively (comment 47).
I agree. I was seeing hangs prior to setting that flag. Also, I re-triggered an old job from Dec 5 with an updated valgrind.sh which had that flag and it ran successfully.
> I see some green V's on inbound prior to me landing.
BTW, you can't trust green V's... well, hanging runs always show up as red, but other failures (such as crashes) show up as green due to bug 823787.
In fact, if you disable the --vex-iropt-register-updates=allregs-at-each-insn flag then Valgrind will sometimes crash before it hangs, and it'll show up as green, and you'll probably think everything went ok unless you look in the logs. (The crash message involves "clone".)
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Reporter | ||
Comment 148•12 years ago
|
||
I was able to reproduce on a build slave. I had to ctrl-c to kill it. Here's the relevant part of the output:
==10502== LEAK SUMMARY:
==10502== definitely lost: 9,032 bytes in 7 blocks
==10502== indirectly lost: 1,552 bytes in 6 blocks
==10502== possibly lost: 96,747 bytes in 1,749 blocks
==10502== still reachable: 3,449,372 bytes in 6,088 blocks
==10502== suppressed: 94,171 bytes in 1,775 blocks
==10502== Reachable blocks (those to which a pointer was found) are not shown.
==10502== To see them, rerun with: --leak-check=full --show-reachable=yes
==10502==
==10502== For counts of detected and suppressed errors, rerun with: -v
==10502== ERROR SUMMARY: 7 errors from 7 contexts (suppressed: 15 from 15)
JavaScript error: http://localhost:8888/index.html, line 55: Quitter is not defined
^CTraceback (most recent call last):
File "../src/build/pgo/profileserver.py", line 68, in <module>
runner.wait()
File "/builds/slave/m-cen-l64-valgrind-00000000000/src/testing/mozbase/mozrunner/mozrunner/runner.py", line 83, in wait
self.process_handler.wait(timeout)
File "/builds/slave/m-cen-l64-valgrind-00000000000/src/testing/mozbase/mozprocess/mozprocess/processhandler.py", line 741, in wait
self.outThread.join(timeout=1)
File "/tools/python27/lib/python2.7/threading.py", line 674, in join
self.__block.wait(delay)
File "/tools/python27/lib/python2.7/threading.py", line 262, in wait
_sleep(delay)
KeyboardInterrupt
WARNING: Command failed. See logs for output.
So Valgrind finished (as indicated by the "ERROR SUMMARY" line. The |Quitter is not defined| thing showed up after a while. And then |runner.wait()| ends up hanging.
| Reporter | ||
Comment 149•12 years ago
|
||
So my current theory is that something change in mozharness. Ted, what do you think, based on comment 148?
Flags: needinfo?(ted)
Comment 150•12 years ago
|
||
http://hg.mozilla.org/build/mozharness has all the changes to mozharness; which change do you suspect?
| Reporter | ||
Comment 151•12 years ago
|
||
> If the Quitter extension isn't being installed and we hit that error then it
> makes sense that it would hang.
I was discounting that because the "Quitter is not defined" message wasn't showing up in logs of some of the hanging jobs. Though maybe that's because buildbot's buffering is hiding it. But if it was a factor, I wouldn't expect Valgrind to finish? Hmm, not sure.
| Reporter | ||
Comment 152•12 years ago
|
||
(In reply to Aki Sasaki [:aki] from comment #150)
> http://hg.mozilla.org/build/mozharness has all the changes to mozharness;
> which change do you suspect?
None specifically; it's just a theory and could be totally wrong :)
| Reporter | ||
Comment 153•12 years ago
|
||
Ok, Quitter is looking more likely. I have the process hanging, and |pstree| says this:
|-sshd-+-sshd---sshd---bash-+-mock_mozilla---valgrind.sh---make---python-+-firefox---30*[{firefox}]
I.e. Firefox is still running. |top| tells me that it's not consuming any CPU.
So why isn't Quitter installed, has this changed lately, and can I remedy it?
| Reporter | ||
Comment 154•12 years ago
|
||
So objdir/dist/xpi-stage/quitter is present, and has various symlinks back to src/tools/quitter/, so that looks reasonable...
| Reporter | ||
Comment 155•12 years ago
|
||
> So objdir/dist/xpi-stage/quitter is present, and has various symlinks back
> to src/tools/quitter/, so that looks reasonable...
glandium says that's not enough. My best guess is that Quitter isn't running for some reason.
| Reporter | ||
Comment 156•12 years ago
|
||
Running pgo-profile-run locally (where it doesn't hang) I can see Quitter is installed and running. But if I rename the $OBJDIR/dist/xpi-stage/quitter directory to something else, it doesn't get run, and the hang occurs.
I checked the build slave, but $OBJDIR/dist/xpi-stage/quitter is present and its contents are identical to those on my local machine. Which means that Quitter is present and correct, but not being picked up for some reason.
| Reporter | ||
Comment 157•12 years ago
|
||
I re-enabled --vex-iropt-register-updates=allregs-at-each-insn because it's clearly not the problem. Also, it prevents crashes, which confusingly show up as green runs on TBPL due to bug 823787:
https://hg.mozilla.org/build/tools/rev/61bc99912845
| Reporter | ||
Comment 158•12 years ago
|
||
Aha!
http://mxr.mozilla.org/mozilla-central/source/build/pgo/profileserver.py#51 looks like this:
49 profile = FirefoxProfile(profile=profilePath,
50 preferences=prefs,
51 addons=[os.path.join(build.distdir, 'xpi-stage', 'quitter')],
52 locations=locations)
That's what tells Firefox to look in the non-standard place for the Quitter extension.
I'm still trying to work out what goes wrong w.r.t. this code on the build slave.
Comment 159•12 years ago
|
||
Right, that's what I was thinking back in comment 138 and comment 139. I don't know offhand what would have broken it. The code that added quitter dates back 3 months: bug 641829.
Flags: needinfo?(ted)
| Reporter | ||
Comment 160•12 years ago
|
||
> I'm still trying to work out what goes wrong w.r.t. this code on the build
> slave.
The profile directory (which is created in /tmp) disappears earlier in the build slave run, somehow. If I remove the clean-up code on the very last line of profileserver.py, on my local machine the profile directory survives the run, but on the build slave the profile directory doesn't survive the run.
| Reporter | ||
Comment 161•12 years ago
|
||
> The profile directory (which is created in /tmp) disappears earlier in the
> build slave run, somehow. If I remove the clean-up code on the very last
> line of profileserver.py, on my local machine the profile directory survives
> the run, but on the build slave the profile directory doesn't survive the
> run.
Wait, on the build slave I think the try/ directory used in the chroot during the build slave run isn't the same as the standard /try, so that observation is probably bogus.
| Reporter | ||
Comment 162•12 years ago
|
||
So when Firefox starts up, both machines appear to have a valid profile with the Quitter extension installed appropriately. So I want to turn my attention to the loading of add-ons within Firefox, but I'm not sure where/how that happens...
| Reporter | ||
Comment 163•12 years ago
|
||
Here's progress: I've done various bits of logging relating to extension loading. There's something very weird about the file QuitterObserver.js on the build slaves -- the add-on service complains about not being able to get its last-modified-time, and my ad hoc code to query it after installation from profileserver.py fails, even though the parent directory claims to contain it.
So just now I tried changing $OBJDIR/dist/xpi-stage/quitter/components/QuitterObserver.js from a symlink (which points to tools/quitter/QuitterObserver.js) to a real file (by copying tools/quitter/QuitterObserver.js) and *the damn thing runs without hanging*. It works.
WTF?
One curious thing is that on my local machine the symlink has an absolute path:
> lrwxrwxrwx 1 njn njn 50 Dec 7 21:46 QuitterObserver.js -> /home/njn/moz/mc1/tools/quitter/QuitterObserver.js
And on the build slave it has a tortuous relative path:
> lrwxrwxrwx 1 cltbld mock_mozilla 71 Dec 11 12:40 QuitterObserver.js -> ../../../../tools/quitter/../../../src/tools/quitter/QuitterObserver.js
I don't know why this would matter but it seems worth mentioning.
| Reporter | ||
Comment 164•12 years ago
|
||
> I don't know why this would matter but it seems worth mentioning.
I just tried a symlink with an absolute path:
> lrwxrwxrwx 1 cltbld cltbld 81 Dec 11 20:13 QuitterObserver.js -> /builds/slave/m-cen-l64-valgrind-00000000000/src/tools/quitter/QuitterObserver.js
and that worked too.
| Reporter | ||
Comment 165•12 years ago
|
||
I bet this has something to do with the chroot these tests are run in -- that screws up the relative symlink somehow.
| Reporter | ||
Comment 166•12 years ago
|
||
Anti-climax time: it appears that while I was doing all this digging, glandium fixed the problem unintentionally in bug 948275 -- Valgrind runs are currently green on m-i and m-c!
| Reporter | ||
Comment 167•12 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #166)
> Anti-climax time: it appears that while I was doing all this digging,
> glandium fixed the problem unintentionally in bug 948275 -- Valgrind runs
> are currently green on m-i and m-c!
I confirmed this: I updated the build slave's m-c tree and re-ran. No hang, and the symlink is absolute:
> lrwxrwxrwx 1 cltbld mock_mozilla 81 Dec 11 21:20 QuitterObserver.js -> /builds/slave/m-cen-l64-valgrind-00000000000/src/tools/quitter/QuitterObserver.js
| Reporter | ||
Updated•12 years ago
|
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Reporter | ||
Comment 169•12 years ago
|
||
(In reply to TBPL Robot from comment #168)
> KWierso
> https://tbpl.mozilla.org/php/getParsedLog.php?id=35822195&tree=Mozilla-
> Inbound
> Linux x86-64 mozilla-inbound valgrind on 2014-03-07 13:47:33
> revision: 8394fed3332e
> slave: bld-linux64-spot-383
>
> command timed out: 1200 seconds without output, attempting to kill
Huh, that's weird. |mach valgrind-test| produced no output. It's the the same as the previous issue, because the "Quitter is not defined" string doesn't appear in the logs.
| Reporter | ||
Comment 170•12 years ago
|
||
> It's the the same as the previous issue
That should say "It's not the same..."
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 176•11 years ago
|
||
The harness needs to catch these hangs and output unique/identifiable cause for each (similar to how we do this for the other suites), at present, this doesn't meet:
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#6.29_Outputs_failures_in_a_TBPL-starrable_format
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Summary: Valgrind-on-TBPL: "command timed out: 1200 seconds without output, attempting to kill" → Valgrind jobs intermittently fail with "command timed out: 1200 seconds without output, attempting to kill"
Updated•11 years ago
|
Flags: needinfo?(n.nethercote)
| Reporter | ||
Comment 177•11 years ago
|
||
(In reply to Ed Morley [:edmorley UTC+0] from comment #176)
> The harness needs to catch these hangs and output unique/identifiable cause
> for each (similar to how we do this for the other suites), at present, this
> doesn't meet:
> https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#6.
> 29_Outputs_failures_in_a_TBPL-starrable_format
I guess you're referring to this:
> Exceptions & timeouts must be handled with appropriate log output
> (eg: the failure line must state in which test the timeout occurred,
> not just that the entire run has timed out).
Looking at the hangs in the past month... those in comment 171 and comment 173 both occurred while Valgrind was actually running. The Valgrind test job is unusual in that there aren't distinct tests; it just runs a single thing.
The hangs in comment 172 and comment 174 were on this line of scripts/valgrind/valgrind.sh:
> python $SCRIPTS_DIR/buildfarm/utils/retry.py -s 1 -r 5 -t 3660 \
> python $SCRIPTS_DIR/buildfarm/utils/hgtool.py --rev $REVISION \
> --bundle $HG_BUNDLE $HG_REPO src || exit 2
In both cases an |hg| command timed out.
And the hang in comment 175 is occurring during this command:
> MOZCONFIG=$srcdir/browser/config/mozconfigs/linux${_arch}/valgrind make -f $srcdir/client.mk configure || exit 2
Specifically, during this part of the build:
> /builds/slave/fx-team-l64-valgrind-000000000/src/obj-firefox/_virtualenv/bin/python /builds/slave/fx-team-l64-valgrind-000000000/src/config/expandlibs_exec.py --depend .deps/libjs_static.a.pp --target libjs_static.a --extract -- ar crs libjs_static.a jitprofiling.o RegExp.o Parser.o jsarray.o jsatom.o jsmath.o jsutil.o ExecutableAllocatorPosix.o MacroAssemblerX86Common.o CTypes.o Library.o pm_linux.o Unified_cpp_js_src0.o Unified_cpp_js_src1.o Unified_cpp_js_src2.o Unified_cpp_js_src3.o Unified_cpp_js_src4.o Unified_cpp_js_src5.o Unified_cpp_js_src6.o Unified_cpp_js_src7.o Unified_cpp_js_src8.o Unified_cpp_js_src9.o Unified_cpp_js_src10.o Unified_cpp_js_src11.o ctypes/libffi/.libs/libffi.a ../../intl/icu/target/lib/libicui18n.a ../../intl/icu/target/lib/libicuuc.a ../../intl/icu/target/lib/libicudata.a
which is the linking of libjs_static.a.
I'm happy to hear suggestions how these different kinds of failures (some of which appear to be infra problems) should be reported.
Flags: needinfo?(emorley)
| Reporter | ||
Updated•11 years ago
|
Flags: needinfo?(n.nethercote)
Comment 178•11 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #177)
> I guess you're referring to this:
>
> > Exceptions & timeouts must be handled with appropriate log output
> > (eg: the failure line must state in which test the timeout occurred,
> > not just that the entire run has timed out).
>
> Looking at the hangs in the past month... those in comment 171 and comment
> 173 both occurred while Valgrind was actually running. The Valgrind test job
> is unusual in that there aren't distinct tests; it just runs a single thing.
It would be great if the harness had a max runtime and/or a max time without output (similar to buildbot, but set to something shorter than the buildbot times), so that the harness could kill the valgrind job and output something more identifiable (once shown in the TBPL annotated summary), eg:
TEST-UNEXPECTED-FAIL | valgrind | Timed out during foo
or
Automation Error: Valgrind job exceeded max runtime
or something of that form.
> The hangs in comment 172 and comment 174 were on this line of
> scripts/valgrind/valgrind.sh:
>
> > python $SCRIPTS_DIR/buildfarm/utils/retry.py -s 1 -r 5 -t 3660 \
> > python $SCRIPTS_DIR/buildfarm/utils/hgtool.py --rev $REVISION \
> > --bundle $HG_BUNDLE $HG_REPO src || exit 2
>
> In both cases an |hg| command timed out.
retry.py should timeout & retry N times, after which if there was still no luck, outputting something like:
Automation Error: Timeout cloning from foo
See bug 808536, bug 961048, bug 961042 for examples of fixing this elsewhere in automation.
> And the hang in comment 175 is occurring during this command:
>
> > MOZCONFIG=$srcdir/browser/config/mozconfigs/linux${_arch}/valgrind make -f $srcdir/client.mk configure || exit 2
>
> Specifically, during this part of the build:
>
> > /builds/slave/fx-team-l64-valgrind-000000000/src/obj-firefox/_virtualenv/bin/python /builds/slave/fx-team-l64-valgrind-000000000/src/config/expandlibs_exec.py --depend .deps/libjs_static.a.pp --target libjs_static.a --extract -- ar crs libjs_static.a jitprofiling.o RegExp.o Parser.o jsarray.o jsatom.o jsmath.o jsutil.o ExecutableAllocatorPosix.o MacroAssemblerX86Common.o CTypes.o Library.o pm_linux.o Unified_cpp_js_src0.o Unified_cpp_js_src1.o Unified_cpp_js_src2.o Unified_cpp_js_src3.o Unified_cpp_js_src4.o Unified_cpp_js_src5.o Unified_cpp_js_src6.o Unified_cpp_js_src7.o Unified_cpp_js_src8.o Unified_cpp_js_src9.o Unified_cpp_js_src10.o Unified_cpp_js_src11.o ctypes/libffi/.libs/libffi.a ../../intl/icu/target/lib/libicui18n.a ../../intl/icu/target/lib/libicuuc.a ../../intl/icu/target/lib/libicudata.a
>
> which is the linking of libjs_static.a.
Not sure about this one. Let's skip over it for now.
> I'm happy to hear suggestions how these different kinds of failures (some of
> which appear to be infra problems) should be reported.
Thank you for looking at this - I really do appreciate it :-)
Blocks: 778688
Flags: needinfo?(emorley)
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 188•11 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=45060194&tree=Fx-Team
Adjusting the summary to pick up the new failure line.
Summary: Valgrind jobs intermittently fail with "command timed out: 1200 seconds without output, attempting to kill" → Valgrind jobs intermittently fail with "command timed out: 1200 seconds without output running ['mock_mozilla', '-r', 'mozilla-centos6-x86_6"
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Legacy TBPL/Treeherder Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
Updated•8 years ago
|
Component: New Frameworks → General
| Assignee | ||
Updated•6 years ago
|
Assignee: nobody → gbrown
Status: REOPENED → RESOLVED
Closed: 12 years ago → 6 years ago
Resolution: --- → INACTIVE
You need to log in
before you can comment on or make changes to this bug.
Description
•