Open Bug 948145 Opened 6 years ago Updated 2 years ago

Valgrind jobs intermittently fail with "command timed out: 1200 seconds without output running ['mock_mozilla', '-r', 'mozilla-centos6-x86_6"

Categories: Testing :: General (defect)
Hardware: x86 Linux
Priority: Not set
Tracking: Not tracked
Status: REOPENED
People: Reporter: njn; Assignee: Unassigned
References: Blocks 1 open bug
Details: Keywords: intermittent-failure

Valgrind-on-TBPL runs are now often timing out, with this message:

> command timed out: 1200 seconds without output, attempting to kill

https://tbpl.mozilla.org/php/getParsedLog.php?id=31676175&tree=Mozilla-Central is a recent example in which Valgrind didn't produce any output.  Here's some of the output:

> + make pgo-profile-run 'EXTRA_TEST_ARGS=--debugger=valgrind --debugger-args='\''--error-exitcode=1 --smc-check=all-non-file --gen-suppressions=all --leak-check=full --num-callers=15 --show-possibly-lost=no --track-origins=yes --suppressions=/builds/slave/m-cen-l64-valgrind-00000000000/objdir/_valgrind/cross-architecture.sup --suppressions=/builds/slave/m-cen-l64-valgrind-00000000000/objdir/_valgrind/x86_64-redhat-linux-gnu.sup'\''' 
> /builds/slave/m-cen-l64-valgrind-00000000000/objdir/_virtualenv/bin/python ../src/build/pgo/profileserver.py --debugger=valgrind --debugger-args='--error-exitcode=1 --smc-check=all-non-file --gen-suppressions=all --leak-check=full --num-callers=15 --show-possibly-lost=no --track-origins=yes --suppressions=/builds/slave/m-cen-l64-valgrind-00000000000/objdir/_valgrind/cross-architecture.sup --suppressions=/builds/slave/m-cen-l64-valgrind-00000000000/objdir/_valgrind/x86_64-redhat-linux-gnu.sup'
> 
> command timed out: 1200 seconds without output, attempting to kill
> program finished with exit code 247

https://tbpl.mozilla.org/php/getParsedLog.php?id=31654180&tree=Mozilla-Central is the first example that I saw.  Valgrind produced almost all of its output, with only a bit truncated at the end:

> ==1979== LEAK SUMMARY:
> ==1979==    definitely lost: 9,032 bytes in 7 blocks
> ==1979==    indirectly lost: 1,552 bytes in 6 blocks
> ==1979==      possibly lost: 96,747 bytes in 1,749 blocks
> ==1979==    still reachable: 3,452,056 bytes in 6,091 blocks
> ==1979==         suppressed: 94,165 bytes in 1,775 blocks
> ==1979== Reachable blocks (those to which a pointer was found) ar
> command timed out: 1200 seconds without output, attempting to kill
> program finished with exit code 247
> elapsedTime=3493.368914
I turned on extra logging (-d -v) in an attempt to understand this better:
https://hg.mozilla.org/build/tools/rev/2b49e8cb95c1

In many of the runs, Valgrind isn't even printing its start-up messages.  Humph.
I just retriggered some old m-c Valgrind runs from Dec 6, which was before the timeout showed up.  They timed out.  This suggests that it wasn't a change in m-c that started causing the problems.  Ideas:

- Changes to valgrind.sh in the tools/ repo?  Or some other part of the testing infrastructure that is stored outside of mozilla-central?

- Changes to the build machines' configurations?

And it's really weird the way the valgrind output is being truncated.  E.g. the final calls to VG_(debugLog)() are printing some but not all of their output, despite the fact that the actual writing of the output happens in a single call to the write() syscall, using stderr.  It doesn't make sense that that operation could be failing/hanging/whatever... it really feels more like the harness that Valgrind is running inside is doing something wrong.
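
For what it's worth, one mundane way output can look truncated even though it was emitted in a single write() call is back-pressure on the pipe: if whatever is reading the child's stderr stops draining it, the kernel pipe buffer fills and the write blocks partway through the message. A rough stand-alone sketch (not Mozilla code) of that effect:

    import subprocess, sys, time

    # A child does one big write() to stderr; once the pipe buffer fills it blocks.
    child = subprocess.Popen(
        [sys.executable, "-c",
         "import os; os.write(2, b'x' * (1 << 20) + b' END\\n')"],
        stderr=subprocess.PIPE)

    time.sleep(1)                        # pretend the reader has stopped draining the pipe
    partial = child.stderr.read(4096)    # read only a little; the child stays blocked in write()
    print("read %d bytes; child still running: %s" % (len(partial), child.poll() is None))
    child.kill()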
Let's try running natively, instead of under Valgrind, in case that tells us anything:
https://hg.mozilla.org/build/tools/rev/7ad8a0bfc8cc
> it really feels more like the harness that Valgrind is running inside is doing
> something wrong.

> Let's try running natively, instead of under Valgrind, in case that tells us
> anything:

The timeout still occurs!  E.g.
https://tbpl.mozilla.org/php/getParsedLog.php?id=31726798&tree=Mozilla-Inbound

Well isn't that interesting.
> Well isn't that interesting.

Just to clarify:  this proves that it's not Valgrind that's hanging.  So what other component could be responsible?
> Let's try running natively, instead of under Valgrind, in case that tells us
> anything:
> https://hg.mozilla.org/build/tools/rev/7ad8a0bfc8cc

I reverted this change here:
https://hg.mozilla.org/build/tools/rev/8e17d3c1e37b
I temporarily disabled the |make pgo-profile-run| step, to see how that affects things:
https://hg.mozilla.org/build/tools/rev/3db2a940cdc6
(In reply to Nicholas Nethercote [:njn] from comment #124)
> I temporarily disabled the |make pgo-profile-run| step, to see how that
> affects things:
> https://hg.mozilla.org/build/tools/rev/3db2a940cdc6

They were green.  Not terribly surprising.  I reverted the change:
https://hg.mozilla.org/build/tools/rev/8a908482428e
(In reply to Nicholas Nethercote [:njn] from comment #26)
> I just retriggered some old m-c Valgrind runs from Dec 6, which was before
> the timeout showed up.  They timed out.  This suggests that it wasn't a
> change in m-c that started causing the problems.  Ideas:
> 
> - Changes to valgrind.sh in the tools/ repo?  Or some other part of the
> testing infrastructure that is stored outside of mozilla-central?
> 
> - Changes to the build machines' configurations?

Ick. Maybe catlee can help us find out if anything changed on RelEng's side in that timeframe?
Flags: needinfo?(catlee)
I don't see any relevant changes happening in puppet in that time.

The primary change is that we switched from running nightly to running per-push. This has resulted in one change to the build environment: MOZ_UPDATE_CHANNEL is unset rather than being set to "nightly".
Flags: needinfo?(catlee)
This results in slightly different arguments to configure:

--enable-update-channel=nightly 

vs

--enable-update-channel=
https://tbpl.mozilla.org/php/getParsedLog.php?id=31789399&full=1&branch=fx-team has a smoking gun:
JavaScript error: http://localhost:8888/index.html, line 55: Quitter is not defined

This is:
http://mxr.mozilla.org/mozilla-central/source/build/pgo/index.html?force=1#55

If the Quitter extension isn't being installed and we hit that error then it makes sense that it would hang.
I'm wondering if there's something weird going on here where the quitter addon install fails silently and then we hang.
Pretty sure http://hg.mozilla.org/build/tools/rev/c46f324a8b46 is the cause here. I'll remove --vex-iropt-register-updates=allregs-at-each-insn and try some new jobs.
(In reply to Chris AtLee [:catlee] from comment #140)
> here. I'll remove --vex-iropt-register-updates=allregs-at-each-insn and try

That strikes me as pretty unlikely to be the root cause, esp given that
it hangs when running natively (comment 47).

Some non-default setting of --vex-iropt-register-updates= now appears
to be pretty much mandatory; see eg bug 913876.  I suspect
--vex-iropt-register-updates=allregs-at-mem-access is the required
minimum now.
> (In reply to Chris AtLee [:catlee] from comment #140)
> > here. I'll remove --vex-iropt-register-updates=allregs-at-each-insn and try

Just to be sure, I reran V's testsuite on amd64-linux with this flag set,
and didn't get any failures as a result.
I was able to reproduce the hang on the same build with that flag set, and when I removed it, it exited successfully... Maybe there's something else at play here?

I see some green V's on inbound prior to me landing.

This failed:
https://tbpl.mozilla.org/?showall=1&jobname=valgrind&tree=Mozilla-Inbound&rev=316ab101c526

And this passed:
https://tbpl.mozilla.org/?showall=1&jobname=valgrind&tree=Mozilla-Inbound&rev=092c3bc22ae8

This code changed:
https://hg.mozilla.org/integration/mozilla-inbound/rev/10800c0df8b5
Oh, I guess that failure is a build failure, not a valgrind failure...
(In reply to Julian Seward from comment #142)
> (In reply to Chris AtLee [:catlee] from comment #140)
> > here. I'll remove --vex-iropt-register-updates=allregs-at-each-insn and try
> 
> That strikes me as pretty unlikely to be the root cause, esp given that
> it hangs when running natively (comment 47).

I agree.  I was seeing hangs prior to setting that flag.  Also, I re-triggered an old job from Dec 5 with an updated valgrind.sh which had that flag and it ran successfully.

> I see some green V's on inbound prior to me landing.

BTW, you can't trust green V's... well, hanging runs always show up as red, but other failures (such as crashes) show up as green due to bug 823787.

In fact, if you disable the --vex-iropt-register-updates=allregs-at-each-insn flag then Valgrind will sometimes crash before it hangs, and it'll show up as green, and you'll probably think everything went ok unless you look in the logs.  (The crash message involves "clone".)
I was able to reproduce on a build slave.  I had to ctrl-c to kill it.  Here's the relevant part of the output:


==10502== LEAK SUMMARY:
==10502==    definitely lost: 9,032 bytes in 7 blocks
==10502==    indirectly lost: 1,552 bytes in 6 blocks
==10502==      possibly lost: 96,747 bytes in 1,749 blocks
==10502==    still reachable: 3,449,372 bytes in 6,088 blocks
==10502==         suppressed: 94,171 bytes in 1,775 blocks
==10502== Reachable blocks (those to which a pointer was found) are not shown.
==10502== To see them, rerun with: --leak-check=full --show-reachable=yes
==10502==
==10502== For counts of detected and suppressed errors, rerun with: -v
==10502== ERROR SUMMARY: 7 errors from 7 contexts (suppressed: 15 from 15)
JavaScript error: http://localhost:8888/index.html, line 55: Quitter is not defined

^CTraceback (most recent call last):
  File "../src/build/pgo/profileserver.py", line 68, in <module>
    runner.wait()
  File "/builds/slave/m-cen-l64-valgrind-00000000000/src/testing/mozbase/mozrunner/mozrunner/runner.py", line 83, in wait
    self.process_handler.wait(timeout)
  File "/builds/slave/m-cen-l64-valgrind-00000000000/src/testing/mozbase/mozprocess/mozprocess/processhandler.py", line 741, in wait
    self.outThread.join(timeout=1)
  File "/tools/python27/lib/python2.7/threading.py", line 674, in join
    self.__block.wait(delay)
  File "/tools/python27/lib/python2.7/threading.py", line 262, in wait
    _sleep(delay)
KeyboardInterrupt
WARNING: Command failed. See logs for output.


So Valgrind finished (as indicated by the "ERROR SUMMARY" line).  The |Quitter is not defined| thing showed up after a while.  And then |runner.wait()| ended up hanging.
So my current theory is that something changed in mozharness.  Ted, what do you think, based on comment 148?
Flags: needinfo?(ted)
http://hg.mozilla.org/build/mozharness has all the changes to mozharness; which change do you suspect?
> If the Quitter extension isn't being installed and we hit that error then it
> makes sense that it would hang.

I was discounting that because the "Quitter is not defined" message wasn't showing up in logs of some of the hanging jobs.  Though maybe that's because buildbot's buffering is hiding it.  But if it was a factor, I wouldn't expect Valgrind to finish?  Hmm, not sure.
(In reply to Aki Sasaki [:aki] from comment #150)
> http://hg.mozilla.org/build/mozharness has all the changes to mozharness;
> which change do you suspect?

None specifically;  it's just a theory and could be totally wrong :)
Ok, Quitter is looking more likely.  I have the process hanging, and |pstree| says this:

    |-sshd-+-sshd---sshd---bash-+-mock_mozilla---valgrind.sh---make---python-+-firefox---30*[{firefox}]

I.e. Firefox is still running.  |top| tells me that it's not consuming any CPU.

So why isn't Quitter installed, has this changed lately, and can I remedy it?
So objdir/dist/xpi-stage/quitter is present, and has various symlinks back to src/tools/quitter/, so that looks reasonable...
> So objdir/dist/xpi-stage/quitter is present, and has various symlinks back
> to src/tools/quitter/, so that looks reasonable...

glandium says that's not enough.  My best guess is that Quitter isn't running for some reason.
Running pgo-profile-run locally (where it doesn't hang) I can see Quitter is installed and running.  But if I rename the $OBJDIR/dist/xpi-stage/quitter directory to something else, it doesn't get run, and the hang occurs.

I checked the build slave, but $OBJDIR/dist/xpi-stage/quitter is present and its contents are identical to those on my local machine.  Which means that Quitter is present and correct, but not being picked up for some reason.
I re-enabled --vex-iropt-register-updates=allregs-at-each-insn because it's clearly not the problem.  Also, it prevents crashes, which confusingly show up as green runs on TBPL due to bug 823787:

https://hg.mozilla.org/build/tools/rev/61bc99912845
Aha!

http://mxr.mozilla.org/mozilla-central/source/build/pgo/profileserver.py#51 looks like this:

49     profile = FirefoxProfile(profile=profilePath,
50                              preferences=prefs,
51                              addons=[os.path.join(build.distdir, 'xpi-stage', 'quitter')],
52                              locations=locations)

That's what tells Firefox to look in the non-standard place for the Quitter extension.

I'm still trying to work out what goes wrong w.r.t. this code on the build slave.
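
For what it's worth, a crude pre-flight check along these lines (hypothetical, not part of profileserver.py; the objdir-relative path is an assumption) would have flagged the problem before Firefox even started, by resolving every file under the staged add-on and confirming the symlink targets are actually readable:

    import os

    def check_addon_dir(addon_dir):
        # Walk the staged add-on, resolve symlinks, and report any file whose
        # target is missing or unreadable.
        bad = []
        for root, dirs, files in os.walk(addon_dir):
            for name in files:
                path = os.path.join(root, name)
                real = os.path.realpath(path)   # follows relative and absolute symlinks
                if not os.path.isfile(real) or not os.access(real, os.R_OK):
                    bad.append((path, real))
        return bad

    addon = os.path.join('dist', 'xpi-stage', 'quitter')   # assumed objdir-relative path
    for link, target in check_addon_dir(addon):
        print("broken add-on file: %s -> %s" % (link, target))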
Right, that's what I was thinking back in comment 138 and comment 139.  I don't know offhand what would have broken it. The code that added quitter dates back 3 months: bug 641829.
Flags: needinfo?(ted)
> I'm still trying to work out what goes wrong w.r.t. this code on the build
> slave.

The profile directory (which is created in /tmp) disappears earlier in the build slave run, somehow.  If I remove the clean-up code on the very last line of profileserver.py, on my local machine the profile directory survives the run, but on the build slave the profile directory doesn't survive the run.
> The profile directory (which is created in /tmp) disappears earlier in the
> build slave run, somehow.  If I remove the clean-up code on the very last
> line of profileserver.py, on my local machine the profile directory survives
> the run, but on the build slave the profile directory doesn't survive the
> run.

Wait, on the build slave I think the /tmp directory used in the chroot during the build slave run isn't the same as the standard /tmp, so that observation is probably bogus.
So when Firefox starts up, both machines appear to have a valid profile with the Quitter extension installed appropriately.  So I want to turn my attention to the loading of add-ons within Firefox, but I'm not sure where/how that happens...
Here's progress:  I've done various bits of logging relating to extension loading.  There's something very weird about the file QuitterObserver.js on the build slaves -- the add-on service complains about not being able to get its last-modified-time, and my ad hoc code to query it after installation from profileserver.py fails, even though the parent directory claims to contain it.

So just now I tried changing $OBJDIR/dist/xpi-stage/quitter/components/QuitterObserver.js from a symlink (which points to tools/quitter/QuitterObserver.js) to a real file (by copying tools/quitter/QuitterObserver.js) and *the damn thing runs without hanging*.  It works.

WTF?

One curious thing is that on my local machine the symlink has an absolute path:

> lrwxrwxrwx 1 njn njn 50 Dec  7 21:46 QuitterObserver.js -> /home/njn/moz/mc1/tools/quitter/QuitterObserver.js

And on the build slave it has a tortuous relative path:

> lrwxrwxrwx 1 cltbld mock_mozilla   71 Dec 11 12:40 QuitterObserver.js -> ../../../../tools/quitter/../../../src/tools/quitter/QuitterObserver.js

I don't know why this would matter but it seems worth mentioning.
> I don't know why this would matter but it seems worth mentioning.

I just tried a symlink with an absolute path:

> lrwxrwxrwx 1 cltbld cltbld         81 Dec 11 20:13 QuitterObserver.js -> /builds/slave/m-cen-l64-valgrind-00000000000/src/tools/quitter/QuitterObserver.js

and that worked too.
I bet this has something to do with the chroot these tests are run in -- that screws up the relative symlink somehow.
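
To illustrate that suspicion (a sketch with made-up paths, not the actual slave layout): a relative symlink encodes a path from its own location, so it only resolves if the directories it climbs through look the same from wherever the reader runs. If a chroot or bind mount exposes a different view of the tree, the same link can dangle:

    import os, shutil, tempfile

    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, 'src', 'tools', 'quitter'))
    os.makedirs(os.path.join(root, 'objdir', 'dist', 'xpi-stage', 'quitter', 'components'))
    open(os.path.join(root, 'src', 'tools', 'quitter', 'QuitterObserver.js'), 'w').close()

    link = os.path.join(root, 'objdir', 'dist', 'xpi-stage', 'quitter',
                        'components', 'QuitterObserver.js')
    os.symlink('../../../../../src/tools/quitter/QuitterObserver.js', link)  # relative link
    print(os.path.exists(link))        # True: the relative path climbs back to src/ correctly

    # Copy only the objdir elsewhere -- a crude stand-in for seeing the tree
    # through a different root -- and the same link now dangles.
    other = tempfile.mkdtemp()
    shutil.copytree(os.path.join(root, 'objdir'), os.path.join(other, 'objdir'), symlinks=True)
    print(os.path.exists(os.path.join(other, 'objdir', 'dist', 'xpi-stage', 'quitter',
                                      'components', 'QuitterObserver.js')))  # False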
Anti-climax time:  it appears that while I was doing all this digging, glandium fixed the problem unintentionally in bug 948275 -- Valgrind runs are currently green on m-i and m-c!
(In reply to Nicholas Nethercote [:njn] from comment #166)
> Anti-climax time:  it appears that while I was doing all this digging,
> glandium fixed the problem unintentionally in bug 948275 -- Valgrind runs
> are currently green on m-i and m-c!

I confirmed this:  I updated the build slave's m-c tree and re-ran.  No hang, and the symlink is absolute:

> lrwxrwxrwx 1 cltbld mock_mozilla 81 Dec 11 21:20 QuitterObserver.js -> /builds/slave/m-cen-l64-valgrind-00000000000/src/tools/quitter/QuitterObserver.js
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
(In reply to TBPL Robot from comment #168)
> KWierso
> https://tbpl.mozilla.org/php/getParsedLog.php?id=35822195&tree=Mozilla-
> Inbound
> Linux x86-64 mozilla-inbound valgrind on 2014-03-07 13:47:33
> revision: 8394fed3332e
> slave: bld-linux64-spot-383
> 
> command timed out: 1200 seconds without output, attempting to kill

Huh, that's weird. |mach valgrind-test| produced no output. It's the same as the previous issue, because the "Quitter is not defined" string doesn't appear in the logs.
> It's the the same as the previous issue

That should say "It's not the same..."
The harness needs to catch these hangs and output a unique/identifiable cause for each (similar to how we do this for the other suites); at present, this doesn't meet:
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#6.29_Outputs_failures_in_a_TBPL-starrable_format
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Summary: Valgrind-on-TBPL: "command timed out: 1200 seconds without output, attempting to kill" → Valgrind jobs intermittently fail with "command timed out: 1200 seconds without output, attempting to kill"
Flags: needinfo?(n.nethercote)
(In reply to Ed Morley [:edmorley UTC+0] from comment #176)
> The harness needs to catch these hangs and output unique/identifiable cause
> for each (similar to how we do this for the other suites), at present, this
> doesn't meet:
> https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#6.29_Outputs_failures_in_a_TBPL-starrable_format

I guess you're referring to this:

> Exceptions & timeouts must be handled with appropriate log output
> (eg: the failure line must state in which test the timeout occurred,
> not just that the entire run has timed out).

Looking at the hangs in the past month... those in comment 171 and comment 173 both occurred while Valgrind was actually running. The Valgrind test job is unusual in that there aren't distinct tests; it just runs a single thing.

The hangs in comment 172 and comment 174 were on this line of scripts/valgrind/valgrind.sh:

> python $SCRIPTS_DIR/buildfarm/utils/retry.py -s 1 -r 5 -t 3660 \
>      python $SCRIPTS_DIR/buildfarm/utils/hgtool.py --rev $REVISION \
>           --bundle $HG_BUNDLE $HG_REPO src || exit 2

In both cases an |hg| command timed out.

And the hang in comment 175 is occurring during this command:

> MOZCONFIG=$srcdir/browser/config/mozconfigs/linux${_arch}/valgrind make -f $srcdir/client.mk configure || exit 2

Specifically, during this part of the build:

> /builds/slave/fx-team-l64-valgrind-000000000/src/obj-firefox/_virtualenv/bin/python /builds/slave/fx-team-l64-valgrind-000000000/src/config/expandlibs_exec.py --depend .deps/libjs_static.a.pp --target libjs_static.a --extract -- ar crs libjs_static.a jitprofiling.o RegExp.o Parser.o jsarray.o jsatom.o jsmath.o jsutil.o ExecutableAllocatorPosix.o MacroAssemblerX86Common.o CTypes.o Library.o pm_linux.o Unified_cpp_js_src0.o Unified_cpp_js_src1.o Unified_cpp_js_src2.o Unified_cpp_js_src3.o Unified_cpp_js_src4.o Unified_cpp_js_src5.o Unified_cpp_js_src6.o Unified_cpp_js_src7.o Unified_cpp_js_src8.o Unified_cpp_js_src9.o Unified_cpp_js_src10.o Unified_cpp_js_src11.o ctypes/libffi/.libs/libffi.a  ../../intl/icu/target/lib/libicui18n.a ../../intl/icu/target/lib/libicuuc.a ../../intl/icu/target/lib/libicudata.a

which is the linking of libjs_static.a.

I'm happy to hear suggestions on how these different kinds of failures (some of which appear to be infra problems) should be reported.
Flags: needinfo?(emorley)
Flags: needinfo?(n.nethercote)
(In reply to Nicholas Nethercote [:njn] from comment #177)
> I guess you're referring to this:
> 
> > Exceptions & timeouts must be handled with appropriate log output
> > (eg: the failure line must state in which test the timeout occurred,
> > not just that the entire run has timed out).
> 
> Looking at the hangs in the past month... those in comment 171 and comment
> 173 both occurred while Valgrind was actually running. The Valgrind test job
> is unusual in that there aren't distinct tests; it just runs a single thing.

It would be great if the harness had a max runtime and/or a max time without output (similar to buildbot, but set to something shorter than the buildbot times), so that the harness could kill the valgrind job and output something more identifiable (once shown in the TBPL annotated summary), eg:
TEST-UNEXPECTED-FAIL | valgrind | Timed out during foo
or
Automation Error: Valgrind job exceeded max runtime

or something of that form.
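
Something along those lines, very roughly (a sketch only, not tied to the real harness code; the command, job name, and message text are placeholders):

    import subprocess, threading, time

    def run_with_output_watchdog(cmd, name, max_quiet_seconds=900):
        # Kill the job if it produces no output for max_quiet_seconds and emit
        # a TBPL-starrable failure line; the message format is a placeholder.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        last_output = [time.time()]

        def watchdog():
            while proc.poll() is None:
                if time.time() - last_output[0] > max_quiet_seconds:
                    print("TEST-UNEXPECTED-FAIL | %s | no output for %d seconds, killing job"
                          % (name, max_quiet_seconds))
                    proc.kill()
                    return
                time.sleep(10)

        t = threading.Thread(target=watchdog)
        t.daemon = True
        t.start()
        for line in iter(proc.stdout.readline, b''):
            last_output[0] = time.time()
            print(line.rstrip().decode('utf-8', 'replace'))
        return proc.wait()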

> The hangs in comment 172 and comment 174 were on this line of
> scripts/valgrind/valgrind.sh:
> 
> > python $SCRIPTS_DIR/buildfarm/utils/retry.py -s 1 -r 5 -t 3660 \
> >      python $SCRIPTS_DIR/buildfarm/utils/hgtool.py --rev $REVISION \
> >           --bundle $HG_BUNDLE $HG_REPO src || exit 2
> 
> In both cases an |hg| command timed out.

retry.py should time out & retry N times; if there's still no luck after that, it should output something like:
Automation Error: Timeout cloning from foo

See bug 808536, bug 961048, bug 961042 for examples of fixing this elsewhere in automation.
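
And for the hg/retry case, the shape of it might be something like this (a sketch, not the real retry.py; the argument names and message text are made up):

    import subprocess, threading, time

    def retry_with_timeout(cmd, attempts=5, timeout=3660, sleeptime=1):
        # Bound each attempt with its own timeout; after exhausting all attempts,
        # print an identifiable automation-error line.
        code = None
        for _ in range(attempts):
            proc = subprocess.Popen(cmd)
            timer = threading.Timer(timeout, proc.kill)   # kill the attempt if it stalls
            timer.start()
            try:
                code = proc.wait()
            finally:
                timer.cancel()
            if code == 0:
                return 0
            time.sleep(sleeptime)
        print("Automation Error: %r failed after %d attempts" % (cmd, attempts))
        return code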

> And the hang in comment 175 is occurring during this command:
> 
> > MOZCONFIG=$srcdir/browser/config/mozconfigs/linux${_arch}/valgrind make -f $srcdir/client.mk configure || exit 2
> 
> Specifically, during this part of the build:
> 
> > /builds/slave/fx-team-l64-valgrind-000000000/src/obj-firefox/_virtualenv/bin/python /builds/slave/fx-team-l64-valgrind-000000000/src/config/expandlibs_exec.py --depend .deps/libjs_static.a.pp --target libjs_static.a --extract -- ar crs libjs_static.a jitprofiling.o RegExp.o Parser.o jsarray.o jsatom.o jsmath.o jsutil.o ExecutableAllocatorPosix.o MacroAssemblerX86Common.o CTypes.o Library.o pm_linux.o Unified_cpp_js_src0.o Unified_cpp_js_src1.o Unified_cpp_js_src2.o Unified_cpp_js_src3.o Unified_cpp_js_src4.o Unified_cpp_js_src5.o Unified_cpp_js_src6.o Unified_cpp_js_src7.o Unified_cpp_js_src8.o Unified_cpp_js_src9.o Unified_cpp_js_src10.o Unified_cpp_js_src11.o ctypes/libffi/.libs/libffi.a  ../../intl/icu/target/lib/libicui18n.a ../../intl/icu/target/lib/libicuuc.a ../../intl/icu/target/lib/libicudata.a
> 
> which is the linking of libjs_static.a.

Not sure about this one. Let's skip over it for now.

> I'm happy to hear suggestions on how these different kinds of failures (some of
> which appear to be infra problems) should be reported.

Thank you for looking at this - I really do appreciate it :-)
Blocks: 778688
Flags: needinfo?(emorley)
Depends on: 1029204
https://tbpl.mozilla.org/php/getParsedLog.php?id=45060194&tree=Fx-Team

Adjusting the summary to pick up the new failure line.
Summary: Valgrind jobs intermittently fail with "command timed out: 1200 seconds without output, attempting to kill" → Valgrind jobs intermittently fail with "command timed out: 1200 seconds without output running ['mock_mozilla', '-r', 'mozilla-centos6-x86_6"
Component: New Frameworks → General