mochitest-plain timeout (hanging?) during shutdown on Windows

RESOLVED FIXED in mozilla1.9.3a1

Status

()

Core
DOM
RESOLVED FIXED
9 years ago
5 years ago

People

(Reporter: dbaron, Assigned: Ben Turner (not reading bugmail, use the needinfo flag!))

Tracking

({intermittent-failure})

Trunk
mozilla1.9.3a1
x86
Windows Server 2003
intermittent-failure
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(status1.9.2 beta1-fixed, status1.9.1 unaffected)

Details

Attachments

(4 attachments)

In the last few mochitest-plain runs, there's been a form of random orange that I haven't seen recently.  During the shutdown of the run, after printing:

92847 INFO Passed: 86162
92848 INFO Failed: 1
92849 INFO Todo:   1272
92850 INFO SimpleTest FINISHED


instead of recording the shutdown properly, we get:

command timed out: 1200 seconds without output
program finished with exit code 1


This makes the run orange.


This occurred on:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1248256012.1248265493.6079.gz
WINNT 5.2 mozilla-central unit test on 2009/07/22 02:46:52  

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1248259996.1248266626.19205.gz
WINNT 5.2 mozilla-central unit test on 2009/07/22 03:53:16  

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1248263042.1248266049.12600.gz
WINNT 5.2 mozilla-central test mochitests on 2009/07/22 04:44:02  

(The timeout is 300 seconds on the "unit test" boxes on Firefox and 1200 seconds on the "mochitests" boxes on Firefox-Unittest.)
(It's possible this is related to bug 497053, since all three occurred in builds that also had that, although there have been many many more builds with that in the past 12 hours without this.)
Whiteboard: [orange]
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1248262168.1248269456.18193.gz
WINNT 5.2 mozilla-central unit test on 2009/07/22 04:29:28  

(also showed bug 497053, but the last cycle that should)
Also happened in:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1248269038.1248271997.14905.gz
WINNT 5.2 mozilla-central test mochitests on 2009/07/22 06:23:58  

which is post-bug 497053 (so did not show it), but did show bug 505217.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1248336876.1248347373.12002.gz
WINNT 5.2 mozilla-central unit test on 2009/07/23 01:14:36  

... happened here in a log with no other test failures.

Comment 8

9 years ago
WINNT 5.2 mozilla-central unit test on 2009/07/23
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1248385103.1248391746.5955.gz&fulltext=1
Looking at the buildbot logs we have going back to July 18, the problem is
* is win32 only
* has occurred 21 times in mozilla-central unittest builds
 * first example is Mon Jul 20 22:01:00 2009 PDT (rev 682906e2a48b)
 * happens in roughly one third of the builds
* has occurred 20 times in mozilla-central packaged mochitest runs
 * first example is Tue Jul 21 04:07:41 2009 PDT (rev e50cbaef1d8f)
 * about 1/3 of builds again
* occurred once on tracemonkey, at Thu Jul 23 14:24:16 2009 (rev 9cf9a10f7e49, looks like a merge from mozilla-central), only 5 builds since then so the stats aren't super strong

Which points to the first few checkins at the top of 
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=d6895cb5ac3b&tochange=682906e2a48b
Fwiw, the symptom reminds me of bug 471085;
but I don't know if there is a "video" regression here.
This has happened a bunch this morning already, in addition to the ones that Serge reported.

WINNT 5.2 mozilla-central test mochitests on 2009/07/27 07:42:56
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1248705776.1248708876.31326.gz

WINNT 5.2 mozilla-central unit test on 2009/07/27 08:00:58
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1248706858.1248715866.13640.gz

WINNT 5.2 mozilla-central test mochitests on 2009/07/27 09:52:18
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1248713538.1248716755.23655.gz

Comment 14

9 years ago
WINNT 5.2 mozilla-central test mochitests on 2009/07/28 12:26:20  
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1248809180.1248812108.11796.gz
Do any of the changes in comment #9 seem likely to cause problems ? Would it be helpful to repeat the analysis there ?
Depends on: 501034
I _may_ have caught this with windbg. Firefox was chugging away at 100% CPU, and after attaching and .dump /mf there is this
  http://people.mozilla.org/~nthomas/misc/mochitest-hang.dmp.bz2
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1249427525.1249433303.11673.gz#err0
Summary: mochitest-plain timing out (hanging?) during shutdown on Windows → mochitest-plain timeout (hanging?) during shutdown on Windows
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1249483284.1249489687.30260.gz
WINNT 5.2 mozilla-central unit test on 2009/08/05 07:41:24
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1249516254.1249522258.3598.gz
WINNT 5.2 mozilla-central unit test on 2009/08/05 16:50:54  


(FWIW, I looked at the file in comment 16 but I have no idea what kind of file it is or what to do with it; maybe somebody with more Windows knowledge does.)
It's a minidump, you should be able to load it in WinDBG or the Visual C++ debugger, load the symbols from the symbol server, and get a stack trace.
Oh, crap. I realized this is from a unittest build, which means the symbols aren't on the symbol server. :-/

Nick: how did you notice this? Just dumb luck? If you can grab this again it would probably be better if you could just follow the steps here:
https://developer.mozilla.org/en/How_to_get_a_stacktrace_with_WinDbg
Indeed, that was the doc what I was trying to use, with the symbol path updated for the location of the unpacked crashreporter-symbols.zip. WinDbg was failing to resolve the symbols regardless, falling back to export symbols. It's not hard to reproduce this so if someone would to hold my hand on IRC we can try to get some info.
Oh, one difference from that doc was attaching to the process while it was churning away post-mochitest, rather than launching it from the debugger.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1249529101.1249531761.8411.gz
WINNT 5.2 mozilla-central test mochitests on 2009/08/05 20:25:01
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1249545017.1249547678.16295.gz
WINNT 5.2 mozilla-central test mochitests on 2009/08/06 00:50:17
I'd be happy to hand-hold, but our timezone difference is generally fail. Regardless, if you have questions, you know where to find me.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1249751531.1249760853.13878.gz
WINNT 5.2 mozilla-central unit test on 2009/08/08 10:12:11
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1249769074.1249775146.5829.gz
WINNT 5.2 mozilla-central unit test on 2009/08/08 15:04:34
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1249949013.1249951952.9560.gz
WINNT 5.2 mozilla-central test mochitests on 2009/08/10 17:03:33
Created attachment 393704 [details]
Hang analysis & stack

Courtesy of WinDbg and timeless's help.

<timeless_mbp>	your problem is deadlock
<timeless_mbp>	thread 11 is more interesting
<timeless_mbp>  afaict thread 1 is trying to have threads like 11 die
<timeless_mbp>  but it's unclear as to whether thread 11 was sent or missed the message
So it's DOM workers
Suggestions for reproducing the problem locally:
* from http://ftp.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-win32-unittest/ d/l the build, tests and crash symbols 
* unpack them so that you have a firefox/, mochitest/ and symbols/
* expand the symbols files, eg find symbols/ -name '*.pd_' | perl -nle '$a=$_;$a=~s/_$/b/; print $a;system("expand $_ $a")'
* build slaves have the following defined in the environment (although I doubt all are relevant)
export MOZ_AIRBAG=1
export MOZ_CRASHREPORTER_NO_REPORT=1
export MOZ_NO_REMOTE=1
export MOZ_NO_RESET_PATH=1
export NO_EM_RESTART=1
export NO_FAIL_ON_TEST_ERRORS=1
export XPCOM_DEBUG_BREAK=warn
* build slaves run mochitest like this:
python mochitest/runtests.py --appname=firefox/firefox.exe --utility-path=bin --extra-profile-file=bin/plugins --certificate-path=certs --autorun --close-when-done --console-level=INFO --symbols-path=symbols --leak-threshold=484
* in windbg, attach to firefox, set your sympath, symfix it, reload (a la https://developer.mozilla.org/en/How_to_get_a_stacktrace_with_WinDbg)
Component: General → DOM: Mozilla Extensions
QA Contact: general → general
I'll certainly look at this tomorrow, but two things I want to point out/ask:

1. Originally the report was that firefox was churning up to 100% cpu. Is that what was happening here before you attached to the process? If so that means that something else must be sending lots of windows messages... I would suspect java somehow but see #2...

2. There are lots of threads that have warnings saying their stacks may be incorrect (all the JVM ones, for instance). Also, thread 11 doesn't start with |kernel32!BaseThreadStart| like all the others. What do we make of that? The first few frames are obviously wrong, |PL_HashTableRawLookup| doesn't ever call |nsDOMWorkerMessageHandler::DispatchEvent|. Could that be a weird result of JIT being enabled?
(In reply to comment #38)
> 1. Originally the report was that firefox was churning up to 100% cpu. Is that
> what was happening here before you attached to the process? 

That's right, using all the time the system would give it. Note that Java was updated to Java SE 6 Update 14 a month before this issue appeared (bug 495533).
Ok, I think I know what is wrong.
Assignee: nobody → bent.mozilla
Status: NEW → ASSIGNED
Created attachment 394114 [details] [diff] [review]
Fix

I <3 threads.
Attachment #394114 - Flags: superreview?(jst)
Attachment #394114 - Flags: review?(jst)
Comment on attachment 394114 [details] [diff] [review]
Fix

This looks good to me. There's a subtle hint that bent gave me over IRC: the worker can only become canceled if the pool's monitor is held, which eliminates the race.
Attachment #394114 - Flags: review?(jst) → review+
Comment on attachment 394114 [details] [diff] [review]
Fix

sr=jst. Please add the subtle hint you gave Blake in a comment in this code when landing...
Attachment #394114 - Flags: superreview?(jst) → superreview+
Pushed changeset 340fe75c03dd to mozilla-central.

Crap, I pushed without seeing your request for a comment. I'll add one.
Status: ASSIGNED → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
Pushed the comment to mozilla-central, changeset e53964cf3d20.
Target Milestone: --- → mozilla1.9.2b1
I hate to say it, but it looks like this hasn't fixed the problems on tinderbox; it's happened 4 times since the fix landed:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1250166961.1250170279.31431.gz
WINNT 5.2 mozilla-central test mochitests on 2009/08/13 05:36:01  

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1250151248.1250159957.14134.gz
WINNT 5.2 mozilla-central unit test on 2009/08/13 01:14:08  

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1250150308.1250160956.24829.gz
WINNT 5.2 mozilla-central unit test on 2009/08/13 00:58:28  

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1250154536.1250157623.21324.gz
WINNT 5.2 mozilla-central test mochitests on 2009/08/13 02:08:56
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1250171957.1250181773.12666.gz
WINNT 5.2 mozilla-1.9.2 unit test on 2009/08/13 06:59:17

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1250178579.1250188593.26996.gz
WINNT 5.2 mozilla-1.9.2 unit test on 2009/08/13 08:49:39

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1250172703.1250182275.18351.gz
WINNT 5.2 mozilla-central unit test on 2009/08/13 07:11:43
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1250176437.1250179666.19299.gz
WINNT 5.2 mozilla-central test mochitests on 2009/08/13 08:13:57  

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1250182257.1250185320.21382.gz
WINNT 5.2 mozilla-central test mochitests on 2009/08/13 09:50:57
Created attachment 394430 [details]
Hang analysis & stack post after rev 340fe75c03dd

This is from a build with rev 93a7af291db1, it took about 4 attempts to get it to hang on shutdown.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1250203551.1250206219.29584.gz
WINNT 5.2 mozilla-central test mochitests on 2009/08/13 15:45:51  

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1250214571.1250217372.27737.gz
WINNT 5.2 mozilla-central test mochitests on 2009/08/13 18:49:31
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1250273088.1250283600.6762.gz
WINNT 5.2 mozilla-central unit test on 2009/08/14 11:04:48  

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1250279366.1250282607.27944.gz
WINNT 5.2 mozilla-central test mochitests on 2009/08/14 12:49:26  

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1250274727.1250283146.1277.gz
WINNT 5.2 mozilla-central unit test on 2009/08/14 11:32:07
Ben asked for a ".dump /ma" from the build slave, which is here
   http://people.mozilla.org/~nthomas/bug505718/hang1/
with symbols etc. The stack in firefox-debug_16FC_2009-08-16_17-03-19-643.log looks different to attachment 394430 [details], so let me know if you want more dumps to explore the variations.
I pushed changeset bb1b6c42d78a, fingers crossed!
(In reply to comment #66)
> I pushed changeset bb1b6c42d78a, fingers crossed!

Fired off 8 extra packaged-mochitest runs (M on tbpl) on this revision to try to get good stats. Results in 30 minutes or so.
How does 9 green runs grab you ?
Thanks Nick! Couldn't have figured that out without your help.
Status: REOPENED → RESOLVED
Last Resolved: 9 years ago9 years ago
Resolution: --- → FIXED
Target Milestone: mozilla1.9.2b1 → mozilla1.9.3a1
Created attachment 395073 [details] [diff] [review]
Followup patch

This had r+sr=jst over irc.
Attachment #395073 - Flags: superreview+
Attachment #395073 - Flags: review+
Attachment #394114 - Flags: approval1.9.2?
Attachment #395073 - Flags: approval1.9.2?
Duplicate of this bug: 512344
Who can approve these two patches for 1.9.2 ?
Flags: blocking1.9.2?
sorry, second line in my previous comment was supposed to be:
WINNT 5.2 mozilla-1.9.2 unit test on 2009/09/17 10:42:14
(In reply to comment #72)
> Who can approve these two patches for 1.9.2 ?

Maybe jst could?

I think component owners have approval-granting power. (At least, I remember someone telling roc in a recent platform meeting that roc could grant a1.9.2 for layout bugs.)
Attachment #394114 - Flags: approval1.9.2? → approval1.9.2+
Comment on attachment 394114 [details] [diff] [review]
Fix

a=me to stop the orange
Attachment #395073 - Flags: approval1.9.2? → approval1.9.2+
Attachment #394114 - Flags: approval1.9.2+
Comment on attachment 394114 [details] [diff] [review]
Fix

This landed before 1.9.2 branching. Oops.
status1.9.1: --- → unaffected
status1.9.2: --- → final-fixed
Flags: blocking1.9.2?
Comment on attachment 395073 [details] [diff] [review]
Followup patch

Pushed changeset 10a7ac8d7038 to mozilla-1.9.2.
Should be all done now.
status1.9.2: final-fixed → beta1-fixed
See also Bug 523319, which appears to be a new occurrence of this type of bug.
Keywords: intermittent-failure
Whiteboard: [orange]
Component: DOM: Mozilla Extensions → DOM
Product: Core → Core
You need to log in before you can comment on or make changes to this bug.