Closed Bug 1011786 Opened 10 years ago Closed 9 years ago

Intermittent parse-array-gc.js | application timed out after 330 seconds with no output

Categories

(Core :: JavaScript Engine, defect)

Platform: x86_64
OS: Windows 8.1
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla45
Tracking Status: firefox45 --- fixed

People

(Reporter: KWierso, Assigned: jonco)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure)

Attachments

(2 files, 1 obsolete file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=39846632&tree=Mozilla-Inbound
WINNT 6.2 mozilla-inbound debug test jsreftest on 2014-05-16 14:27:27 PDT for push 505f38b0649b

slave: t-w864-ix-077

14:42:33     INFO -  REFTEST TEST-START | file:///C:/slave/test/build/tests/jsreftest/tests/jsreftest.html?test=ecma_5/JSON/parse-array-gc.js
14:42:33     INFO -  REFTEST TEST-LOAD | file:///C:/slave/test/build/tests/jsreftest/tests/jsreftest.html?test=ecma_5/JSON/parse-array-gc.js | 1104 / 6755 (16%)
14:42:33     INFO -  ++DOMWINDOW == 335 (194449B0) [pid = 1176] [serial = 2189] [outer = 0B7BB0D0]
14:42:33     INFO -  852563: IdValuePair::value should be initialized to avoid GC sequence-point issues
14:42:34     INFO -  Note: You must run this test under valgrind to be certain it passes
14:42:34     INFO -  Tests complete

[Bunch of debug spew omitted]

14:48:06  WARNING -  TEST-UNEXPECTED-FAIL | file:///C:/slave/test/build/tests/jsreftest/tests/jsreftest.html?test=ecma_5/JSON/parse-array-gc.js | application timed out after 330 seconds with no output
14:48:06     INFO -  TEST-INFO | screenshot: exit status 0
14:48:06     INFO -  TEST-INFO | crashinject: exit status 0
14:48:08     INFO -  TEST-INFO | Main app process: exit status c0000005
14:48:08     INFO -  INFO | automation.py | Application ran for: 0:12:14.847000
14:48:08     INFO -  INFO | zombiecheck | Reading PID log: c:\users\cltbld~1.t-w\appdata\local\temp\tmp4hji1vpidlog
14:48:16  WARNING -  PROCESS-CRASH | file:///C:/slave/test/build/tests/jsreftest/tests/jsreftest.html?test=ecma_5/JSON/parse-array-gc.js | application crashed [@ CrashingThread(void *)]
14:48:16     INFO -  Crash dump filename: c:\users\cltbld~1.t-w\appdata\local\temp\tmpse0gvd.mozrunner\minidumps\1cf52745-ee2f-422c-ab41-d0cf1b81dbec.dmp
14:48:16     INFO -  Operating system: Windows NT
14:48:16     INFO -                    6.2.9200
14:48:16     INFO -  CPU: x86
14:48:16     INFO -       GenuineIntel family 6 model 30 stepping 5
14:48:16     INFO -       8 CPUs
14:48:16     INFO -  Crash reason:  EXCEPTION_ACCESS_VIOLATION_WRITE
14:48:16     INFO -  Crash address: 0x0
14:48:16     INFO -  Thread 30 (crashed)
14:48:16     INFO -   0  crashinjectdll.dll!CrashingThread(void *) [crashinjectdll.cpp:505f38b0649b : 17 + 0x0]
14:48:16     INFO -      eip = 0x6dcb1000   esp = 0x0da0f8c8   ebp = 0x0da0f8d0   ebx = 0x00000000
14:48:16     INFO -      esi = 0x00000000   edi = 0x00000000   eax = 0x76ad8535   ecx = 0x00000000
14:48:16     INFO -      edx = 0x6dcb1000   efl = 0x00010246
14:48:16     INFO -      Found by: given as instruction pointer in context
14:48:16     INFO -   1  kernel32.dll + 0x28542
14:48:16     INFO -      eip = 0x76ad8543   esp = 0x0da0f8cc   ebp = 0x0da0f8d0
14:48:16     INFO -      Found by: call frame info
14:48:16     INFO -   2  ntdll.dll + 0x5ac68
14:48:16     INFO -      eip = 0x77c1ac69   esp = 0x0da0f8d8   ebp = 0x0da0f914
14:48:16     INFO -      Found by: previous frame's frame pointer
14:48:16     INFO -   3  ntdll.dll + 0x5ac3b
14:48:16     INFO -      eip = 0x77c1ac3c   esp = 0x0da0f91c   ebp = 0x0da0f92c
14:48:16     INFO -      Found by: previous frame's frame pointer
Terrence, this is a pretty regular Windows failure. Can you please take a look or assign someone else?
Flags: needinfo?(terrence)
I think I just fixed this iloop in bug 1055163. Please n-i me again if you have to star another one like this!
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(terrence)
Resolution: --- → DUPLICATE
Comment 39 is definitely from after bug 1055163 was merged around.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #40)
> Comment 39 is definitely from after bug 1055163 was merged around.

The new stacks appear to be somewhat different from the couple I looked at before. This appears to be a second bug, this time a GC triggered from ~AutoEntryScript while QI'ing an nsWeakReference.
I have a try push that hit this failure 6/10 times: https://tbpl.mozilla.org/?tree=Try&rev=d26c09efa035 (It's a weird build, VS2013 Debug PGO -- I forgot that my try patch had the PGO flag!)

Terrence, does a mostly-reproducible binary help you? That build even has full PDBs.
Flags: needinfo?(terrence)
(In reply to David Major [:dmajor] from comment #53)
> I have a try push that hit this failure 6/10 times:
> https://tbpl.mozilla.org/?tree=Try&rev=d26c09efa035 (It's a weird build,
> VS2013 Debug PGO -- I forgot that my try patch had the PGO flag!)
> 
> Terrence, does a mostly-reproducible binary help you? That build even has
> full PDBs.

Yes indeed! If I can get it in a debugger, that would be very helpful. I've requested some re-runs on win8 so we can see if it happens there too. Would be very convenient, since I already have a dev environment on my desk for win8, but I could dig up a win7 box if needed.
Flags: needinfo?(terrence)
Looks like no: win7 only. What could even cause that?
IIRC, our Win7 slaves are 32bit and Win8 is 64bit. So address space limitations?
The machines look OK on address space, although I guess it could affect memory layout.

TotalVirtualMemory=2147352576 AvailableVirtualMemory=1663873024
TotalVirtualMemory=2147352576 AvailableVirtualMemory=1656782848
TotalVirtualMemory=2147352576 AvailableVirtualMemory=1663553536
Summary: Intermittent parse-array-gc.js | application crashed [@ CrashingThread(void *)] → Intermittent parse-array-gc.js | application timed out after 330 seconds with no output
I have a 32-bit Win7 VM if you want me to try any experiments. Is there an easy way to run this test on a vanilla machine?
Flags: needinfo?(terrence)
(In reply to David Major [:dmajor] from comment #74)
> I have a 32-bit Win7 VM if you want me to try any experiments. Is there an
> easy way to run this test on a vanilla machine?

If we are lucky, it will reproduce under the js shell:
js/src/tests/jstests.py path/to/js --tbpl ecma_5/JSON/parse-array-gc.js

If not, then you are going to need to craft a manifest to pass to firefox with -reftest.

You'll need a file |jstests.list| adjacent to the test like:
url-prefix ../../jsreftest.html?test=ecma_5/JSON/
script parse-array-gc.js

Then you'll want to run firefox with -reftest pointing at the right jstests.list. There used to be a Makefile target for this that showed the raw command to ff. The best I was able to manage today was:  

/home/terrence/moz/branch/t/optdbgnightly-x86_64-unknown-linux-gnu/_virtualenv/bin/python _tests/reftest/runreftest.py --extra-profile-file=dist/plugins --symbols-path=dist/crashreporter-symbols  '/home/terrence/moz/branch/t/js/src/tests/ecma_5/JSON/jstests.list'
Flags: needinfo?(terrence)
I didn't get a chance to look at this before the try build aged out. I pushed a new build with similar settings on the same parent: https://tbpl.mozilla.org/?tree=Try&rev=c4c3c849dc3f
I couldn't reproduce this with the comment 109 build on my Win7 32-bit VM, set to 3GB RAM like the test machines. I guess our only hope is a loaner :(
ni? our GC folks for this ongoing timeout
Flags: needinfo?(terrence)
Flags: needinfo?(jcoppeard)
Looking at the most recent backtrace, it appears we are marking a string and crashing via an assertion in Cell::address under CheckTracedThing. The line in CheckTracedThing is |MOZ_ASSERT(thing->isAligned())|, but we're failing before we get there, inside Cell::address(); the higher assertion would presumably fail too if we reached it, because the address assertions are:

    MOZ_ASSERT(addr % CellSize == 0);
    MOZ_ASSERT(Chunk::withinArenasRange(addr));

So this is some sort of corrupt or old pointer; sadly the line information is not good enough to tell which assertion is failing. The string itself is stored inside of a JSScript. The JSScript was reached via a JSFunction on the mark stack. Looking at other failures, they all appear to be string failures, but they don't seem to be reached in the same ways, so I doubt scripts or functions are directly implicated.
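
For context, a minimal compileable sketch of that lower layer of checking (illustrative stand-ins, not the actual js/src definitions; the real CellSize and arena range are engine-defined):

    #include <cassert>
    #include <cstdint>

    // Illustrative stand-ins for SpiderMonkey's Cell/Chunk machinery.
    constexpr uintptr_t CellSize    = 8;
    constexpr uintptr_t ArenasStart = 0x1000;
    constexpr uintptr_t ArenasEnd   = 0x100000;

    static bool withinArenasRange(uintptr_t addr) {
        return addr >= ArenasStart && addr < ArenasEnd;
    }

    // A traced pointer must be cell-aligned and must point into a chunk's
    // arena region; a stale or corrupt pointer trips one of these checks
    // before CheckTracedThing's isAligned() assertion is ever reached.
    static uintptr_t cellAddress(const void* thing) {
        uintptr_t addr = reinterpret_cast<uintptr_t>(thing);
        assert(addr % CellSize == 0);
        assert(withinArenasRange(addr));
        return addr;
    }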

Yesterday, I greatly expanded the assertions we make on the internals of Ropes; maybe that will increase the failure rate here and get us more information.

Steve, might this be related to the rope bug you were looking into a few months (years?) ago?
Flags: needinfo?(sphink)
Note that this is not a 32-bit only problem. There are a fair number of 64-bit test runs mixed in, as well as some OSX test runs, though I didn't look closely to see if they're really the same. The symptoms don't seem to be the same across all of these, so there may be multiple things getting lumped in here.

The reported stack is

js::GCMarker::mark<JSString>(JSString *) [Marking.cpp:fd468f69c6a2 : 799 + 0x0]
DoMarking<JSString *>(js::GCMarker *,JSString *) [Marking.cpp:fd468f69c6a2 : 661 + 0x13] 
JSFunction::trace(JSTracer *) [jsfun.cpp:fd468f69c6a2 : 684 + 0xe]

The only way I see to get from marking a JSFunction* to marking a JSString* without any frames in between is to mark the function's atom_. Marking the script from the function should push the script onto the mark stack, not process it immediately, so the part of comment 234 that says "The string itself is stored inside of a JSScript" looks incorrect to me. So it looks like a JSFunction::atom_ field is corrupt.

If atom_ were corrupt during normal execution, I would expect it to hang when it was used (eg if a rope's child pointed back to itself or something). That makes me suspect it is getting corrupted during GC, but I have no proof. But! If this *were* something like a rope's DAG getting a cycle, then I would expect the hang stack to be under GCMarker::eagerlyMarkChildren(JSRope* rope). Which it might be; that function is marked inline. I'd sort of like to know whether this is happening, so I'm going to try adding an assert that the depth of the rope doesn't get out of hand.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f918a40383d8
100k is chosen arbitrarily. I guess I'll see with the try push if we intentionally push the limits of nesting craziness.
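
A minimal sketch of such a depth diagnostic, assuming a simplified rope walk (the actual change is attachment 8654431; names and structure here are illustrative):

    #include <cassert>
    #include <cstddef>

    // Stand-in for JSRope: an interior rope node with two children, where
    // a null child stands in for a leaf (linear) string.
    struct Rope {
        Rope* left;
        Rope* right;
    };

    // 100k matches the arbitrary threshold mentioned above: deeper than
    // this and we assume the "rope" is actually a cycle, and crash with a
    // useful assertion instead of hanging forever.
    constexpr size_t DeepRopeThreshold = 100000;

    void markRopeChildren(Rope* rope) {
        size_t depth = 0;
        while (rope) {
            assert(++depth < DeepRopeThreshold && "implausibly deep rope; likely a cycle");
            // ... mark rope->left here (the real marker pushes pending
            // children onto its own stack rather than recursing) ...
            rope = rope->right;  // descend the right spine
        }
    }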
Attachment #8654431 - Flags: review?(terrence)
Assignee: nobody → sphink
Status: REOPENED → ASSIGNED
Comment on attachment 8654431 [details] [diff] [review]
Diagnostic patch to detect cyclic ropes

Review of attachment 8654431 [details] [diff] [review]:
-----------------------------------------------------------------

Great idea!
Attachment #8654431 - Flags: review?(terrence) → review+
The simplistic patch found a couple of tests where we intentionally create very deep ropes. Asking for re-review since my fix for that is fully within the "excessively clever" bucket, and I also wanted an opinion on whether this might be a good idea for a JS_DIAGNOSTICS_ASSERT. I *think* it ought to be minimal overhead, but it does bloat up the stack frame to be pretty large (800 extra bytes), and so might defeat inlining or something. I guess awfy would catch that immediately, though.

Passes try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1b39f0d06168
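
For the record, a sketch of the trick (illustrative, not the patch itself): keep the last N ropes in a fixed on-stack ring buffer and, once the walk is suspiciously deep, check each new rope against that history; 100 eight-byte pointers accounts for the ~800 bytes of extra stack mentioned above.

    #include <cassert>
    #include <cstddef>

    struct Rope {
        Rope* left;
        Rope* right;
    };

    constexpr size_t RopeCycleHistory  = 100;     // ring buffer size (~800 bytes of pointers)
    constexpr size_t DeepRopeThreshold = 100000;  // only start checking once this deep

    void markRopeChildrenWithCycleCheck(Rope* rope) {
        Rope* history[RopeCycleHistory];  // the stack-frame bloat lives here
        size_t depth = 0;
        while (rope) {
            if (depth >= DeepRopeThreshold) {
                // Any cycle of length <= RopeCycleHistory must revisit one
                // of the last RopeCycleHistory nodes; by this depth every
                // history slot has been written, so scanning all of them
                // is safe.
                for (size_t i = 0; i < RopeCycleHistory; i++)
                    assert(history[i] != rope && "cycle detected in rope");
            }
            history[depth % RopeCycleHistory] = rope;
            depth++;
            // ... mark rope->left here ...
            rope = rope->right;
        }
    }

This only catches cycles shorter than the history size, but a longer cycle would still trip the depth assertion, so nothing hangs silently.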
Attachment #8655082 - Flags: review?(terrence)
Attachment #8654431 - Attachment is obsolete: true
Comment on attachment 8655082 [details] [diff] [review]
Diagnostic patch to detect cyclic ropes

Review of attachment 8655082 [details] [diff] [review]:
-----------------------------------------------------------------

Wow, that's... I guess, let's try it.
Attachment #8655082 - Flags: review?(terrence) → review+
(In reply to Steve Fink [:sfink, :s:] from comment #294)
> If atom_ were corrupt during normal execution, I would expect it to hang
> when it was used (eg if a rope's child pointed back to itself or something).

Atoms are never ropes though so if atom_ appears to be a rope, maybe its flags are corrupt?
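
To illustrate the point: whether a string is a rope is just a bit in its flags word, so a single corrupted flags word can make a perfectly linear atom look like a rope (flag value and layout below are made up for illustration; they are not SpiderMonkey's actual encoding):

    #include <cstdint>

    // Hypothetical, simplified string header: rope-ness is derived purely
    // from flags, not from the payload. Corrupt the flags and the marker
    // will happily reinterpret an atom's character data as rope children.
    constexpr uint32_t ROPE_BIT = 1u << 0;

    struct StringHeader {
        uint32_t flags;
        bool isRope() const { return flags & ROPE_BIT; }
    };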
https://hg.mozilla.org/mozilla-central/rev/5b59f670cf52
Status: ASSIGNED → RESOLVED
Closed: 10 years ago9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla43
AFAICT, this should have been leave-open.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla43 → ---
(In reply to Jan de Mooij [:jandem] from comment #299)
> (In reply to Steve Fink [:sfink, :s:] from comment #294)
> > If atom_ were corrupt during normal execution, I would expect it to hang
> > when it was used (eg if a rope's child pointed back to itself or something).
> 
> Atoms are never ropes though so if atom_ appears to be a rope, maybe its
> flags are corrupt?

A very good point! Sadly, that argues for generalized corruption, which is a lot harder to track down.

The diagnostic patch I landed here should at least convert these hangs into crashes. I guess they'll go into a different intermittent bug, though. At least the debug builds will.

(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #303)
> AFAICT, this should have been leave-open.

You are correct, thank you.
Flags: needinfo?(terrence)
Flags: needinfo?(jcoppeard)
These are timing out when run in the browser, and the last line of the log before the timeout shows 'Tests complete'.  Also, this test sets gczeal(2).  My guess is that after the test script has finished and before the test harness resets the zeal, the browser tries to do some allocation-intensive operation which gets bogged down in constant GCs.

I grepped for other jstests that set zeal, and of the ~20 that do, this is the only one not to also reset it in the test script itself.  So although I don't know that this will fix things, I'm hopeful.

Note that this test calls gczeal(2, 1) but the frequency argument is silently ignored by the test harness!
Assignee: sphink → jcoppeard
Attachment #8691366 - Flags: review?(terrence)
Comment on attachment 8691366 [details] [diff] [review]
bug1011786-reftest-timeout

Review of attachment 8691366 [details] [diff] [review]:
-----------------------------------------------------------------

Wow, good find!
Attachment #8691366 - Flags: review?(terrence) → review+
https://hg.mozilla.org/mozilla-central/rev/226fe77322b7
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
Clearing what I *think* is a stale needinfo at this point.
Flags: needinfo?(sphink)