Closed Bug 692605 Opened 12 years ago Closed 12 years ago

Perma-orange Aurora (and Beta) Linux opt (non-PGO) browser_webconsole_bug_595934_message_categories.js | Test timed out and test_errorPropagation.html | Test timed out followed by shutdown timeout

Categories

(Core :: XPCOM, defect)

x86
Linux
defect
Not set
blocker

Tracking

()

RESOLVED FIXED
mozilla10
Tracking Status
firefox7 --- unaffected
firefox8 --- fixed
firefox9 --- fixed
firefox10 --- fixed

People

(Reporter: mbrubeck, Assigned: bent.mozilla)

References

Details

(Keywords: intermittent-failure, Whiteboard: [qa-] [inbound][see comment #111])

Attachments

(1 file)

Ever since PGO was turned off for opt builds (bug 658313), these failures are occurring in every non-PGO Linux 32-bit opt build on Aurora (Firefox 9).

We're pretty sure that non-PGO is to blame.  The failures do not occur in PGO builds.  They do occur in non-PGO builds even on pushes that were green before the config change, and the failures kept occurring after backing out the changeset where they first occurred.

https://tbpl.mozilla.org/php/getParsedLog.php?id=6710381&tree=Mozilla-Aurora
Rev3 Fedora 12 mozilla-aurora opt test mochitests-3/5 on 2011-10-06 13:40:03 PDT for push 4c15202ca083

5937 INFO TEST-PASS | /tests/dom/workers/test/test_errorPropagation.html | Correct message event.filename - http://mochi.test:8888/tests/dom/workers/test/errorPropagation_worker.js should equal http://mochi.test:8888/tests/dom/workers/test/errorPropagation_worker.js
5938 INFO TEST-PASS | /tests/dom/workers/test/test_errorPropagation.html | Correct message event.lineno - 48 should equal 48
5939 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/workers/test/test_errorPropagation.html | Test timed out.
XScreenSaver state: Disabled
User input has been idle for 743 seconds
args: ['/home/cltbld/talos-slave/test/build/bin/screentopng']

...

16466 INFO SimpleTest FINISHED
TEST-UNEXPECTED-FAIL | Shutdown | application timed out after 330 seconds with no output
INFO | automation.py | Application ran for: 0:14:38.951252
INFO | automation.py | Reading PID log: /tmp/tmp6qt-3tpidlog
PROCESS-CRASH | Shutdown | application crashed (minidump found)
Crash dump filename: /tmp/tmpCIjJRu/minidumps/68928f9b-b21b-2ce0-74dd07a2-79f2fec2.dmp
Operating system: Linux
                  0.0.0 Linux 2.6.31.5-127.fc12.i686.PAE #1 SMP Sat Nov 7 21:25:57 EST 2009 i686
CPU: x86
     GenuineIntel family 6 model 23 stepping 10
     2 CPUs

Crash reason:  SIGABRT
Crash address: 0x847

Thread 0 (crashed)
 0  libpthread-2.11.so + 0x8edd
    eip = 0x00c6eedd   esp = 0xbf9a0074   ebp = 0xbf9a0078   ebx = 0x08051a80
    esi = 0xb7673040   edi = 0x93dd1000   eax = 0x00000003   ecx = 0x00000008
    edx = 0x00000000   efl = 0x00200246
    Found by: given as instruction pointer in context
 1  firefox-bin!malloc_mutex_unlock [jemalloc.c : 1476 + 0x5]
    eip = 0x0804b1ae   esp = 0xbf9a0080   ebp = 0xb7673040
    Found by: previous frame's frame pointer
 2  firefox-bin!arena_malloc [jemalloc.c : 3818 + 0xa]
    eip = 0x0804d78d   esp = 0xbf9a00a0   ebp = 0xb7673040   ebx = 0x08051a80
    Found by: call frame info
 3  firefox-bin!imalloc [jemalloc.c : 3892 + 0x10]
    eip = 0x0804da5c   esp = 0xbf9a00d0   ebp = 0xbf9a018c   ebx = 0x08051a80
    esi = 0x0000000a   edi = 0x00000000
    Found by: call frame info
 4  firefox-bin!malloc [jemalloc.c : 5986 + 0x6]
    eip = 0x0804da9a   esp = 0xbf9a00f0   ebp = 0xbf9a018c   ebx = 0x08051a80
    esi = 0x0000000a   edi = 0x00000000
    Found by: call frame info
 5  libmozalloc.so!moz_malloc [mozalloc.cpp : 113 + 0x8]
    eip = 0x00143b41   esp = 0xbf9a0100   ebp = 0xbf9a018c   ebx = 0x00144118
    esi = 0x00000002   edi = 0x00000000
    Found by: call frame info
 6  libxul.so!nsStringBuffer::Alloc [nsSubstring.cpp : 209 + 0x8]
    eip = 0x01c29a4d   esp = 0xbf9a0120   ebp = 0xbf9a018c   ebx = 0x02323174
    esi = 0x00000002   edi = 0x00000000
    Found by: call frame info
 7  libxul.so!nsAString_internal::MutatePrep [nsTSubstring.cpp : 162 + 0x8]
    eip = 0x01c29c1d   esp = 0xbf9a0140   ebp = 0xbf9a018c   ebx = 0x02323174
    esi = 0xa5e02e38   edi = 0x00000000
    Found by: call frame info
 8  libxul.so!nsAString_internal::ReplacePrepInternal [nsTSubstring.cpp : 198 + 0x13]
    eip = 0x01c29c48   esp = 0xbf9a0170   ebp = 0x00000000   ebx = 0x02323174
    esi = 0xa5e02e38   edi = 0x00000000
    Found by: call frame info
 9  libxul.so!nsAString_internal::ReplacePrep [nsTSubstring.h : 685 + 0x10]
    eip = 0x01c29d56   esp = 0xbf9a01a0   ebp = 0x00000000   ebx = 0x02323174
    esi = 0xa5e02e38   edi = 0x00000000
    Found by: call frame info

https://tbpl.mozilla.org/php/getParsedLog.php?id=6710557&tree=Mozilla-Aurora
Rev3 Fedora 12 mozilla-aurora opt test mochitest-other on 2011-10-06 13:40:06 PDT for push 4c15202ca083

TEST-PASS | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | test #4: message found 'multipart/form-data'
TEST-PASS | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | test #4: error category 'HTML'
TEST-INFO | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | Console message: [JavaScript Warning: "Form contains enctype=multipart/form-data, but does not contain method=post.  Submitting normally with method=GET and no enctype instead." {file: "http://example.com/browser/browser/devtools/webconsole/test//browser/test-bug-595934-html.html" line: 0}]
TEST-PASS | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | test #5: message found 'no element found'
TEST-PASS | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | test #5: error category 'malformed-xml'
TEST-INFO | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | Console message: [JavaScript Error: "no element found" {file: "http://example.com/browser/browser/devtools/webconsole/test//browser/test-bug-595934-malformedxml.xhtml" line: 11}]
TEST-PASS | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | test #6: message found 'fooBarSVG'
TEST-PASS | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | test #6: error category 'SVG'
TEST-INFO | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | Console message: [JavaScript Warning: "Unexpected value fooBarSVG parsing height attribute." {file: "http://example.com/browser/browser/devtools/webconsole/test//browser/test-bug-595934-svg.xhtml" line: 0}]
TEST-INFO | unknown test url | [SimpleTest/SimpleTest.js, window.onerror] An error occurred: fooBarWorker is not defined at http://example.com/browser/browser/devtools/webconsole/test//browser/test-bug-595934-workers.js:7
TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | Test timed out
XScreenSaver state: Disabled
User input has been idle for 1338 seconds
args: ['/home/cltbld/talos-slave/test/build/bin/screentopng']

INFO TEST-END | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | finished in 30043ms
Note that we've also had other issues on Linux on the -O3 non-pgo builds... glandium might recall details.
We've had float rounding issues, but nothing crashy that i remember.
BTW, non-PGO should build -Os on aurora, by now.
(In reply to Mike Hommey [:glandium] from comment #3)
> BTW, non-PGO should build -Os on aurora, by now.

Confirmed, that was bug 655003.
Summary: Perma-orange Aurora Linux opt (non-PGO) browser_webconsole_bug_595934_message_categories.js | Test timed out and test_errorPropagation.html | Test timed out followed by shutdown timeout → Perma-orange Aurora (and Beta) Linux opt (non-PGO) browser_webconsole_bug_595934_message_categories.js | Test timed out and test_errorPropagation.html | Test timed out followed by shutdown timeout
Not quite literally permaorange on m-b: there have been three builds, two of which say in the scrape that they were clobbers, which hit this, and one which does not say it was, which did not hit this.
So, Luke pushed https://hg.mozilla.org/integration/mozilla-inbound/rev/4c0e6b3f7791, removing a totally empty header from jseng, and browser_webconsole_bug_595934_message_categories.js went permaorange on 10.6 opt.

The only thing I've seen doing that same sort of senseless and impossible in a reasonable world thing is when we've got code accessing memory that it totally shouldn't be touching, and a nothing change to jseng suddenly affects a test that couldn't possibly be affected. A previous case with a11y only happened with some PGO'd builds, when PGO happened to move things around to make the badness happen; apparently the Aurora Linux thing is the opposite, and PGO will cover our tracks but non-PGO leaves us busted.

Given that the Aurora case is both devtools and workers, does that mean that both of them are being bad, or is the devtools test being bitten by bad workers? Will running these tests under Valgrind lead to it shrieking in horror at the things we're doing?
Blocks: 689362
Severity: normal → blocker
The devtools hang is the only devtools test that uses a worker, so I would not be too suspicious of the devtools code.

The way these two failing tests work is that they register an error console listener and then use a worker to generate an error. The tests wait until they see the error appear in the console before continuing. Sometimes they randomly time out without anything else seeming to fail (at least, I don't see anything else in the logs).

I'm not ready to blame the worker code yet, though. I've looked over the worker error generation code a bunch now and can't see anything amiss. On a whim I decided to take a look at the code in the error console listener implementation and I ran across this (nsConsoleService.cpp#161):

  * Iterate through any registered listeners and tell them about
  * the message.  We use the mListening flag to guard against
  * recursive message logs.  This could sometimes result in
  * listeners being skipped because of activity on other threads,
  * when we only care about the recursive case.

That seems pretty bad.. If we're unlucky enough to have an error reported on another thread immediately before the worker code generates the error that the tests are waiting on we'll miss the report and the test will never finish.

I'll see what we can do to fix this behavior (or the tests as a bandaid - maybe by looking at the console message list every second or so?). Basically this problem boils down to a bad test mechanism, not a bug in workers.

However, I don't think we can simply wave this off due to the crashes, and I'm pretty sure that the crashes are entirely due to the fact that the console manager is using proxies of JS objects to make its listeners get called back on the correct thread. Proxies are no longer safe to be used by JS on other threads, so this is a big red flag.

In short, I blame the console service for all my woes ;)

CC'ing luke and bsmedberg since they have a plan to nuke proxies and may have a patch to fix that part.

Anyone think my hunches are bonkers?
Component: General → XPCOM
QA Contact: general → xpcom
Inconveniently, backing out luke did not fix inbound's permaorange. His 10.6 opt build was a clobber, the two before that were not (and the four before that were burning), so either of them or Waldo's backout not really backing everything out are possibilities if they didn't manage to take effect until after a clobber, as is some releng change. Not sure what, if any, part of that fits with blaming the console service.
No longer blocks: 689362
the inbound permaorange has been fixed by backing out Bug 689101, Bug 692987 and Bug 692722
(In reply to ben turner [:bent] from comment #85)> 
> Anyone think my hunches are bonkers?

Well, I don't get it, but that's not an unusual thing for me.

How does the theory that maybe sometimes an error gets reported on another thread right before your error work with the evidence, that in Linux32 PGO builds on mozilla-aurora, it never ever happens, but in Linux32 non-PGO builds on aurora it happens every single time for both tests, and in Linux32 non-PGO builds on mozilla-beta it happens every single time only for the devtools test, not for the workers test? And what part of sfink's pushes caused an error to be reported on another thread right before yours, every single time, only on 10.6 opt?
(In reply to Phil Ringnalda (:philor) from comment #90)
Dunno, race conditions like this are always pretty strange, subject to weird tiny variations in timing and whatever OS scheduler you're dealing with.

I'm reasonably certain this will go away once bug 675221 removes the proxies and makes the listener logic more sane.

In the meantime I believe that we're just going to disable this test.
Depends on: 675221
This is rs=sicking.
Assignee: nobody → bent.mozilla
Status: NEW → ASSIGNED
Attachment #566979 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/5b68174fe6f6
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla10
Comment on attachment 566979 [details] [diff] [review]
Patch to disable console listener tests

This just disables small parts of two tests that are problematic. The real fix will land on trunk soon-ish, but the patch isn't really suitable for branches. Disabling these test pieces in the interim seems the smartest path forward.
Attachment #566979 - Flags: approval-mozilla-beta?
Attachment #566979 - Flags: approval-mozilla-aurora?
can we turn this off on aurora or is this a real bug?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [orange] [inbound] → [orange] [inbound][see comment #111]
Ok, as per c#111, we can create a patch to turn off the test on aurora. We'll do the work in this bug until it lands.
and, since I'm talking to myself in this bug now, can we get approval to land BenT's patch? :)
I think we leave this FIXED since it has landed on trunk then wait for them to approve the patch for aurora.
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
Maybe so, I just wanted the aurora-approver-gods to see that there was still something to be done here. Hopefully my incessant prattling will convince them of that! :)
Comment on attachment 566979 [details] [diff] [review]
Patch to disable console listener tests

Make it so!
Attachment #566979 - Flags: approval-mozilla-beta?
Attachment #566979 - Flags: approval-mozilla-beta+
Attachment #566979 - Flags: approval-mozilla-aurora?
Attachment #566979 - Flags: approval-mozilla-aurora+
Whiteboard: [orange] [inbound][see comment #111] → [qa-][orange] [inbound][see comment #111]
Heads up: having revised the infrastructure for console-monitoring tests, I turned these tests back on in bug 663291 (specifically, https://hg.mozilla.org/integration/mozilla-inbound/rev/b36eaac9ecf8 ) I may not have gotten all the race conditions.
... and they were backed out again because yep, still race conditions.  Filed new bug 812714 to get them fixed properly.
Whiteboard: [qa-][orange] [inbound][see comment #111] → [qa-] [inbound][see comment #111]
You need to log in before you can comment on or make changes to this bug.