Last Comment Bug 652459 - TM Windows nightlies (twice) and Linux64 opt builds (permanently) fail multiple mochitest-a11y tests with "getTextBeforeOffset" errors
: TM Windows nightlies (twice) and Linux64 opt builds (permanently) fail multip...
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: Trunk
: x86 Windows 7
: -- blocker (vote)
: mozilla6
Assigned To: general
:
Mentors:
Depends on: 654999
Blocks: gcc4.5
  Show dependency treegraph
 
Reported: 2011-04-24 14:23 PDT by Phil Ringnalda (:philor)
Modified: 2011-05-21 06:53 PDT (History)
13 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
backout disabling the tests (762 bytes, patch)
2011-05-10 13:03 PDT, Trevor Saunders (:tbsaunde)
dbolter: review+
Details | Diff | Splinter Review
Patch fixing the issue and test marked as TODO (15.82 KB, patch)
2011-05-20 01:56 PDT, Fernando Herrera
surkov.alexander: review+
Details | Diff | Splinter Review

Description Phil Ringnalda (:philor) 2011-04-24 14:23:37 PDT
On both the Windows nightly built 2011/04/23 and the one built 2011/04/24, the mochitest-a11y (part of mochitest-oth, so the M(oth) on tbpl) tests have failed with multiple failures like "getTextBeforeOffset (word start): wrong text, offset: 0, id: ' 'input' '; - "" should equal """ (http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1303653153.1303654716.27583.gz) spread across multiple tests. There are a couple of interesting things about that:

* The regular depend build from the same cset doesn't fail, even though in the case of the 2011/04/23 one, the depend build says that it did a periodic clobber, so it was also a full build. The failure is repeatable on the same build: I retriggered both the failed and the passing tests on the 2011/04/23 builds (which is supposed to then download the same build, nightly or depend, and seems to have done so), and the passes stayed passing while the failures stayed failing. I retriggered the failed runs on Win7 and WinXP on the 2011/04/24 nightly seven times each, and all seven failed.

* On 2011/04/21, dvander landed http://hg.mozilla.org/tracemonkey/rev/d851d44ad77a which caused Linux64 opt mochitest-a11y to fail with the same sort of getTextBeforeOffset thing (http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1303488964.1303490147.21467.gz), every run rather than only nightlies or only clobbers, until it was backed out on 2011/04/22.

Despite how interesting it would be to see what would happen after a merge to mozilla-central, I sort of doubt we want to merge a baffling a11y-breaker just to find out.
Comment 1 Phil Ringnalda (:philor) 2011-04-24 21:14:31 PDT
Because I have odd ideas about what would be fun, for fun I clobbered the Windows slaves (which is supposed to be the moral equivalent of a nightly), and triggered another build on Windows on the same tip rev that was fine as a depend and failed as a nightly, and it's also fine as a clobbered depend.
Comment 2 Phil Ringnalda (:philor) 2011-04-25 07:49:29 PDT
And then because the world likes to be random, the next nightly, which built on the same cset as the last, does not fail.
Comment 3 Paul Biggar 2011-04-29 08:18:58 PDT
This has gone perma-orange since the gcc-4.5 upgrade on linux64-opt. philor fired off a rebuild on the last good version before the upgrade: http://tree.mozilla.org/?tree=TraceMonkey&rev=268942ceb06f. So that should tell us whether the upgrade is responsible.

Either way, we probably have to deal with a closed tree until this is resolved.
Comment 4 Phil Ringnalda (:philor) 2011-04-29 14:42:15 PDT
Yeah, closing it now. That rebuild came up orange, so gcc-4.5 has brought us the same thing dvander's patch did, only possibly with nothing we can use as a scapegoat. I retriggered the build on the last merge from m-c, if that comes up green then I can start the hunt for a  scapegoat in earnest, retriggering builds on everything that's landed since then until I find the first one that fails, and that lucky person will have to figure it out (unless someone else lands something which also breaks it, before they do figure it out).
Comment 5 Phil Ringnalda (:philor) 2011-04-29 18:42:30 PDT
Unfortunately, the rebuild on Tuesday's m-c merge came up orange, so either something landed on m-c between Tuesday morning and Thursday morning which protected them from this, which we need to merge, or something inconceivable is happening.
Comment 6 Phil Ringnalda (:philor) 2011-04-29 18:56:43 PDT
Also seems pretty strongly associated with a "dom/src/threads/test/test_closeOnGC.html | Test timed out" also on Linux64, which mozilla-central also isn't seeing.
Comment 7 Mike Hommey [:glandium] 2011-04-30 02:10:53 PDT
(In reply to comment #5)
> Unfortunately, the rebuild on Tuesday's m-c merge came up orange

Are you sure? Looking at tbpl, it looks like green on the rebuild of 28bc239d3d9d...

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1304111798.1304113030.7492.gz
Comment 8 Phil Ringnalda (:philor) 2011-04-30 09:52:36 PDT
That's the one where I retriggered the mochitest-other directly, which just reran it on the same original build, as a control to verify that we hadn't made any infra changes that caused the problem.

Then I retriggered the build, letting the build trigger a new round of tests. Thanks to tbpl foolishly thinking that builds and tests happen within a given time after a push (because Tinderbox only has one API, "things that happened in a given timespan"), you cannot see that on tbpl.m.o, but since mine's turned up to looking 96 hours after a push instead of just 12, http://dev.philringnalda.com/tbpl/?tree=TraceMonkey&rev=28bc239d3d9d will show you the retriggered builds (I did a second when the first looked like it wouldn't ever finish, and the second retriggering caused two more builds instead of just one), and the three orange mochitest-other runs from running the tests on old code compiled with a new mozconfig.
Comment 9 Phil Ringnalda (:philor) 2011-05-01 09:28:05 PDT
Something did land on mozilla-central last week which either fixes this, or just happens to protect them from this: http://dev.philringnalda.com/tbpl/?tree=Try&rev=9a9aaf90ba8d is a try push with a mozilla-central parent from April 25th which hit this.

Given that we have every reason to believe that a merge will make it go away (though no particular reason to believe that it will be fixing it), should we stay closed, or just open and pile pushes on orange, trusting that we won't affect the outcome of the merge with any of them?
Comment 10 Phil Ringnalda (:philor) 2011-05-02 16:18:53 PDT
Until next time.
Comment 11 Mike Hommey [:glandium] 2011-05-05 01:22:47 PDT
I took the latest tinderbox build as of now and the corresponding tests.

If I run test_singleline.html repeatedly, the number of fails varies greatly, within the three values 2, 24 and 26.

If I remove all tests but the getTextBeforeOffset ones from that test, which are supposedly the ones failing, I get no failure.
Comment 12 Mike Hommey [:glandium] 2011-05-05 01:59:02 PDT
... and, it's not perma-orange on tm where it is on m-c...
Comment 13 Mike Hommey [:glandium] 2011-05-05 04:51:44 PDT
and it's perma-green when run under valgrind... but valgrind complains a bunch about uninitialized values on a stack allocation from nsHyperTextAccessible::GetTextHelper at nsHyperTextAccessible.cpp:892...
Comment 14 Mike Hommey [:glandium] 2011-05-05 05:04:51 PDT
See bug 590181 comment 147
Comment 15 David Bolter [:davidb] ***PTO until 29th*** 2011-05-05 05:45:44 PDT
(In reply to comment #13)
> and it's perma-green when run under valgrind... but valgrind complains a bunch
> about uninitialized values on a stack allocation from
> nsHyperTextAccessible::GetTextHelper at nsHyperTextAccessible.cpp:892...

Interesting. Can you please file a bug (core/disability) and paste the valgrind output there?
Comment 16 David Bolter [:davidb] ***PTO until 29th*** 2011-05-05 05:55:21 PDT
(In reply to comment #11)
> If I remove all tests but the getTextBeforeOffset ones from that test, which
> are supposedly the ones failing, I get no failure.

Please go ahead and disable those tests for now and open the tree.

Fernando told me we are calling GetRenderedText when we shouldn't be, and that this may all be solved when we use cached text for this.
Comment 17 David Bolter [:davidb] ***PTO until 29th*** 2011-05-05 06:05:16 PDT
(In reply to comment #16)
> (In reply to comment #11)
> > If I remove all tests but the getTextBeforeOffset ones from that test, which
> > are supposedly the ones failing, I get no failure.
> 
> Please go ahead and disable those tests for now and open the tree.
> 
> Fernando told me we are calling GetRenderedText when we shouldn't be, and that
> this may all be solved when we use cached text for this.

Oof I misread Mike's comment. I thought he was just removing the getTextBeforeOffset ones.
Comment 18 Boris Zbarsky [:bz] 2011-05-05 06:06:26 PDT
I have disabled these tests:

  accessible/tests/mochitest/text/test_singleline.html
  accessible/tests/mochitest/text/test_whitespaces.html

in http://hg.mozilla.org/mozilla-central/rev/2f2d4de42515
Comment 19 Mike Hommey [:glandium] 2011-05-05 06:50:13 PDT
(In reply to comment #15)
> (In reply to comment #13)
> > and it's perma-green when run under valgrind... but valgrind complains a bunch
> > about uninitialized values on a stack allocation from
> > nsHyperTextAccessible::GetTextHelper at nsHyperTextAccessible.cpp:892...
> 
> Interesting. Can you please file a bug (core/disability) and paste the valgrind
> output there?

Filed bug 654999. As it's not clear if it's related, I didn't put any dependency on this bug.
Comment 20 alexander :surkov 2011-05-10 02:32:14 PDT
It sounds like we've got disabled tests entirely while some part of tests were failing. Other point is we don't have good tests for text interface anymore so we are in risk of serious regressions. While we don't have any good guess why they were failed then we are faced these tests are disabled for a long time.
Comment 21 Mike Hommey [:glandium] 2011-05-10 03:12:20 PDT
Note that now PGO is back on again, we should be able to backout 2f2d4de42515.
Comment 22 Boris Zbarsky [:bz] 2011-05-10 10:17:23 PDT
Note that I disabled the tests on davidb's request.  I agree that we don't want them to be disabled.  Furthermore, we need to figure out _why_ they're failing without PGO on Linux...
Comment 23 Trevor Saunders (:tbsaunde) 2011-05-10 13:03:58 PDT
Created attachment 531434 [details] [diff] [review]
backout disabling the tests

So, it seems like we can seperate this into 2 problems

* reenable the tests on m-c we think this should just work with the current optamization level.  Is tryserver currently using the same flags for optamized builds as m-c? if not can it be made to use the same flags?  Either way I'll try to land this at night / in the evening to minimize the pain should it stay orange.

*  make this all work with any optimization level, in other words find out what certain optimizations cause to break.  I would gues that this part depends on fixing the valgrind errors mentioned above, or atleast it would make sense to see if this persists after they are fixed.  To test this can we easily change what optamization is done on try builds?
Comment 24 David Bolter [:davidb] ***PTO until 29th*** 2011-05-10 13:08:31 PDT
Comment on attachment 531434 [details] [diff] [review]
backout disabling the tests

r=me
Comment 25 Mike Hommey [:glandium] 2011-05-10 13:11:50 PDT
(In reply to comment #23)
> To test this can
> we easily change what optamization is done on try builds?

You can change optimization level in configure.in (e.g backing out 3abe2f8c592e in your try push), and you can disable PGO with https://wiki.mozilla.org/ReleaseEngineering/TryServer#How_to_get_non-PGO_coverage

If I recall correctly, you need to do both to reproduce the perma-orange.
Comment 26 Mike Hommey [:glandium] 2011-05-10 13:13:55 PDT
(In reply to comment #25)
> (In reply to comment #23)
> > To test this can
> > we easily change what optamization is done on try builds?
> 
> You can change optimization level in configure.in (e.g backing out
> 3abe2f8c592e in your try push), and you can disable PGO with
> https://wiki.mozilla.org/ReleaseEngineering/TryServer#How_to_get_non-
> PGO_coverage
> 
> If I recall correctly, you need to do both to reproduce the perma-orange.

Though I seem to recall that it also was perma-orange on tm when it had PGO and 3abe2f8c592e wasn't merged on it, so backing out 3abe2f8c592e alone should trigger it (but takes longer to build)
Comment 27 Trevor Saunders (:tbsaunde) 2011-05-11 23:06:15 PDT
REENABLED THE TESTS IN http://hg.mozilla.org/mozilla-central/rev/745a8761a2d8
Comment 28 Phil Ringnalda (:philor) 2011-05-13 23:42:46 PDT
And in the current episode:

The http://hg.mozilla.org/tracemonkey/rev/0de1cf797699 merge from mozilla-central at 14:45 May 9th merged disabling the tests to TM.

Between then and 15:09 on May 13th one of the 28 pushes to TM caused this to happen on Windows opt, but we didn't know it because the tests were disabled.

The http://hg.mozilla.org/tracemonkey/rev/599d1c6cba63 merge from mozilla-central at 15:09 May 13th merged reenabling the tests to TM. Once Windows built and ran m-a11y, both Win7 and WinXP failed, but by then TM had already been merged back to mozilla-central, bringing that orange with it.

http://hg.mozilla.org/mozilla-central/rev/8404426ef391 disabled the tests again on mozilla-central.

I closed TM (again, yeah), until someone either figures out why random jseng patches cause this in random optimized versions on random OSes, and according to rumor, can then uncause it by moving around blocks of code, or until someone just backs out everything that landed on TM between Tuesday and Friday until it goes green.
Comment 29 Mike Hommey [:glandium] 2011-05-13 23:54:25 PDT
(In reply to comment #28)
> I closed TM (again, yeah), until someone either figures out why random jseng
> patches cause this in random optimized versions on random OSes, and
> according to rumor, can then uncause it by moving around blocks of code

I think the reason is as "simple" as bug 654999 comment 7.
Comment 30 Chris Leary [:cdleary] (not checking bugmail) 2011-05-16 10:26:38 PDT
(In reply to comment #29)
> I think the reason is as "simple" as bug 654999 comment 7.

Who is the current owner for nsHyperTextAccessible::GetTextHelper?

This is clearly not a JS issue. I'm happy to provide the appropriate person with the changeset where the WinOpt went permaorange, but given that Julian's attempt at an obvious fix didn't pan out, we need to pass this off to people with domain knowledge who can hack on it via try. I'm going to disable the test on the tracemonkey side and reopen the TM tree if there are no serious objections.
Comment 31 Boris Zbarsky [:bz] 2011-05-16 11:10:31 PDT
> Who is the current owner for nsHyperTextAccessible::GetTextHelper?

Some combination of davidb and surkov (cced on this bug).
Comment 32 Chris Leary [:cdleary] (not checking bugmail) 2011-05-16 15:10:55 PDT
Okay, http://hg.mozilla.org/tracemonkey/rev/599d1c6cba63 is the TM/MC merge changeset that gives the perma-orange. davidb/surkov, you guys have the con!
Comment 33 Chris Leary [:cdleary] (not checking bugmail) 2011-05-17 10:22:27 PDT
Could somebody take assignment on this? I just want to make sure it gets done --  we shouldn't just drop failing tests from our suite when they're reproducibly orange on our build machines and we have a known proximal cause.
Comment 34 David Bolter [:davidb] ***PTO until 29th*** 2011-05-17 10:41:26 PDT
Assigning to Fernando since he is working on the a11y side of this. The orange tests are currently disabled. If we go ahead with a TM merge we have until the 24th to fix on the a11y end and failing that would face backing out the TM merge. Obviously this will be a priority.
Comment 35 Trevor Saunders (:tbsaunde) 2011-05-17 11:32:52 PDT
(In reply to comment #33)
> Could somebody take assignment on this? I just want to make sure it gets
> done --  we shouldn't just drop failing tests from our suite when they're

yeah, that was already being worked on in bug 654999 some, sorry that wasn't clerer


> reproducibly orange on our build machines and we have a known proximal cause.
Comment 36 Fernando Herrera 2011-05-18 09:44:06 PDT
I pushed this change:

   if (aType == eGetBefore) {
 -    endOffset = aOffset;
 +    finalEndOffset = aOffset;
    }

To the TM tryserver:

http://tbpl.mozilla.org/?tree=Try&pusher=fherrera@mozilla.com&rev=b06c31f9b75d
Comment 37 Mike Hommey [:glandium] 2011-05-18 09:46:58 PDT
(In reply to comment #36)
> I pushed this change:
> 
>    if (aType == eGetBefore) {
>  -    endOffset = aOffset;
>  +    finalEndOffset = aOffset;
>     }
> 
> To the TM tryserver:
> 
> http://tbpl.mozilla.org/?tree=Try&pusher=fherrera@mozilla.
> com&rev=b06c31f9b75d

https://bugzilla.mozilla.org/show_bug.cgi?id=654999#c11
Comment 38 Fernando Herrera 2011-05-20 01:56:23 PDT
Created attachment 533908 [details] [diff] [review]
Patch fixing the issue and test marked as TODO

As the patch also includes fixes for some test results, I'm attaching it complete
Comment 39 alexander :surkov 2011-05-20 03:37:08 PDT
Comment on attachment 533908 [details] [diff] [review]
Patch fixing the issue and test marked as TODO

Review of attachment 533908 [details] [diff] [review]:
-----------------------------------------------------------------

r=me, obviously that was a misspelling
Comment 40 alexander :surkov 2011-05-20 03:39:23 PDT
(In reply to comment #38)
> Created attachment 533908 [details] [diff] [review] [review]
> Patch fixing the issue and test marked as TODO
> 
> As the patch also includes fixes for some test results, I'm attaching it
> complete

It should be landed as for bug 654999 since you're not sure it's a right fix for this one, right?
Comment 41 Mike Hommey [:glandium] 2011-05-20 03:59:11 PDT
(In reply to comment #40)
> (In reply to comment #38)
> > Created attachment 533908 [details] [diff] [review] [review] [review]
> > Patch fixing the issue and test marked as TODO
> > 
> > As the patch also includes fixes for some test results, I'm attaching it
> > complete
> 
> It should be landed as for bug 654999 since you're not sure it's a right fix
> for this one, right?

It's easy to verify: backout 3abe2f8c592e, add a mozconfig-extra file containing mk_add_options MOZ_PGO=, and push to try.
Comment 42 alexander :surkov 2011-05-21 00:10:14 PDT
I landed the patch - http://hg.mozilla.org/mozilla-central/rev/5ebbe0f78c74. Fernando, can you make sure this patch fixes this bug too?
Comment 43 Mike Hommey [:glandium] 2011-05-21 06:03:16 PDT
Pushed to try. We'll see how it goes.
http://tbpl.mozilla.org/?tree=Try&rev=e4ea52ae7293
Comment 44 Mike Hommey [:glandium] 2011-05-21 06:53:35 PDT
And this is all green.

Note You need to log in before you can comment on or make changes to this bug.