Closed Bug 652459 Opened 13 years ago Closed 13 years ago

TM Windows nightlies (twice) and Linux64 opt builds (permanently) fail multiple mochitest-a11y tests with "getTextBeforeOffset" errors

Categories

(Core :: JavaScript Engine, defect)

x86 Windows 7
Type: defect
Priority: Not set
Severity: blocker

Tracking


Status: RESOLVED FIXED
Target Milestone: mozilla6

People

(Reporter: philor, Unassigned)

Attachments

(2 files)

On both the Windows nightly built 2011/04/23 and the one built 2011/04/24, the mochitest-a11y (part of mochitest-oth, so the M(oth) on tbpl) tests have failed with multiple failures like "getTextBeforeOffset (word start): wrong text, offset: 0, id: 'input'; - "" should equal """ (http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1303653153.1303654716.27583.gz) spread across multiple tests. There are a couple of interesting things about that:

* The regular depend build from the same cset doesn't fail, even though in the case of the 2011/04/23 one, the depend build says that it did a periodic clobber, so it was also a full build. The failure is repeatable on the same build: I retriggered both the failed and the passing tests on the 2011/04/23 builds (which is supposed to then download the same build, nightly or depend, and seems to have done so), and the passes stayed passing while the failures stayed failing. I retriggered the failed runs on Win7 and WinXP on the 2011/04/24 nightly seven times each, and all seven failed.

* On 2011/04/21, dvander landed http://hg.mozilla.org/tracemonkey/rev/d851d44ad77a which caused Linux64 opt mochitest-a11y to fail with the same sort of getTextBeforeOffset thing (http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1303488964.1303490147.21467.gz), every run rather than only nightlies or only clobbers, until it was backed out on 2011/04/22.

Despite how interesting it would be to see what would happen after a merge to mozilla-central, I sort of doubt we want to merge a baffling a11y-breaker just to find out.
Because I have odd ideas about what counts as fun, I clobbered the Windows slaves (which is supposed to be the moral equivalent of a nightly) and triggered another build on Windows on the same tip rev that was fine as a depend build and failed as a nightly; it's also fine as a clobbered depend build.
And then, because the world likes to be random, the next nightly, which was built on the same cset as the last, did not fail.
Blocks: 628332
This has gone perma-orange since the gcc-4.5 upgrade on linux64-opt. philor fired off a rebuild on the last good version before the upgrade: http://tree.mozilla.org/?tree=TraceMonkey&rev=268942ceb06f. So that should tell us whether the upgrade is responsible.

Either way, we probably have to deal with a closed tree until this is resolved.
Yeah, closing it now. That rebuild came up orange, so gcc-4.5 has brought us the same thing dvander's patch did, only possibly with nothing we can use as a scapegoat. I retriggered the build on the last merge from m-c; if that comes up green then I can start the hunt for a scapegoat in earnest, retriggering builds on everything that's landed since then until I find the first one that fails, and that lucky person will have to figure it out (unless someone else lands something which also breaks it before they do figure it out).
Summary: TM Windows nightlies fail multiple mochitest-a11y tests with "getTextBeforeOffset" errors → TM Windows nightlies (twice) and Linux64 opt builds (permanently) fail multiple mochitest-a11y tests with "getTextBeforeOffset" errors
Unfortunately, the rebuild on Tuesday's m-c merge came up orange, so either something landed on m-c between Tuesday morning and Thursday morning which protected them from this, which we need to merge, or something inconceivable is happening.
This also seems pretty strongly associated with a "dom/src/threads/test/test_closeOnGC.html | Test timed out" failure, also on Linux64, which mozilla-central isn't seeing.
(In reply to comment #5)
> Unfortunately, the rebuild on Tuesday's m-c merge came up orange

Are you sure? Looking at tbpl, it looks like green on the rebuild of 28bc239d3d9d...

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1304111798.1304113030.7492.gz
That's the one where I retriggered the mochitest-other directly, which just reran it on the same original build, as a control to verify that we hadn't made any infra changes that caused the problem.

Then I retriggered the build, letting the build trigger a new round of tests. Thanks to tbpl foolishly assuming that builds and tests happen within a given time after a push (because Tinderbox only has one API, "things that happened in a given timespan"), you cannot see that on tbpl.m.o. But since my copy is turned up to look 96 hours after a push instead of just 12, http://dev.philringnalda.com/tbpl/?tree=TraceMonkey&rev=28bc239d3d9d will show you the retriggered builds (I did a second retrigger when the first looked like it would never finish, and that second one caused two more builds instead of just one), along with the three orange mochitest-other runs from running the tests on old code compiled with a new mozconfig.
Something did land on mozilla-central last week which either fixes this, or just happens to protect them from this: http://dev.philringnalda.com/tbpl/?tree=Try&rev=9a9aaf90ba8d is a try push with a mozilla-central parent from April 25th which hit this.

Given that we have every reason to believe that a merge will make it go away (though no particular reason to believe that it will be fixing it), should we stay closed, or just open and pile pushes on orange, trusting that we won't affect the outcome of the merge with any of them?
Until next time.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INCOMPLETE
I took the latest tinderbox build as of now and the corresponding tests.

If I run test_singleline.html repeatedly, the number of failures varies from run to run, but only among three values: 2, 24 and 26.

If I remove all tests but the getTextBeforeOffset ones from that test, which are supposedly the ones failing, I get no failure.
... and it's not perma-orange on tm where it is on m-c...
and it's perma-green when run under valgrind... but valgrind complains a bunch about uninitialized values on a stack allocation from nsHyperTextAccessible::GetTextHelper at nsHyperTextAccessible.cpp:892...
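
To illustrate why an uninitialized stack value produces exactly this kind of on-again, off-again orange, here is a minimal hypothetical sketch of the failure mode. It is not the actual nsHyperTextAccessible code and every name in it is made up: a local that is only assigned on some paths hands back whatever happens to be in its stack slot on the others, and whether that garbage is harmless depends on the compiler, the optimization level, and PGO.

  #include <cstdint>
  #include <iostream>

  enum GetTextType { eGetBefore, eGetAt, eGetAfter };

  // Hypothetical sketch: finalEndOffset is only assigned on some branches,
  // so the eGetBefore case returns whatever was left in that stack slot.
  // Whether that garbage is harmless depends on the compiler, optimization
  // level, and PGO -- which is why the orange comes and goes across builds.
  int32_t ComputeEndOffset(GetTextType aType, int32_t aOffset) {
    int32_t finalEndOffset;  // uninitialized
    if (aType == eGetAt || aType == eGetAfter) {
      finalEndOffset = aOffset + 1;
    }
    // eGetBefore falls through without assigning finalEndOffset.
    return finalEndOffset;   // undefined value for eGetBefore
  }

  int main() {
    std::cout << ComputeEndOffset(eGetBefore, 0) << "\n";  // garbage
    return 0;
  }

Memcheck reports reads of such values with messages like "Conditional jump or move depends on uninitialised value(s)", which matches the complaint above.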
See bug 590181 comment 147
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
No longer blocks: 628332
(In reply to comment #13)
> and it's perma-green when run under valgrind... but valgrind complains a bunch
> about uninitialized values on a stack allocation from
> nsHyperTextAccessible::GetTextHelper at nsHyperTextAccessible.cpp:892...

Interesting. Can you please file a bug (core/disability) and paste the valgrind output there?
(In reply to comment #11)
> If I remove all tests but the getTextBeforeOffset ones from that test, which
> are supposedly the ones failing, I get no failure.

Please go ahead and disable those tests for now and open the tree.

Fernando told me we are calling GetRenderedText when we shouldn't be, and that this may all be solved when we use cached text for this.
(In reply to comment #16)
> (In reply to comment #11)
> > If I remove all tests but the getTextBeforeOffset ones from that test, which
> > are supposedly the ones failing, I get no failure.
> 
> Please go ahead and disable those tests for now and open the tree.
> 
> Fernando told me we are calling GetRenderedText when we shouldn't be, and that
> this may all be solved when we use cached text for this.

Oof I misread Mike's comment. I thought he was just removing the getTextBeforeOffset ones.
I have disabled these tests:

  accessible/tests/mochitest/text/test_singleline.html
  accessible/tests/mochitest/text/test_whitespaces.html

in http://hg.mozilla.org/mozilla-central/rev/2f2d4de42515
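
For anyone following along who wants to see what "disabling" amounts to: assuming the Makefile.in _TEST_FILES registration that mochitests used at the time, disabling a test just means dropping it from that list. A rough sketch of the shape of such a change, not the literal content of 2f2d4de42515:

   # accessible/tests/mochitest/text/Makefile.in -- sketch only, assuming
   # the _TEST_FILES registration; not the literal changeset.
   _TEST_FILES = \
  -		test_singleline.html \
  -		test_whitespaces.html \
   		$(NULL)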
(In reply to comment #15)
> (In reply to comment #13)
> > and it's perma-green when run under valgrind... but valgrind complains a bunch
> > about uninitialized values on a stack allocation from
> > nsHyperTextAccessible::GetTextHelper at nsHyperTextAccessible.cpp:892...
> 
> Interesting. Can you please file a bug (core/disability) and paste the valgrind
> output there?

Filed bug 654999. As it's not clear if it's related, I didn't put any dependency on this bug.
Blocks: 590181
Blocks: gcc4.5
No longer blocks: 590181
It sounds like we've disabled the tests entirely even though only some of them were failing. Another point is that we no longer have good tests for the text interface, so we are at risk of serious regressions. And since we don't have any good guess as to why they were failing, we face having these tests disabled for a long time.
Note that now that PGO is back on again, we should be able to back out 2f2d4de42515.
Note that I disabled the tests on davidb's request.  I agree that we don't want them to be disabled.  Furthermore, we need to figure out _why_ they're failing without PGO on Linux...
So, it seems like we can separate this into two problems:

* Re-enable the tests on m-c; we think this should just work with the current optimization level. Is tryserver currently using the same flags for optimized builds as m-c? If not, can it be made to use the same flags? Either way, I'll try to land this at night / in the evening to minimize the pain should it stay orange.

* Make this all work at any optimization level; in other words, find out what certain optimizations cause to break. I would guess that this part depends on fixing the valgrind errors mentioned above, or at least it would make sense to see whether this persists after they are fixed. To test this, can we easily change what optimization is done on try builds?
Attachment #531434 - Flags: review?
Comment on attachment 531434 [details] [diff] [review]
backout disabling the tests

r=me
Attachment #531434 - Flags: review? → review+
(In reply to comment #23)
> To test this, can
> we easily change what optimization is done on try builds?

You can change the optimization level in configure.in (e.g. backing out 3abe2f8c592e in your try push), and you can disable PGO with https://wiki.mozilla.org/ReleaseEngineering/TryServer#How_to_get_non-PGO_coverage

If I recall correctly, you need to do both to reproduce the perma-orange.
(In reply to comment #25)
> (In reply to comment #23)
> > To test this, can
> > we easily change what optimization is done on try builds?
> 
> You can change the optimization level in configure.in (e.g. backing out
> 3abe2f8c592e in your try push), and you can disable PGO with
> https://wiki.mozilla.org/ReleaseEngineering/TryServer#How_to_get_non-PGO_coverage
> 
> If I recall correctly, you need to do both to reproduce the perma-orange.

Though I seem to recall that it was also perma-orange on tm when it had PGO and 3abe2f8c592e wasn't merged on it, so backing out 3abe2f8c592e alone should trigger it (but it takes longer to build).
And in the current episode:

The http://hg.mozilla.org/tracemonkey/rev/0de1cf797699 merge from mozilla-central at 14:45 May 9th merged disabling the tests to TM.

Between then and 15:09 on May 13th one of the 28 pushes to TM caused this to happen on Windows opt, but we didn't know it because the tests were disabled.

The http://hg.mozilla.org/tracemonkey/rev/599d1c6cba63 merge from mozilla-central at 15:09 May 13th merged reenabling the tests to TM. Once Windows built and ran m-a11y, both Win7 and WinXP failed, but by then TM had already been merged back to mozilla-central, bringing that orange with it.

http://hg.mozilla.org/mozilla-central/rev/8404426ef391 disabled the tests again on mozilla-central.

I closed TM (again, yeah), until someone either figures out why random jseng patches cause this in random optimized versions on random OSes, and according to rumor, can then uncause it by moving around blocks of code, or until someone just backs out everything that landed on TM between Tuesday and Friday until it goes green.
Status: REOPENED → NEW
(In reply to comment #28)
> I closed TM (again, yeah), until someone either figures out why random jseng
> patches cause this in random optimized versions on random OSes, and
> according to rumor, can then uncause it by moving around blocks of code

I think the reason is as "simple" as bug 654999 comment 7.
(In reply to comment #29)
> I think the reason is as "simple" as bug 654999 comment 7.

Who is the current owner for nsHyperTextAccessible::GetTextHelper?

This is clearly not a JS issue. I'm happy to provide the appropriate person with the changeset where the WinOpt went perma-orange, but given that Julian's attempt at an obvious fix didn't pan out, we need to pass this off to people with domain knowledge who can hack on it via try. I'm going to disable the test on the tracemonkey side and reopen the TM tree if there are no serious objections.
> Who is the current owner for nsHyperTextAccessible::GetTextHelper?

Some combination of davidb and surkov (cced on this bug).
Okay, http://hg.mozilla.org/tracemonkey/rev/599d1c6cba63 is the TM/MC merge changeset that gives the perma-orange. davidb/surkov, you guys have the con!
Could somebody take assignment on this? I just want to make sure it gets done --  we shouldn't just drop failing tests from our suite when they're reproducibly orange on our build machines and we have a known proximal cause.
Assigning to Fernando since he is working on the a11y side of this. The orange tests are currently disabled. If we go ahead with a TM merge, we have until the 24th to fix this on the a11y end; failing that, we would face backing out the TM merge. Obviously this will be a priority.
Assignee: general → fherrera
(In reply to comment #33)
> Could somebody take assignment on this? I just want to make sure it gets
> done --  we shouldn't just drop failing tests from our suite when they're

Yeah, that was already being worked on some in bug 654999; sorry that wasn't clearer


> reproducibly orange on our build machines and we have a known proximal cause.
I pushed this change:

   if (aType == eGetBefore) {
 -    endOffset = aOffset;
 +    finalEndOffset = aOffset;
    }

To the TM tryserver:

http://tbpl.mozilla.org/?tree=Try&pusher=fherrera@mozilla.com&rev=b06c31f9b75d
Assignee: fherrera → general
(In reply to comment #36)
> I pushed this change:
> 
>    if (aType == eGetBefore) {
>  -    endOffset = aOffset;
>  +    finalEndOffset = aOffset;
>     }
> 
> To the TM tryserver:
> 
> http://tbpl.mozilla.org/?tree=Try&pusher=fherrera@mozilla.com&rev=b06c31f9b75d

https://bugzilla.mozilla.org/show_bug.cgi?id=654999#c11
As the patch also includes fixes for some test results, I'm attaching it complete
Comment on attachment 533908 [details] [diff] [review]
Patch fixing the issue and test marked as TODO

Review of attachment 533908 [details] [diff] [review]:
-----------------------------------------------------------------

r=me, obviously that was a misspelling
Attachment #533908 - Flags: review+
(In reply to comment #38)
> Created attachment 533908 [details] [diff] [review] [review]
> Patch fixing the issue and test marked as TODO
> 
> As the patch also includes fixes for some test results, I'm attaching it
> complete

It should be landed as the fix for bug 654999, since you're not sure it's the right fix for this one, right?
(In reply to comment #40)
> (In reply to comment #38)
> > Created attachment 533908 [details] [diff] [review] [review] [review]
> > Patch fixing the issue and test marked as TODO
> > 
> > As the patch also includes fixes for some test results, I'm attaching it
> > complete
> 
> It should be landed as the fix for bug 654999, since you're not sure it's the
> right fix for this one, right?

It's easy to verify: back out 3abe2f8c592e, add a mozconfig-extra file containing mk_add_options MOZ_PGO=, and push to try.
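
For reference, that mozconfig-extra file is tiny; a sketch of it, assuming nothing beyond the PGO toggle is needed (the optimization-level part comes from the separate backout of 3abe2f8c592e in configure.in):

  # mozconfig-extra for the try push: turn PGO off so the plain
  # optimized build is what actually gets tested.
  mk_add_options MOZ_PGO=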
Depends on: 654999
I landed the patch - http://hg.mozilla.org/mozilla-central/rev/5ebbe0f78c74. Fernando, can you make sure this patch fixes this bug too?
Pushed to try. We'll see how it goes.
http://tbpl.mozilla.org/?tree=Try&rev=e4ea52ae7293
And this is all green.
Status: NEW → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla6