Closed Bug 617152 Opened 9 years ago Closed 9 years ago

Reftest harness test termination logic needs an overhaul

Categories

(Core :: Layout, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- betaN+

People

(Reporter: roc, Assigned: roc)

References

Details

Attachments

(15 files, 6 obsolete files)

1.91 KB, patch
dbaron
: review+
Details | Diff | Splinter Review
23.79 KB, patch
Details | Diff | Splinter Review
24.28 KB, patch
Details | Diff | Splinter Review
5.16 KB, patch
dbaron
: review+
Details | Diff | Splinter Review
2.76 KB, patch
dbaron
: review+
Details | Diff | Splinter Review
1.22 KB, patch
dbaron
: review+
Details | Diff | Splinter Review
1.26 KB, patch
tnikkel
: review+
Details | Diff | Splinter Review
4.00 KB, patch
roc
: review+
Details | Diff | Splinter Review
23.44 KB, patch
dbaron
: review+
Details | Diff | Splinter Review
1.12 KB, patch
dbaron
: review+
Details | Diff | Splinter Review
4.08 KB, patch
tnikkel
: review+
Details | Diff | Splinter Review
2.62 KB, patch
jaas
: review-
Details | Diff | Splinter Review
1.25 KB, patch
Details | Diff | Splinter Review
2.62 KB, patch
jaas
: review+
Details | Diff | Splinter Review
3.08 KB, patch
tnikkel
: review+
Details | Diff | Splinter Review
See https://bugzilla.mozilla.org/show_bug.cgi?id=579808#c18

While working on this I also discovered some problems in existing reftests, which I'll attach fixes for.
This test creates a data: URL image in the same script that also removes reftest-wait, and expects the image to finish loading in time for the reftest snapshot. That seems like a bad assumption to me --- it would be hard to guarantee in general. So fix the test to remove reftest-wait in the image's onload handler.
Attachment #495645 - Flags: review?(dbaron)
Attached patch Part 2: fix SVG-as-image tests (obsolete) — Splinter Review
Some of the SVG-as-image tests are similar: they create images with data: URLs during the document onload handler, and expect those images to finish loading before we take the reftest snapshot.

These tests don't actually need to be reftest-wait tests at all. We can create the SVG content while the test loads and the document load event (and hence the snapshot) will wait for the SVG images to load.

Also, if we set <img> and <embed> to vertical-align:top, the tests pass for me on Windows.
Attachment #495647 - Flags: review?(dholbert)
Version: unspecified → Trunk
Comment on attachment 495647 [details] [diff] [review]
Part 2: fix SVG-as-image tests

Looks great - thanks for fixing all of these!
Attachment #495647 - Flags: review?(dholbert) → review+
With a bit more analysis I see that we do need to have some kind of state machine for reftest-wait tests (and "explicit paint wait" tests, i.e. tests containing plugins). The sequence of events we need to support is:
1) document load event fires
2) all outstanding paints (and explicit paint waits) are flushed
3) fire MozInvalidateEvent
4) wait for reftest-wait to be removed
5) Set print mode, if necessary
6) wait for all outstanding paints (and explicit paint waits) to be flushed, again

This patch introduces an explicit state machine that does that.
Attachment #495652 - Flags: review?(dbaron)
Keywords: checkin-needed
Whiteboard: [needs landing]
Attachment #495645 - Flags: review?(dbaron) → review+
Comment on attachment 495652 [details] [diff] [review]
Part 3: reftest.js harness update

>+// Initial state. When all MozAfterPaint events and all explicit paint waits
>+// are flushed, we can fire the MozReftestInvalidate event and move to the next state.
>+const STATE_WAITING_TO_FIRE_INVALIDATE_EVENT = 0;

s/When/When the doucemnt has loaded and/

(Also, please wrap comments at less than 80 characters.)


Could you make the four gFailureReason assignments inside WaitForTestEnd
more useful?  In particular, I think they should probably be:
  timed out waiting for pending paint count to reach zero
  timed out waiting for reftest-wait to be removed
  timed out waiting for pending paint count to reach zero (after reftest-wait removed and switch to print mode)
  timed out while taking snapshot (bug in harness?)

>+    // We might already be able to advance the state, e.g. if reftest-wait was
>+    // removed in the load event handler. So make sure we check here.

This should also happen if there was no reftest-wait at all, or if
the pending paint count is zero.  I'm not sure you really need this much
of a comment here, but I think this comment is a little misleading.


Maybe it's time to rename DocumentLoaded to RecordSnapshot or something
similar?

Maybe InitCurrentCanvasWithSnapshot should bail if gCurrentCanvas is
non-null?  Or can you avoid calling it twice in the case where the first
call to it leads to pending paints?  It seems more accurate not to
redraw the whole thing in that case.

r=dbaron with that

Are you expecting this to fix the spate of reftest/crashtest timeouts
we've been seeing recently?
Attachment #495652 - Flags: review?(dbaron) → review+
(In reply to comment #5)
> s/When/When the doucemnt has loaded and/
> 
> (Also, please wrap comments at less than 80 characters.)

OK

> Could you make the four gFailureReason assignments inside WaitForTestEnd
> more useful?  In particular, I think they should probably be:
>   timed out waiting for pending paint count to reach zero
>   timed out waiting for reftest-wait to be removed
>   timed out waiting for pending paint count to reach zero (after reftest-wait
> removed and switch to print mode)
>   timed out while taking snapshot (bug in harness?)

OK

> >+    // We might already be able to advance the state, e.g. if reftest-wait was
> >+    // removed in the load event handler. So make sure we check here.
> 
> This should also happen if there was no reftest-wait at all, or if
> the pending paint count is zero.  I'm not sure you really need this much
> of a comment here, but I think this comment is a little misleading.

OK, I'll remove the comment.

> Maybe it's time to rename DocumentLoaded to RecordSnapshot or something
> similar?

How about RecordResult?

> Maybe InitCurrentCanvasWithSnapshot should bail if gCurrentCanvas is
> non-null?  Or can you avoid calling it twice in the case where the first
> call to it leads to pending paints?  It seems more accurate not to
> redraw the whole thing in that case.

Yeah, I can factor the initial snapshot out of WaitForTestEnd so we don't do it twice when we trigger WaitForTestEnd belatedly.

> Are you expecting this to fix the spate of reftest/crashtest timeouts
> we've been seeing recently?

I haven't been paying attention to them, but I am expecting this to fix some random test failures involving plugins.
(In reply to comment #6)
> > Maybe it's time to rename DocumentLoaded to RecordSnapshot or something
> > similar?
> 
> How about RecordResult?

Sure.
It turns out that a bunch of the SVG-as-image tests (img-*.html) now work on Windows, but not with D2D, so I'll re-mark them as random-if(d2d).
Attached patch Part 3 v2Splinter Review
Updated to comments
Attachment #495652 - Attachment is obsolete: true
Attached patch Part 2 v2Splinter Review
Attachment #495647 - Attachment is obsolete: true
I guess anyone could land these now.
Although I probably should do a tryserver run...
blocks blockers
blocking2.0: --- → betaN+
Tryserver run revealed a timeout in pluginproblemui tests that's probably caused by an imbalance in MozPaintWait vs MozPaintWaitFinished. We need to ensure that even if the plugin never paints normally (e.g. because it dies) we still fire MozPaintWaitFinished.

We also need to fix the event dispatch here so that the events fire at safe times.
Attachment #496455 - Flags: review?(dbaron)
I saw one other failure on tryserver, in 582476-1.svg; the contents of the IFRAME were not repainted properly. It looks like there's a style change in the IFRAME that does not require reflow, only repaint, and somehow that doesn't get picked up by the reftest harness.

The style change should be flushed by the call to getBoundingClientRect() on the IFRAME document's root element. That flush should trigger an Invalidate on the frame, which should ensure a MozAfterPaint event is queued for the toplevel chrome document, since the blue rectangle in the IFRAME is visible in the chrome document at this point. And the reftest harness should detect that the event is pending and delay completion until the event has been processed and the canvas updated.

This could be tough to track down.
Whiteboard: [needs landing]
Random shot in the dark: if there's a MozAfterPaint being missed by the harness, maybe bug 617525 is the cause.
Attached patch Part 4 v2Splinter Review
The previous iteration had a disastrous bug where mFinished was not initialized.
Attachment #496455 - Attachment is obsolete: true
Attachment #496724 - Flags: review?(dbaron)
Attachment #496455 - Flags: review?(dbaron)
I saw this assertion on try but I have no idea how it happened. This will spew a bit more debug output that might help.
Attachment #496725 - Flags: review?(joshmoz)
Attached patch Part 6: improve diagnostics (obsolete) — Splinter Review
This builds on the previous patch to add tons of diagnostics to reftest.js. In particular, I've added a gTestLog array which you can push messages onto. The contents of this array are dumped when a test unexpectedly fails. I've added a lot of informational messages that get pushed to that array, so when a test fails, you get a reasonably detailed log of what happened.
Attachment #496726 - Flags: review?(dbaron)
This is slightly wasteful in the case where we do a full paint to detect the presence of plugins and then belatedly enter WaitForTestEnd, but I think we should do it anyway. I don't think we can currently miss any invalidation events between taking a snapshot in AfterOnLoadScripts and adding the listeners in WaitForTestEnd, but I think it pays to be ultra-paranoid about races here and just taking a full snapshot after setting up the listeners makes it clearly safe.
Attachment #496730 - Flags: review?(dbaron)
I don't think this call does anything useful. It converts invalidates deferred at the view manager level into immediate invalidates, but that doesn't affect the timing of MozAfterPaint dispatching, nor the contents of the layer tree, nor results of a drawWindow call, so it should affect us at all. And having it around is confusing since it looks like it might be important.
Attachment #496733 - Flags: review?(dbaron)
While I was poking at test failures here I discovered a rather bad bug already on trunk. nsDisplayPlugin::GetBounds adjusts the bounds to match the size of the plugin surface we have. But it even does that when we're calling GetBounds when using display lists to compute which part of the plugin is visible! So if the plugin's surface is currently empty (because we didn't get the new surface yet), we sometimes set the plugin cliprect to empty to match!

This is one of those "how does anything work at all" moments...
Attachment #496734 - Flags: review?(tnikkel)
Comment on attachment 496734 [details] [diff] [review]
Part 9: Use desired bounds in nsDisplayList::GetBounds, not actual bounds, when computing plugin geometry

We only adjust the rect's position based on the surfaces size, so that's probably why it was working.

I'm a little confused by the context in this patch. The context in your patch seems to add the same point to r that trunk subtracts. Is that a bugfix somewhere else in your queue or am I missing something?
Ah, right. I was testing this in conjunction with the nsObjectFrame part of the patch in bug 579808, which you have reviewed :-).
(In reply to comment #24)
> We only adjust the rect's position based on the surfaces size, so that's
> probably why it was working.

Well, even on trunk it's only "mostly" working. The visible region will be wrong sometimes.
Oh ok, that explains why something about that function seemed a little too familiar.

Since the patch in bug 579808 actually makes this problem worse I think we should make sure that this patch lands at the same time as that hunk by putting both changes into the same bug (a new one, or whichever one makes the most sense).
Let's just move the nsObjectFrame part of the patch into this bug. It's messing with MozPaintWait so it kinda belongs here anyway.

You've already reviewed this.
Attachment #496736 - Flags: review+
Attachment #496734 - Flags: review?(tnikkel) → review+
OK, a few more tryserver failures...

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1291955415.1291956162.24226.gz&fulltext=1#err0
layout/generic/crashtests/348887-1.html contains an IFRAME that contains document.location.reload(), i.e., the IFRAME is constantly reloading. The outer document removes reftest-wait after one second, but it looks like we remain in an endless loop of getting a MozAfterPaint event, run MakeProgress off a setTimeout(0), which flushes rendering, which generates another invalidate. I guess the reload keeps happening during the setTimeout(0).

We can tighten up AfterPaintListener and ExplicitPaintsCompleteListener to run MakeProgress synchronously (since they're currently asynchronous events anyway). But that might not be enough for some testcases. Imagine a testcase that restyles elements in multiple subframes at high frequency. We could constantly be in a situation where we're processing a MozAfterPaint generated from one subframe but there are unflushed restyles in other subframes, so our FlushRendering call constantly generates new invalidates, so there are always pending MozAfterPaints when we ask.

The thing is, that behavior makes no sense for a reftest because you won't get a consistent snapshot unless you let the test stabilize. It only makes sense for crashtests or other tests that don't need a snapshot. And for those tests, we don't care about pending paints being flushed before we end the test. So we should simply let shouldWaitForPendingPaints return false if gCurrentCanvas is null --- we shouldn't wait for pending paints if we're not going to take snapshots.
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1291962041.1291962713.22784.gz&fulltext=1#err0
layout/reftests/bugs/582476-1.svg still fails even with the patch in bug 617525 :-(. The log is kinda interesting although part of it's missing due to a bug that I'll fix momentarily:

REFTEST TEST-START | file:///Users/cltbld/talos-slave/tryserver_snowleopard_test-reftest/build/reftest/tests/layout/reftests/bugs/582476-1.svg
REFTEST INFO | MozInvalidateEvent didn't invalidate

This is actually a bug in my patches. We should call FlushRendering before producing this warning, to ensure that restyles and reflows are flushed before we check whether the MozInvalidateEvent triggered an invalidate.

REFTEST INFO | Internal error: descendant frame generated a MozAfterPaint event, but the root document doesn't have one!

I think this is the real problem. cjones mentioned he'd seen this happen before. I need to figure out how to look deeper, maybe record and replay.
I wonder ... reftest USE_WIDGET_LAYERS is disabled on that box for some reason. That means nsGfxScrollFrameInner::BuildDisplayList doesn't update mScrollPosAtLastPaint when we call drawWindow. I need to think about whether that could cause a problem here.
REFTEST INFO | drawWindow flags = DRAWWINDOW_DRAW_CARET | DRAWWINDOW_DRAW_VIEW; window size = 802,998; test browser size = 800,1000

Oh, the window is two pixels too short! Armen, is there a bug for that problem?
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1291958518.1291958864.4103.gz&fulltext=1#err0
Apparently my debug code crashes. I have no idea what's actually going on here. This line is probably important:
   Searching for plugin mEntryPoint 0
I don't know what it means for instance->GetPlugin() to be null, what we should do in that case, or how any of my changes could have caused this. Josh, any idea?
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1291958608.1291959820.8317.gz&fulltext=1#err4
is another instance of the same thing in a different mochitest.

Looking at the two tests, it looks like it happens when we disable the plugin and then destroy an instance.
(In reply to comment #32)
> REFTEST INFO | drawWindow flags = DRAWWINDOW_DRAW_CARET | DRAWWINDOW_DRAW_VIEW;
> window size = 802,998; test browser size = 800,1000
> 
> Oh, the window is two pixels too short! Armen, is there a bug for that problem?

I think bug 593874. I believe jhford had ideas on how to change the screen resolution.
(In reply to comment #31)
> I wonder ... reftest USE_WIDGET_LAYERS is disabled on that box for some reason.
> That means nsGfxScrollFrameInner::BuildDisplayList doesn't update
> mScrollPosAtLastPaint when we call drawWindow. I need to think about whether
> that could cause a problem here.

That was the problem that caused the bug that that reftest is testing.
I figured out what was causing the nsPluginTag assertion. When a plugin is disabled, all currently running instances of the plugin are immediately destroyed but the nsPluginInstanceOwner (and the nsObjectFrame) still exist. When we come to destroy the nsObjectFrame we try to stop the plugin instance, which gets us back into nsPluginHost::StopPluginInstance on the already-destroyed instance, which leads inexorably to that assertion.
Attachment #496725 - Attachment is obsolete: true
Attachment #497113 - Flags: review?(joshmoz)
Attachment #496725 - Flags: review?(joshmoz)
Attachment #497113 - Attachment description: nsPluginHost::StopPluginInstance should skip plugin instances that are already destroyed → Part 5: nsPluginHost::StopPluginInstance should skip plugin instances that are already destroyed
Attached patch Part 6 v2Splinter Review
Cleaned up some unnecessary \ns. Also, before we warn that MozReftestInvalidate removed reftest-wait but didn't cause any invalidation, call FlushRendering to make sure restyles and reflows are flushed.
Attachment #497114 - Flags: review?(dbaron)
Attachment #496726 - Attachment is obsolete: true
Attachment #496726 - Flags: review?(dbaron)
Here's what I think could be happening with 582476-1.svg:

1) setup() runs during the toplevel document load event, setting the scroll position y to 10000 (say). This triggers invalidation and queues MozAfterPaint events.
2) All MozAfterPaint event handlers run --- but the actual painting of the document to the window does not happen yet.
3a) The harness fires MozReftestInvalidate. The test restyles the IFRAME content which invalidates a rect at y=10050 (say) in the IFRAME.
3b) nsHTMLScrollFrame::InvalidateInternal receives a damage rect relative to the scrollframe --- which in this case is at y=50 (because we've scrolled down to get this rect into view).
3c) nsHTMLScrollFrame::InvalidateInternal does "damage += GetScrollPosition() - mInner.mScrollPosAtLastPaint". This is adjusting the damage rect to match the coordinates used for any retained layers --- i.e. correcting the translation in step 3b to use the scroll position during the last layer tree update. In this case, mScrollPosAtLastPaint is zero so the damage rect ends up at 10050 again. (Indeed, in terms of the last layer tree update, the invalid area is completely out of view.)
3d) That rectangle is outside the scrollport so the invalidation is swallowed here and no MozAfterPaint is generated for the outer document.
Basically the problem is that for invalidating ThebesLayers we need to use coordinates assuming scroll positions as of the last layer tree update. But for the reftest harness to work correctly, we need to MozAfterPaint to return coordinates that are correct for the scroll positions in the current frame tree. (Because our drawWindow is going to use the current frame tree. Even if USE_WIDGET_LAYERS is enabled, we're going to flush changes to the layer tree so it reflects the current frame tree before we take the snapshot.)

Note, for actually invalidating the window it doesn't really matter whether we use old scroll positions or current scroll positions, because if they're different then we must already have a pending invalidation of the window for the entire scrolled area.
Comment on attachment 497115 [details] [diff] [review]
Part 10: Don't wait for pending paints to flush in crashtests

>From: Robert O'Callahan <robert@ocallahan.org>
>Bug 617512. Part 10: Don't wait for pending paints to flush in crashtests. r=dbaron

512 != 152.
Comment on attachment 497113 [details] [diff] [review]
Part 5: nsPluginHost::StopPluginInstance should skip plugin instances that are already destroyed

>+  PRBool IsDestroyed() { return !mPlugin; }

I don't really want to cloud the water with another definition of instance state, can you possibly make use of "mRunning" and maybe even one of the existing public methods based on "mRunning"?
OK, how about using mRunning >= DESTROYING?
Attachment #497113 - Attachment is obsolete: true
Attachment #497228 - Flags: review?(joshmoz)
Attachment #497113 - Flags: review?(joshmoz)
This test fails for me because the opacity calculations in 594333-1.html are done by D3D10 layers, and produce slightly different color channel values from the drawing we do via D2D in 594333-1-ref.html. This patch fixes that.
Attachment #497124 - Flags: review?(tnikkel) → review+
A tryserver push with these patches (and a few others) was entirely green. Yes, really! So I'd like to get reviews and land now :-)
Comment on attachment 497228 [details] [diff] [review]
Part 5 v3: Change HasText to GetComponentAlphaBounds

>+  bool HasStartedDestroying() {
>+    return mRunning > DESTROYING;
>+  }

I think you wanted ">=" not just ">".
Attachment #497228 - Flags: review?(joshmoz) → review-
Comment on attachment 497124 [details] [diff] [review]
Part 11: Use mScrollPosAtLastPaint only for ThebesLayer invalidation

Do we want to make the same changes to nsXULScrollFrame::InvalidateInternal?
Attached patch Part 5 v4Splinter Review
Good catch
Attachment #497308 - Flags: review?(joshmoz)
Attachment #497308 - Flags: review?(joshmoz) → review+
Attachment #497337 - Flags: review?(tnikkel) → review+
Comment on attachment 496724 [details] [diff] [review]
Part 4 v2

r=dbaron
Attachment #496724 - Flags: review?(dbaron) → review+
Comment on attachment 497114 [details] [diff] [review]
Part 6 v2

r=dbaron
Attachment #497114 - Flags: review?(dbaron) → review+
Attachment #496730 - Flags: review?(dbaron) → review+
Attachment #496733 - Flags: review?(dbaron) → review+
Comment on attachment 497115 [details] [diff] [review]
Part 10: Don't wait for pending paints to flush in crashtests

You're sure this doesn't get called before we've set up the canvas, right?
Attachment #497115 - Flags: review?(dbaron) → review+
Yes, InitCurrentCanvasWithSnapshot is called before any of the listeners that call shouldWaitForPendingPaints can be called.
Whiteboard: [needs landing]
Depends on: 620512
Depends on: 620790
Depends on: 621099
(In reply to comment #44)
> I don't really want to cloud the water with another definition of instance
> state, can you possibly make use of "mRunning" and maybe even one of the
> existing public methods based on "mRunning"?

As it turned out, mRunning >= DESTROYING is not the same thing as nsPluginHost::StopPluginInstance having been called on the instance, because in some places we call Stop() outside of StopPluginInstance.
I backed out part 5 to try to fix various crash bugs.
No longer depends on: 620512
No longer depends on: 620955
No longer depends on: 620794
No longer depends on: 620790
Depends on: 620512
Depends on: 620955
Depends on: 620794
Depends on: 620790
Depends on: 622829
Depends on: 1023618
You need to log in before you can comment on or make changes to this bug.