Closed Bug 1369411 Opened 3 years ago Closed 2 years ago

Intermittent layout/base/tests/test_bug394057.html | changing font pref should change width of table (monospace) - got 177, expected 216

Categories

(Core :: Layout: Text and Fonts, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox-esr52 --- unaffected
firefox59 --- wontfix
firefox60 --- fixed
firefox61 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: emilio)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disabled][Pernosco])

Attachments

(1 file)

I think this has a lower failure rate in the last week (<25), but it is still sort of high.  I don't think we need to make this a high priority, but the right people should be aware of this.

:jet, I see you are the triage owner for many of the core::layout components, but not 'layout: text', could you help me find a triage owner/contact on the 'layout: text' team?
Flags: needinfo?(bugs)
Whiteboard: [stockwell needswork]
(In reply to Joel Maher ( :jmaher) from comment #3)
> I think this has a lower failure rate in the last week (<25), but it is
> still sort of high.  I don't think we need to make this a high priority, but
> the right people should be aware of this.
> 
> :jet, I see you are the triage owner for many of the core::layout
> components, but not 'layout: text', could you help me find a triage
> owner/contact on the 'layout: text' team?

You can ping me for those as well. Are you able to update this in bugzilla? Thx!
Flags: needinfo?(bugs)
Whiteboard: [stockwell needswork] → [stockwell unknown]
Priority: -- → P3
this increased in frequency, specifically on linux.

here is a recent log:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20x64%20clipboard%20debug%20stylo&tochange=41e3ae0e04b1bcf94520145618ecc279195d8900&fromchange=933be59f89bdd36a320527bc29b44566bae9ad3b&selectedJob=136826425

and the related screenshot:
https://public-artifacts.taskcluster.net/dlCh5nWPSwuuhFjBrJVh0Q/0/public/test_info//mozilla-test-fail-screenshot_qvDnhn.png


here is some data from the above log:
[task 2017-10-18T04:00:52.347Z] 04:00:52     INFO - TEST-START | layout/base/tests/test_bug394057.html
[task 2017-10-18T04:00:52.458Z] 04:00:52     INFO - TEST-INFO | started process screentopng
[task 2017-10-18T04:00:52.941Z] 04:00:52     INFO - TEST-INFO | screentopng: exit 0
[task 2017-10-18T04:00:52.945Z] 04:00:52     INFO - Buffered messages logged at 04:00:52
[task 2017-10-18T04:00:52.945Z] 04:00:52     INFO - TEST-PASS | layout/base/tests/test_bug394057.html | can't find serif and monospace fonts of different width 
[task 2017-10-18T04:00:52.946Z] 04:00:52     INFO - TEST-PASS | layout/base/tests/test_bug394057.html | changing font pref should change width of table (serif) 
[task 2017-10-18T04:00:52.946Z] 04:00:52     INFO - Buffered messages finished
[task 2017-10-18T04:00:52.946Z] 04:00:52     INFO - TEST-UNEXPECTED-FAIL | layout/base/tests/test_bug394057.html | changing font pref should change width of table (monospace) - got 177, expected 216
[task 2017-10-18T04:00:52.946Z] 04:00:52     INFO -     SimpleTest.is@SimpleTest/SimpleTest.js:312:5
[task 2017-10-18T04:00:52.946Z] 04:00:52     INFO -     step3@layout/base/tests/test_bug394057.html:79:5
[task 2017-10-18T04:00:52.946Z] 04:00:52     INFO - GECKO(3349) | MEMORY STAT | vsize 657MB | residentFast 97MB | heapAllocated 22MB
[task 2017-10-18T04:00:52.946Z] 04:00:52     INFO - TEST-OK | layout/base/tests/test_bug394057.html | took 187ms


:jfkthame, is there anything you can see here that would be a simple fix?
Flags: needinfo?(jfkthame)
Whiteboard: [stockwell unknown] → [stockwell needswork]
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #24)
> this increased in frequency, specifically on linux.
> 
> here is a recent log:
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-
> searchStr=win%20x64%20clipboard%20debug%20stylo&tochange=41e3ae0e04b1bcf94520
> 145618ecc279195d8900&fromchange=933be59f89bdd36a320527bc29b44566bae9ad3b&sele
> ctedJob=136826425
> 
> and the related screenshot:
> https://public-artifacts.taskcluster.net/dlCh5nWPSwuuhFjBrJVh0Q/0/public/
> test_info//mozilla-test-fail-screenshot_qvDnhn.png

Uhhh... this screenshot doesn't seem to show anything about the failure; it's happily green, and says test_bug394057.html is "currently executing". Is it taken a little too early?

And I can't seem to find this failure at that treeherder link, either. Did you copy/paste the wrong URL, by any chance?
Flags: needinfo?(jfkthame) → needinfo?(jmaher)
:jfkthame, sorry, I did the wrong thing here :(

Here is a log (probably different from the text snippet above, but similar):
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=137336296&lineNumber=8795

and the screenshot:
https://public-artifacts.taskcluster.net/MjhYfumfQAmvYO8IkWRrdQ/0/public/test_info//mozilla-test-fail-screenshot_x26qN3.png
Flags: needinfo?(jmaher)
There's still something weird about this so-called "test-fail-screenshot"... it's clearly being taken BEFORE the failure! It doesn't show the state of the test window once the failure has occurred, but before.

And the log seems to be consistent with this:

[task 2017-10-16T23:38:46.466Z] 23:38:46     INFO - TEST-START | layout/base/tests/test_bug394057.html
[task 2017-10-16T23:38:46.591Z] 23:38:46     INFO - GECKO(2076) | ++DOMWINDOW == 32 (0xd7891000) [pid = 2123] [serial = 68] [outer = 0xd8d6d000]
[task 2017-10-16T23:38:46.635Z] 23:38:46     INFO - GECKO(2076) | [Child 2123, Main Thread] WARNING: stylo: Web Components not supported yet: file /builds/worker/workspace/build/src/dom/base/nsDocument.cpp, line 6455
[task 2017-10-16T23:38:46.637Z] 23:38:46     INFO - GECKO(2076) | [Child 2123, Main Thread] WARNING: stylo: Web Components not supported yet: file /builds/worker/workspace/build/src/dom/base/nsDocument.cpp, line 6455
[task 2017-10-16T23:38:46.956Z] 23:38:46     INFO - TEST-INFO | started process screentopng
[task 2017-10-16T23:38:47.544Z] 23:38:47     INFO - TEST-INFO | screentopng: exit 0
[task 2017-10-16T23:38:47.545Z] 23:38:47     INFO - Buffered messages logged at 23:38:46
[task 2017-10-16T23:38:47.545Z] 23:38:47     INFO - TEST-PASS | layout/base/tests/test_bug394057.html | can't find serif and monospace fonts of different width 
[task 2017-10-16T23:38:47.546Z] 23:38:47     INFO - TEST-PASS | layout/base/tests/test_bug394057.html | changing font pref should change width of table (serif) 
[task 2017-10-16T23:38:47.546Z] 23:38:47     INFO - Buffered messages finished
[task 2017-10-16T23:38:47.547Z] 23:38:47     INFO - TEST-UNEXPECTED-FAIL | layout/base/tests/test_bug394057.html | changing font pref should change width of table (monospace) - got 177, expected 216
[task 2017-10-16T23:38:47.548Z] 23:38:47     INFO -     SimpleTest.is@SimpleTest/SimpleTest.js:312:5
[task 2017-10-16T23:38:47.549Z] 23:38:47     INFO -     step3@layout/base/tests/test_bug394057.html:79:5
[task 2017-10-16T23:38:47.550Z] 23:38:47     INFO - GECKO(2076) | MEMORY STAT | vsize 673MB | residentFast 112MB | heapAllocated 20MB
[task 2017-10-16T23:38:47.551Z] 23:38:47     INFO - TEST-OK | layout/base/tests/test_bug394057.html | took 689ms

Note how the screentopng process is run at 23:38:46.956, and has exited by 23:38:47.544; but the test failure isn't logged until 23:38:47.547.

Which means the screenshot is not useful for assessing the state of the test document when the failure actually occurred. This seems like it indicates a bug in the test harness, or something like that?


As for the actual reason for the test failures... I don't have any clear idea at the moment. The test involves changing a font preference (several times), and observing the resulting changes to layout; maybe SpecialPowers.pushPrefEnv isn't entirely reliable about waiting for the changes to take effect before it executes its callback? But that's not code I know anything about, really.
I took a look over the pushPrefEnv:
http://searchfox.org/mozilla-central/source/testing/specialpowers/content/specialpowersAPI.js#1062

we set a pref, then wait for an observer on that pref to be called and when it is we then call the callback function:
http://searchfox.org/mozilla-central/source/testing/specialpowers/content/specialpowersAPI.js#1194

Is it possible the observer for a specific pref can be called without the change taking place?  Is it possible that when changing font.name.serif.x-western, there are async operations that need to take place in layout before we really see the effect of the font change?

This failure is closing in on our disable threshold (200 failures in 30 days), so I wanted to look at it in more detail before it gets there.
Flags: needinfo?(jfkthame)
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/749e673955af
Disable layout/base/tests/test_bug394057.html on linux for frequent failures. r=me, a=testonly
I disabled this for frequent failures- please remember to enable this when working on a fix
Keywords: leave-open
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]
as a reminder this test is still disabled
So I investigated this with roc's new debugging tool, which had a recording of a failure of this test.

It makes perfect sense that's intermittent, fwiw :).

So the test is changing a font pref, then reading out offsetWidth, makes sense so far. Except the reflow that happens for pref changes runs off a timer [1], so by the time we read offsetWidth, the pref change still hasn't made any effect, so we read the old value, which is of course different, and the test fails.

[1]: https://searchfox.org/mozilla-central/rev/14d933246211b02f5be21d2e730a57cf087c6606/layout/base/nsPresContext.cpp#835
Whiteboard: [stockwell disabled] → [stockwell disabled][Pernosco]
(In reply to Emilio Cobos Álvarez [:emilio] from comment #56)
> So I investigated this with roc's new debugging tool, which had a recording
> of a failure of this test.
> 
> It makes perfect sense that's intermittent, fwiw :).
> 
> So the test is changing a font pref, then reading out offsetWidth, makes
> sense so far. Except the reflow that happens for pref changes runs off a
> timer [1], so by the time we read offsetWidth, the pref change still hasn't
> made any effect, so we read the old value, which is of course different, and
> the test fails.
> 
> [1]:
> https://searchfox.org/mozilla-central/rev/
> 14d933246211b02f5be21d2e730a57cf087c6606/layout/base/nsPresContext.cpp#835

So, the thing is that pushPrefEnv kind of already tries to deal with this situation though, delaying the callback a bit (which was going to be my fix for this :P):

 https://searchfox.org/mozilla-central/rev/61d400da1c692453c2dc2c1cf37b616ce13dea5b/testing/specialpowers/content/specialpowersAPI.js#1144

Ben, do you know if the way things stand something may go awry in a way that, even with the hoops the function above goes through, a timer like:

 https://searchfox.org/mozilla-central/rev/61d400da1c692453c2dc2c1cf37b616ce13dea5b/layout/base/nsPresContext.cpp#835

Could not have fired by the time the callback runs?

If so, what's a more reliable solution? Maybe posting a runnable to the associated document? That should be trivial.
Flags: needinfo?(bkelly)
There is a difference between content script `setTimeout(f, 0)`, Timer.jsm `setTimeout(f, 0)`, and creating an nsITimer with delay 0.

As an added complication SpecialPower.pushPrefEnv() code can end up using either content script setTimeout() or the Timer.jsm setTimeout():

https://searchfox.org/mozilla-central/rev/61d400da1c692453c2dc2c1cf37b616ce13dea5b/testing/specialpowers/content/specialpowersAPI.js#747-754

The Timer.jsm `setTimeout(f, 0)` basically creates an nsITimer with delay 0.  So that is the same as what nsPresContext.cpp is doing.  This essentially locks a mutex, adds an entry in nsTimerThread's list, and then you wait for nsTimerThread to wake up.  At that point it fires a runnable back at the original thread.  That could be very fast, or it could be delayed a bit.  On windows platforms I've seen nsTimerThread fire a few milliseconds late since its using primitives that only have a millisecond resolution.

In contrast, the content window `setTimeout(f, 0)` is a bit different.  You can hit one of these paths:

1) The `setTimeout(f, 0)` dispatches a runnable to the current thread if its not "nested deeply".
2) If its nested deeply, then `setTimeout(f, 0)` clamps to a 4ms delay and uses nsITimer to accomplish that.
3) In both cases, the scheduled callback from a content script `setTimeout()` can be coallesced with other callbacks scheduled to run at the same time or earlier.  This means another runnable can already be in the event queue and effectively jump your new `setTimeout(f,0)` up in the event queue from where it normally would have run.

Anyway, back to the problem at hand...

Is there a good reason for nsPresContext.cpp to use an nsITimer with a delay of 0?  A normal runnable would be faster and more efficient in terms of CPU/memory.  If you can do that without breaking a bunch of other tests, then I think thats your best bet.

The SpecialPower.pushPrefEnv() thing is a real hack and I don't have a solution for fixing it.  Maybe we should ask :njn to look at providing a proper callback for preference changes.

Hope that helps.
Flags: needinfo?(bkelly)
Or if you need to wait for reflow in the test for the pref change to be effective, maybe it should be using requestAnimationFrame() after awaiting the pushPrefEnv?
(In reply to Ben Kelly [:bkelly] from comment #59)
> Or if you need to wait for reflow in the test for the pref change to be
> effective, maybe it should be using requestAnimationFrame() after awaiting
> the pushPrefEnv?

No, runnable plus what pushPrefEnv is doing should be enough I suspect. Thanks a lot!
ni? me to remember to write the patch when I'm done with what I'm working on now :)
Flags: needinfo?(emilio)
Assignee: nobody → emilio
Flags: needinfo?(emilio)
Comment on attachment 8956011 [details]
Bug 1369411: Use a runnable instead of a timer for pref changes in nsPresContext.

Jonathan seems a bit inactive lately and this should be straight-forward. Bobby, mind reviewing?
Attachment #8956011 - Flags: review?(jfkthame) → review?(bobbyholley)
Comment on attachment 8956011 [details]
Bug 1369411: Use a runnable instead of a timer for pref changes in nsPresContext.

https://reviewboard.mozilla.org/r/224958/#review233502

From the discussion in bug 303987, it looks like this uses a timer because we wanted cancellation, and there wasn't an easy way to have cancellable runnables back then.

So it seems like we should be using CancellableRunnable here, and storing a ref to that rather than a boolean. We could consider dropping that if there's some reason why cancelability was important back then and isn't today, but it seems easy enough to just make it cancelable.

::: layout/base/nsPresContext.h:1234
(Diff revision 1)
>  
>    void PreferenceChanged(const char* aPrefName);
>    static void PrefChangedCallback(const char*, void*);
>  
>    void UpdateAfterPreferencesChanged();
> -  static void PrefChangedUpdateTimerCallback(nsITimer *aTimer, void *aClosure);
> +  void SendPrefChangedRunnableIfNeeded();

Nit: Call this "Dispatch" instead of "Send". The latter implies some sort of messaging or IPC that isn't present here.
Attachment #8956011 - Flags: review?(bobbyholley) → review-
Comment on attachment 8956011 [details]
Bug 1369411: Use a runnable instead of a timer for pref changes in nsPresContext.

https://reviewboard.mozilla.org/r/224958/#review233502

I don't think adding a cancellable runnable vs. an "already detached" pointer is worth the churn. I bailed when `mShell` is already detached instead, though rebuildAllStyleData already did that, plus it's async now, so in practice this only prevents us from querying the prefs unnecessarily.

I didn't change the syscolor stuff and what not because those tweak global state and were a bit more risky / difficult to prove right.
Comment on attachment 8956011 [details]
Bug 1369411: Use a runnable instead of a timer for pref changes in nsPresContext.

https://reviewboard.mozilla.org/r/224958/#review233652

::: layout/base/nsPresContext.cpp:853
(Diff revision 2)
> +  if (!mShell) {
> +    return;
> +  }

Nit: I think this should go before the mContainer check.
Attachment #8956011 - Flags: review?(bobbyholley) → review+
Pushed by ecoal95@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/dd1c00afe429
Use a runnable instead of a timer for pref changes in nsPresContext. r=bholley
Pushed by philringnalda@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d36521a0a4ea
followup, redisable test_bug394057.html on Android, where it still fails, on a CLOSED TREE
Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(jfkthame)
Resolution: --- → FIXED
Keywords: leave-open
Target Milestone: --- → mozilla61
Annoyingly, this is still happening on Beta on a semi-regular basis. How risky of an uplift would this be at this point in the cycle? Or should I just skip the test instead to avoid having to star it for the next year on ESR60? :)
Flags: needinfo?(emilio)
Comment on attachment 8956011 [details]
Bug 1369411: Use a runnable instead of a timer for pref changes in nsPresContext.

I don't think it's a risky uplift.

Approval Request Comment
[Feature/Bug causing the regression]: n/a
[User impact if declined]: none
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: yes
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: no
[Why is the change risky/not risky?]: only changes the way we send the task to trigger pref changes so that it doesn't make tests flaky.
[String changes made/needed]: none
Flags: needinfo?(emilio)
Attachment #8956011 - Flags: approval-mozilla-beta?
Comment on attachment 8956011 [details]
Bug 1369411: Use a runnable instead of a timer for pref changes in nsPresContext.

ok let's take this.  approved for 60.0b14.
Attachment #8956011 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.