Closed Bug 1089417 Opened 10 years ago Closed 6 years ago

fail to restyle for media query feature changes between adding/removing an empty stylesheet and next event that forces style computation (breaks golem.de on mobile)

Categories

(Core :: CSS Parsing and Computation, defect)

ARM
Android
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME
mozilla37
Tracking Status
firefox34 --- unaffected
firefox35 + fixed
firefox36 + wontfix
firefox37 - affected
firefox38 - affected
firefox39 --- affected
fennec + ---

People

(Reporter: c.ascheberg, Assigned: dbaron, NeedInfo)

References

()

Details

(Keywords: regression)

Attachments

(13 files)

152.18 KB, image/png
Details
6.97 KB, patch
heycam
: review+
Details | Diff | Splinter Review
1000 bytes, patch
heycam
: review+
Details | Diff | Splinter Review
2.79 KB, patch
heycam
: review+
Details | Diff | Splinter Review
3.24 KB, patch
heycam
: review+
Details | Diff | Splinter Review
3.56 KB, patch
heycam
: review+
Details | Diff | Splinter Review
7.93 KB, patch
heycam
: review+
Details | Diff | Splinter Review
3.66 KB, patch
heycam
: review+
Details | Diff | Splinter Review
3.46 KB, patch
heycam
: review+
Details | Diff | Splinter Review
1.59 KB, patch
Details | Diff | Splinter Review
1.22 KB, patch
Details | Diff | Splinter Review
1.38 KB, patch
heycam
: review+
Details | Diff | Splinter Review
2.34 KB, patch
heycam
: review+
Details | Diff | Splinter Review
Attached image screenshot
Open website http://www.golem.de in Firefox for Android and scroll down. If the website looks ok, try reloading and check again.
For me, in most cases, the font size is too big an the pictures are placed incorrectly, see screenshot. Tested on a Nexus 4.

This regressed between 2014-10-09 and 2014-10-10:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f0bb13ef0ee4&tochange=50b689feab5f
[Tracking Requested - why for this release]: regression
tracking-fennec: --- → ?
Window will need to be narrowed on inbound
I did some more testing, I think the bug was introduced somewhere in http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f0bb13ef0ee4&tochange=f5557f04db0b
I was able to reproduce the bug in https://hg.mozilla.org/integration/mozilla-inbound/rev/5839fbd7b8c6 but not in https://hg.mozilla.org/integration/mozilla-inbound/rev/7037d68868eb (sometimes it may take some tries to trigger it though)
Flags: needinfo?(dbaron)
(In reply to Christian Ascheberg from comment #4)
> I was able to reproduce the bug in
> https://hg.mozilla.org/integration/mozilla-inbound/rev/5839fbd7b8c6 but not
> in https://hg.mozilla.org/integration/mozilla-inbound/rev/7037d68868eb
> (sometimes it may take some tries to trigger it though)

One of those changesets is a comment-only one, so it should be possible to narrow further or the bisection went wrong.
The changelog is this: https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=7037d68868eb&tochange=5839fbd7b8c6
All commits are for bug 1047928.
The builds that the bisection is based on are this https://ftp.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/mozilla-inbound-android/1412803747/ and the previous one.
I did not find separate builds for each of the commits in that folder, is there another way?
No, they were landed as one commit. It's possible they don't even compile independently. I see dbaron has needinfo?ed himself on this bug so I suppose he's looking into it.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #7)
> No, they were landed as one commit. 

That should've said push, not commit. But yes, basically there's no builds made for the separate changesets.
I'm hoping that the fix for bug 1086937 will also fix this, though it's not clear whether that will be the case.

(Is this reproducable on other platforms?  I'm currently set up to debug on Firefox OS or on Firefox desktop (Linux) much more easily than on Android.)
tracking-fennec: ? → 35+
Kevin - Can you retest?
Flags: needinfo?(kbrosnan)
Note that bug 1086937 probably did not make this morning's nightly, but should make the next one.
The bug still occurs in the current Android Nightly (acbd7b68fa8c)
Flags: needinfo?(kbrosnan)
Flags: needinfo?(dbaron)
Flags: needinfo?(dbaron)
I haven't been able to see the bug.  I tried on a Galaxy Tab (with and without the Display -> Text Reflow setting enabled) running Firefox for Android and on a Flame running Firefox OS.

So I'll try generating a bunch of builds to bisect which part of the patch series caused it, so that you can test them.
Er, except all our build infrastructure is down today (try server is closed), so I guess I'll do that some other time.
Summary: regression: website not rendered correctly → regression: golem.de not rendered correctly
And, really, you should replace all those URLs with https: rather than http.  So, actually, let me retype the entire comment:

OK, builds to test are currently building.  Only the first of the builds below is done, but the rest should be ready within about 30 minutes:

build after patch 3:
https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dbaron@mozilla.com-91c8a953f474/try-android/fennec-35.0a1.en-US.android-arm.apk

build after patch 6:
https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dbaron@mozilla.com-5382b040fb5f/try-android/fennec-35.0a1.en-US.android-arm.apk

build after patch 7:
https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dbaron@mozilla.com-601f7b0dd154/try-android/fennec-35.0a1.en-US.android-arm.apk

build after patch 8:
https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dbaron@mozilla.com-4bb3fcef6b3f/try-android/fennec-35.0a1.en-US.android-arm.apk

build after patch 9:
https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dbaron@mozilla.com-81a240132940/try-android/fennec-35.0a1.en-US.android-arm.apk

build after patch 11:
https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dbaron@mozilla.com-9b35cd8bb6b0/try-android/fennec-35.0a1.en-US.android-arm.apk

build after patch 12:
https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dbaron@mozilla.com-27389167dee8/try-android/fennec-35.0a1.en-US.android-arm.apk

build after patch 13:
https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dbaron@mozilla.com-8bdc5bc7e98e/try-android/fennec-35.0a1.en-US.android-arm.apk


If you could let me know which of the builds show the problem and which don't (and, in particular, at which point along the sequence things change from working to broken) that would be quite helpful.


(It would also be helpful if you can think of anything that might lead to me seeing the problem on my own device -- I'm not sure what's different, or if I'm just looking at the wrong thing.)
I was able to reproduce the bug in builds "after patch 7", 8 and 9, but *not* in builds "after patch 3" and "after patch 6", and did not test further builds.
I could not reproduce this in Linux Desktop Nightly, and I do not have any other devices at hand.
Flags: needinfo?(c.ascheberg)
OK, thanks, that says this is a regression from https://hg.mozilla.org/integration/mozilla-inbound/rev/a1bc385f0ad4 .

That makes it seem like the conditions for hitting the bug are more likely timing-related (since it's related to when the downloaded fonts finish loading) than device-related (other than perhaps mobile-site vs. desktop-site).

Not sure yet how to proceed, but I'll think about it.
OK, I've debugged this to the point of finding that the new (faster) replacement codepath is continuing to match a rule
  #grandwrapper > footer > nav li { background-color: rgb(0, 0, 0); width: 20; }
whereas rule matching no longer includes that rule.

(They both include a similar rule where width is 33.3333.)

I'm guessing this has something to do with media queries.
The style sheet is http://www.golem.de/staticrl/styles/golem_main-mob_61-min.css .

It's in an @media-rule with @media screen and (min-width:768px).
So I believe this is an existing bug with media queries that was just uncovered by making a performance optimization for downloadable fonts -- where the old, unoptimized, code was wiping out the effects of the bug.

I suspect it might be Android-only because we're doing weird things with the display size, and actually getting partway through loading the page before we tell it the correct display size for the purpose of media queries (which is in itself a performance problem, since we're paying the cost of dynamic style changes resulting from changing media queries).

But the basic problem is this:
 (1) get far enough in the process of pageload to have computed style for elements

 (2) do something that changes the result of a media query, which calls
     nsPresContext::PostMediaFeatureValuesChangedEvent

 (3) before that event is handled, do something to a style sheet that makes us call
     nsStyleSet::DirtyRuleProcessors(eDocSheet), which destroys and recreates the
     nsCSSRuleProcessor for the document level of the cascade.

 (4) resulting from (2), do nsPresContext::MediaFeatureValuesChanged, which calls
     nsStyleSet::MediumFeaturesChanged, which calls
     nsCSSRuleProcessor::MediumFeaturesChanged on the newly-created (in (3)) rule
     processor, which returns false because mRuleCascades is null.


It's not completely clear to me what the problem is yet.  Whatever called (3) should presumably also have called PresShell::RecordStyleSheetChange.  Might it have done so not inside of an update?  I should check.

But the optimization in nsCSSRuleProcessor::MediumFeaturesChanged also seems sketchy -- perhaps it should live in the style set instead, though it would have more false positives where we'd restyle for a media change that way.
Ah, so the missing step is this:

Item (3) above is coming from a call to nsDocument::SetStyleSheetApplicableState on a sheet whose HasRules() method returns false.  This means that when the document calls AddStyleSheetToStyleSets -> nsStyleSet::AddDocStyleSheet -> nsStyleSet::DirtyRuleProcessors, we blow away the old rule processor.

But when the document calls PresShell::StyleSheetApplicableStateChanged, we don't call RecordStyleSheetChange because of the HasRules() check there.

I think that if these HasRules() checks matched between nsStyleSet::DirtyRuleProcessors and the callers of PresShell::RecordStyleSheetChange, we'd be ok.
To confirm my theory that the problem is the mismatch between the HasRules() checks, I'm going to try removing the checks from PresShell::StyleSheet{Added,Removed,ApplicableStateChanged} and see if that makes the bug go away.

If it does, I think the lowest risk fix would be to add a HasRules() check to relevant callers of nsStyleSet::DirtyRuleProcessors, although that's awfully fragile.
(In reply to David Baron [:dbaron] (UTC-8) (needinfo? for questions) from comment #23)
> If it does, I think the lowest risk fix would be to add a HasRules() check
> to relevant callers of nsStyleSet::DirtyRuleProcessors, although that's
> awfully fragile.

And it also doesn't work, since we need a new rule processor that has the correct list of sheets.
I think my plan is to make two patches, either of which alone will fix the bug:

 (1) when we skip restyling because of the HasRules check, do something to ensure we rebuild the rule cascades anyway -- or at least rebuild enough for the media query test.  (Maybe I should just do this by ensuring that once the rule processor has a rule cascades, its replacement does as well.)

 (2) only skip MediumFeaturesChanged if we've never computed style at the style set level, rather than at the rule processor level (this will make things less fragile, but would regress performance without (1) because we'd get spurious restyles when the old rule cascades was null)


I'll hopefully get a chance to try that tomorrow.

At this point I at least have a mochitest that shows the bug in my patch queue.
Another alternative is actually to store a centralized nsMediaQueryResultCacheKey in addition to the one on the RuleCascadeData, and use that to determine the result for MediumFeaturesChanged instead of the return values from the individual rule processors.
Assigning to David since he is actively working on it
Assignee: nobody → dbaron
This was just something that seemed worth asserting in the process of
debugging, since I wanted to see if it was the problem.
Flags: needinfo?(dbaron)
Attachment #8528921 - Flags: review?(cam)
This was just something that seemed worth asserting in the process of
debugging, since I wanted to see if it was the problem.
Attachment #8528922 - Flags: review?(cam)
Note that if I make the style element currently /* empty */ have a rule
in it, the test passes.

Patch 8 also makes the test pass.
Attachment #8528923 - Flags: review?(cam)
This is needed for the equality comparison in
nsCSSRuleProcessor::MediumFeaturesChanged in patch 8.
Attachment #8528925 - Flags: review?(cam)
This depends on patches 5 and 6, and is needed for patch 8.
Attachment #8528928 - Flags: review?(cam)
Note that I did all the patch upload manually because of bug 1105122, so I may have messed up uploading patches.  If I uploading the wrong one, they're in https://hg.mozilla.org/users/dbaron_mozilla.com/patches/ .
One other point:  this isn't actually a regression from bug 1047928; it was an existing bug.  The reduction in selector matching caused by the performance optimization in bug 1047928 uncovered that bug on golem.de, where the rerunning of selector matching following a downloadable font load was (prior to bug 1047928) covering it up.
Summary: regression: golem.de not rendered correctly → fail to restyle for media query feature changes between adding/removing an empty stylesheet and next event that forces style computation (breaks golem.de on mobile)
Component: General → CSS Parsing and Computation
Product: Firefox for Android → Core
Attachment #8528921 - Flags: review?(cam) → review+
Attachment #8528922 - Flags: review?(cam) → review+
Attachment #8528923 - Flags: review?(cam) → review+
Attachment #8528925 - Flags: review?(cam) → review+
Comment on attachment 8528926 [details] [diff] [review]
patch 5 - Add method to save the current media query result cache key from a rule processor.

Review of attachment 8528926 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/style/nsCSSRuleProcessor.cpp
@@ +2918,5 @@
>  
> +UniquePtr<nsMediaQueryResultCacheKey>
> +nsCSSRuleProcessor::SaveMQCacheKey()
> +{
> +  RuleCascadeData *c = mRuleCascades;

"*" next to type.

::: layout/style/nsCSSRuleProcessor.h
@@ +121,5 @@
>  
>    virtual bool MediumFeaturesChanged(nsPresContext* aPresContext) MOZ_OVERRIDE;
>  
> +  /**
> +   * If this rule processor currently have a substantive media query

s/have/has/

@@ +124,5 @@
> +  /**
> +   * If this rule processor currently have a substantive media query
> +   * result cache key, return a copy of it.
> +   */
> +  mozilla::UniquePtr<nsMediaQueryResultCacheKey> SaveMQCacheKey();

To me, "Save" sounds like the wrong word here.  Shouldn't Save be in the method name if it's a method on the nsCSSRuleProcessor instead?  Can we just call it GetMQCacheKey() instead?
Attachment #8528926 - Flags: review?(cam)
Comment on attachment 8528927 [details] [diff] [review]
patch 6 - Pass the previous CSS rule processor to the constructor of the new one (when we replace one with another).

Review of attachment 8528927 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/style/nsStyleSet.cpp
@@ +429,5 @@
> +      // can retrieve them efficiently, even in edge cases like the
> +      // simultaneous removal and addition of a large number of elements
> +      // with scoped sheets.
> +      nsDataHashtable<nsPtrHashKey<Element>,
> +                      nsCSSRuleProcessor*> oldRuleProcessorHash;

Call this oldScopedRuleProcessorHash?
Attachment #8528927 - Flags: review?(cam) → review+
Comment on attachment 8528928 [details] [diff] [review]
patch 7 - Save the previous media query cache key on the rule processor.

Review of attachment 8528928 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/style/nsCSSRuleProcessor.cpp
@@ +1092,5 @@
>                                           aPreviousCSSRuleProcessor)
>    : mSheets(aSheets)
>    , mRuleCascades(nullptr)
> +  , mPreviousCacheKey(aPreviousCSSRuleProcessor
> +                       ? Move(aPreviousCSSRuleProcessor->SaveMQCacheKey())

Is Move() needed here?  I read in UniquePtr.h that it isn't if its argument is a temporary.
Attachment #8528928 - Flags: review?(cam) → review+
Comment on attachment 8528929 [details] [diff] [review]
patch 8 - Only drop MediumFeaturesChanged on the floor if we've never computed style before, rather than never computed style using this rule processor.

Review of attachment 8528929 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/style/nsCSSRuleProcessor.cpp
@@ +2914,5 @@
> +  // enabled).  And if there's nothing cached, it doesn't matter if
> +  // anything changed.  But in the cases where it does matter, we've
> +  // cached a previous cache key to test against, instead of our current
> +  // rule cascades.  See bug 448281 and bug 1089417.
> +  MOZ_ASSERT(!mRuleCascades || !mPreviousCacheKey);

Maybe MOZ_ASSERT(!(mRuleCascades && mPreviousCacheKey)) would express the intent better?
Attachment #8528929 - Flags: review?(cam) → review+
Comment on attachment 8528926 [details] [diff] [review]
patch 5 - Add method to save the current media query result cache key from a rule processor.

Not sure if you meant to review+ this.

Your comments in comment 41 on it sound fine, although I'd probably prefer either CopyMQCacheKey or CloneMQCacheKey rather than Get, since it returns a new object.
Attachment #8528926 - Flags: review?(cam)
(In reply to David Baron [:dbaron] (UTC-8) (needinfo? for questions) from comment #45)
> Your comments in comment 41 on it sound fine, although I'd probably prefer
> either CopyMQCacheKey or CloneMQCacheKey rather than Get, since it returns a
> new object.

Either of those names sound fine.
Attachment #8528926 - Flags: review?(cam) → review+
I'll merge this into the relevant patches.

The errors were:

gecko/layout/style/nsCSSRuleProcessor.cpp:1097: error: no match for ternary 'operator?:' in '(aPreviousCSSRuleProcessor != 0u) ? nsCSSRuleProcessor::CloneMQCacheKey()() : 0'

gecko/layout/style/nsCSSRuleProcessor.cpp:2960: error: conversion from 'int' to non-scalar type 'mozilla::UniquePtr<nsMediaQueryResultCacheKey>' requested
Flags: needinfo?(dbaron)
While I'm not sure how important a site golem.de is, it's not preposterous that this particular combination of two bugs might show up on other sites (probably on Android only, though).  So I'm inclined to think that we should get this fixed on branches even though it isn't, strictly speaking, a regression.
Comment on attachment 8528925 [details] [diff] [review]
patch 4 - Define equality operators on nsMediaQueryResultCacheKey and the types it contains.

Approval Request Comment
[Feature/regressing bug #]: Bug 1047928, in particular https://hg.mozilla.org/integration/mozilla-inbound/rev/a1bc385f0ad4 , caused this to appear on golem.de.  I believe the bug was present in the original implementation of media queries, though.
[User impact if declined]: Bad styling (styles appearing for the wrong media query results, or potentially for a mix of media query results) on golem.de, and perhaps plausibly on other sites.
[Describe test coverage new/current, TBPL]: Mochitest in tree; patch green on inbound.
[Risks and why]: I believe the changes are relatively low risk given that what they do is cause recomputation of style in slightly more cases than before, although a patch with 142 lines of code (collectively) has the usual risk of basic mistakes in a patch of that size.
[String/UUID change made/needed]: no
Attachment #8528925 - Flags: approval-mozilla-aurora?
Attachment #8528926 - Flags: approval-mozilla-aurora?
Attachment #8528927 - Flags: approval-mozilla-aurora?
Attachment #8528928 - Flags: approval-mozilla-aurora?
Attachment #8528929 - Flags: approval-mozilla-aurora?
Approval Request Comment
[Feature/regressing bug #]: see comment 52
[User impact if declined]: see comment 52
[Describe test coverage new/current, TBPL]: none
[Risks and why]: lower risk patch for beta:  back out the optimization that caused this
[String/UUID change made/needed]: no
Attachment #8533053 - Flags: approval-mozilla-beta?
Attachment #8533053 - Attachment is patch: true
Attachment #8533053 - Attachment mime type: message/rfc822 → text/plain
Attachment #8528929 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8528928 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8528927 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8528926 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8528925 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8533053 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Hmm I still sometimes see this bug in Nightly. I can check if the bug is still reproducible in Beta builds some time later.
I installed the current Android Firefox Beta on my phone and I think that version should have the patch. But I still sometimes see this bug there as well, especially when loading the website for the first time. Reloading the website seems to be less problematic.
Flags: needinfo?(dbaron)
If you see the bug some of the time in Beta, that might just mean that it was present some of the time prior to the change identified as the cause in comment 17 (which was backed out for Beta), and that the restyle caused by the downloadable font load covered up the problem only most of the time rather than all of the time.

I'm more puzzled by your report that you're seeing it in nightly.  I guess I should see if I can borrow a Nexus 4 again before the holidays...
Flags: needinfo?(dbaron)
Flags: needinfo?(dbaron)
I am using Nightly as my default daily mobile browser, and as I said, I still see the bug there sporadically. And I can reproduce it in Beta. I just could never reproduce it in the 34 release version, not when I tried before I filed this, and not now.
Is it possible that there is a second regression in the same push?
But it is really difficult to regression-test this, because I never know how many tries I should do before I see the bug, and it might be happening less often now. Any other way I could help?
I see the problem on a relative's Nexus 5, but I don't think I'm set up to debug this until I'm back in the office.  Probably the fix isn't complete in some way, or only fixes a simplification of the problem rather than the original.
OK, I just noticed a relatively obvious problem with the original patch; nsCSSRuleProcessor::CloneMQCacheKey doesn't handle cloning an mPreviousCacheKey.  That would make the bug still happen if the steps happened twice.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I've tested that this test fails with the current state of the tree
(i.e., without patch 10).
Attachment #8543521 - Flags: review?(cam)
Try run:
https://hg.mozilla.org/try/pushloghtml?changeset=dd7bec8ab1ba
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=dd7bec8ab1ba
https://tbpl.mozilla.org/?tree=Try&rev=dd7bec8ab1ba

I haven't tested that this actually fixes golem.de, but I think it's pretty likely that this fixes the bit that I missed.  I'll try to do that once the build finishes, assuming I can borrow the Nexus 5 again when it's done.  (If I can't, I'll at least link to the build.)
OK, the site is still broken even with that patch.  Not sure what else is wrong.

(Part of the problem here is that I'd tested that what I pushed in comment 24 against the actual site, but then tested the revisions I made to fix comment 25 only against the automated test that was fixed by the original patch in comment 24.)
Flags: needinfo?(dbaron)
I debugged a bit of what's going wrong on a relative's Nexus 5 (with logcat-printf debugging).

I think the remaining problem is that the Viewport Change (from 980x480 to 360x567) leads to nsPresContext::PostMediaFeatureValuesChangedEvent, but that event might not get processed immediately.  In the failing case, we get a GetRuleCascade call prior to the nsPresContext::HandleMediaFeatureValuesChangedEvent that encounters a null mRuleCascades and reconstructs them, with the *new* viewport size.  (In the success case, we actually end up doing a lot *more* work prior to the HandleMediaFeatureValuesChangedEvent, but we do the MediaFeatureValuesChanged call at about the same amount of work after the event, and the small number of GetRuleCascade() calls in-between don't involve having mRuleCascades null.)

We need to avoid doing things that trigger GetRuleCascade when nsPresContext::mPendingMediaFeatureValuesChanged is true.
Ideally, I'd like to use gdb to find out what the stack to that call of GetRuleCascade() is.  Without knowing that, I don't think I can figure out how to write an automated test for this final problem.  (I could try guessing what the stack is, but I haven't thought of anything that would do it yet.)  However, I don't know how to use gdb on a Nexus 5 that I've borrowed from somebody else in a way that I know I can restore the device to its original (unrooted) condition.
Actually, I suspect HasAttributeDependentStyle is a likely entry-point.  And I can check that with printf-debugging.
I was wrong, the call was coming through the 3-argument form of ProbePseudoElementStyle.

(In the case where things worked, the VIEWPORT_CHANGE is followed by a HasStateDependentStyle.)
I just spent an hour or two trying to write a mochitest for this remaining problem (comment 68 etc.), and I've been unable to do so.  Part of this because I don't fully understand the sequence of events since I was unable to get my hands on a device on which I could actually debug with a debugger.  I believe the sequence of events requires doing a resize (in a way that changes the width/height media queries) without immediately doing a reflow (which resizing an iframe does).  I think this is happening in the setup that Fennec does for the CSS viewport, but I'm not sure how to simulate in in a mochitest.

So I'm going to post the patch I wrote last week that I believe fixes the remaining problem (although I never got a chance to test it on a device), without a test.
I've put aside working on the remaining problem for a bit, since both approaches I tried resulted in lots of test failures.  But I'll hopefully get back to it after I finish bug 960465 and the related Transitions spec editing.

I also should figure out what, if anything, to uplift.
Comment on attachment 8543521 [details] [diff] [review]
patch 9 - Add mochitest that exercises case of clearing rule cascades twice

Review of attachment 8543521 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/style/test/test_bug1089417.html
@@ +26,5 @@
>         "media query change should have restyled");
> +
> +    f.height = "200";
> +    fdoc.getElementById("s").disabled = true;
> +    fdoc.getElementById("s").disabled = false;

I just want to clarify what happens here.

(1) The f.height assignment ends up calling nsCSSRuleProcessor::MediumFeaturesChanged, which at this point will have mRuleCascades, and thus will call RefreshRuleCascade which clears out mPreviousCacheKey and then generates a new mRuleCascades with a cache key for which HasFeatureConditions would return true (due to the @media rule in the file).

(2) The first disabled assignment calls ClearRuleCascades, which, seeing that there is no current mPreviousCacheKey, clones the current mRuleCascades->mCacheKey, then clears mRuleCascades.

(3) The second disabled assignment calls ClearRuleCascades, which, seeing that there *is* a current mPreviousCacheKey, leaves it alone, then clears mRuleCascades again (though it was already null).

That all seems right to me, so I must be missing something.  Is it that HasFeatureConditions on the new rule cascade's cache key in step (1) would return false?
(In reply to Cameron McCormack (:heycam) from comment #75)
> (1) The f.height assignment ends up calling
> nsCSSRuleProcessor::MediumFeaturesChanged, which at this point will have
> mRuleCascades, and thus will call RefreshRuleCascade which clears out
> mPreviousCacheKey and then generates a new mRuleCascades with a cache key
> for which HasFeatureConditions would return true (due to the @media rule in
> the file).

So the f.height assignment actually doesn't get processed until the flush, after the disabled sets, I believe.  (That was confusing; I think the test would work in either order.)

> (2) The first disabled assignment calls ClearRuleCascades, which, seeing
> that there is no current mPreviousCacheKey, clones the current
> mRuleCascades->mCacheKey, then clears mRuleCascades.

It doesn't go through ClearRuleCascades (well, it does, but unnecessarily, since CSSStyleSheet::SetEnabled does two different things that handle changes).  I think it goes through CSSStyleSheet::SetEnabled -> nsDocument::SetStyleSheetApplicableState -> nsDocument::AddStyleSheetToStyleSets -> nsStyleSet::AddDocStyleSheet -> nsStyleSet::DirtyRuleProcessors -> (maybe via EndUpdate) -> nsStyleSet::GatherRuleProcessors, which creates an entirely new nsCSSRuleProcessor, passing the old one to its constructor.

> (3) The second disabled assignment calls ClearRuleCascades, which, seeing
> that there *is* a current mPreviousCacheKey, leaves it alone, then clears
> mRuleCascades again (though it was already null).

This creates an entirely new nsCSSRuleProcessor again, calling nsCSSRuleProcessor::CloneMQCacheKey on the rule processor created in (2), which incorrectly fails to clone the key.
Flags: needinfo?(cam)
Ah, I see.  mRuleCascades is null on the rule processor created in (2), which is why it's failing to clone the cache key.
Flags: needinfo?(cam)
Attachment #8543521 - Flags: review?(cam) → review+
Comment on attachment 8543522 [details] [diff] [review]
patch 10 - Make CloneMQCacheKey clone the mPreviousCacheKey, to fix bug 1089417 for multiple rebuilds of the rule cascade in sequence

Review of attachment 8543522 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/style/nsCSSRuleProcessor.cpp
@@ +2960,5 @@
>    RuleCascadeData* c = mRuleCascades;
> +  if (!c) {
> +    // We might have an mPreviousCacheKey.  It already comes from a call
> +    // to CloneMQCacheKey, so don't bother checking
> +    // HasFeatureConditions().

By this do you mean that mPreviousCacheKey would definitely return true from HasFeatureConditions, since it's already gone through CloneMQCacheKey, and therefore we don't need to call it on mPreviousCacheKey to see whether we should return UniquePtr<nsMediaQueryResultCacheKey>() from here?  If so, is it worth asserting that HasFeatureConditions does return true?
Attachment #8543522 - Flags: review?(cam) → review+
(In reply to Cameron McCormack (:heycam) from comment #78)
> By this do you mean that mPreviousCacheKey would definitely return true from
> HasFeatureConditions, since it's already gone through CloneMQCacheKey, and
> therefore we don't need to call it on mPreviousCacheKey to see whether we
> should return UniquePtr<nsMediaQueryResultCacheKey>() from here?  If so, is
> it worth asserting that HasFeatureConditions does return true?

Added the assertion, although just as an NS_ASSERTION, since the only thing that depends on it is performance (and not seriously).


remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/f86a9dfe3be5
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/1e26b3a0731c
Keywords: leave-open
Can this be marked FIXED now?
David, do you think the two patches should also land in 36? It is still marked as affected. Thanks
(In reply to Mark Finkle (:mfinkle) from comment #81)
> Can this be marked FIXED now?

No; the original testcase still isn't fixed, although a number of problems that were leading to the breakage have been fixed.

(In reply to Sylvestre Ledru [:sylvestre] from comment #82)
> David, do you think the two patches should also land in 36? It is still
> marked as affected. Thanks

I don't think there's a lot of value, especially with the risk of landing things this late in the cycle.



Heycam did have a suggestion for how I could fix the remaining problem:  store the old rule cascade longer if it's rebuild for something other than a media feature value change.  I still need to get a chance to look in to this.
OK. Thanks, wontfix for 36 then.
Flags: needinfo?(dbaron)
David - It's been a month since the last comment in this bug. What's your plan for following up to resolve this bug? Should we apple the same logic as stated in comment 83 and mark this as wontfix for 37?
Flags: needinfo?(dbaron)
Yes.  I think this probably isn't the sort of regression that we need to track.  A change exposed an existing bug on an additional site, but there don't seem to have been other sites affected that we're aware of.

We should fix that existing bug, but it's not top priority.
Flags: needinfo?(dbaron)
Flags: needinfo?(dbaron)
For other thoughts on the issue in this bug, see bug 77999 comment 41 and the prior comments that it references.
tracking-fennec: 35+ → ?
tracking-fennec: ? → +
Status: REOPENED → RESOLVED
Closed: 9 years ago6 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.