Closed Bug 1241634 Opened 8 years ago Closed 8 years ago

Intermittent test_frameNavigation.html | navigating to insecure grandchild iframe blocked on insecure page

Categories

(Core :: DOM: Security, defect)

46 Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
firefox45 --- disabled
firefox46 --- disabled
firefox47 --- disabled
firefox48 --- disabled
firefox-esr45 --- disabled

People

(Reporter: KWierso, Assigned: jhao, Mentored)

References

Details

(Keywords: intermittent-failure, Whiteboard: [rr-chaos][test disabled on OSX and Windows] [tpe-seceng], [domsecurity-intermittent])

Attachments

(3 files, 6 obsolete files)

Possibly only on Windows PGO, or at least, that's all I've seen this on so far.

https://treeherder.mozilla.org/logviewer.html#?job_id=6747940&repo=fx-team
I believe Kate was looking at this test recently.
Flags: needinfo?(kmckinley)
Can you try this with the patch from 1213633 applied? It doesn't appear to be included in either version that was pushed to treeherder.
Flags: needinfo?(kmckinley) → needinfo?(wkocher)
This was filed after bug 1213633 landed (i.e. given the timing, I'm pretty sure it's a regression from that). Also, these are failures in our CI infrastructure, not local test failures, so testing with a patch applied isn't something that can be done except via Try pushes.
Flags: needinfo?(wkocher)
Assignee: nobody → kmckinley
Status: NEW → ASSIGNED
Comment on attachment 8711538 [details] [diff] [review]
Text should match the expected string

Hi Kate,

I'm not sure what text you are trying to match here.
Attachment #8711538 - Flags: review?(tanvi)
(In reply to Tanvi Vyas [:tanvi] from comment #9)
> Comment on attachment 8711538 [details] [diff] [review]
> Text should match the expected string
> 
> Hi Kate,
> 
> I'm not sure what text you are trying to match here.

See https://dxr.mozilla.org/mozilla-central/source/dom/security/test/mixedcontentblocker/test_frameNavigation.html?case=true&from=test_frameNavigation.html#98

It looks like the message is expecting "navigated", but is being given "navigating".
Hi Kate,

I'm not seeing the issue here.  The message is expecting "navigated to insecure grandchild iframe on insecure page".  But the message changed in the patch is for "navigated/navigating to insecure grandchild iframe BLOCKED on insecure page".

1) https://dxr.mozilla.org/mozilla-central/source/dom/security/test/mixedcontentblocker/test_frameNavigation.html?case=true&from=test_frameNavigation.html#98

ok((event.data.msg == "navigated to insecure grandchild iframe on insecure page"), "navigating to insecure grandchild iframe blocked on insecure page");

This code looks for the msg "navigated to insecure grandchild iframe on insecure page".  If it doesn't find that message (i.e. the navigation fails), it fails with output "navigating to insecure grandchild iframe blocked on insecure page".

2) https://dxr.mozilla.org/mozilla-central/source/dom/security/test/mixedcontentblocker/file_frameNavigation_innermost.html#28

This code is called when the navigation of the insecure grandchild iframe succeeds on the insecure page.  It sends a post message msg "navigated to insecure grandchild iframe on insecure page" which matches the string in the ok() method above.  This will cause the test to succeed.

3) https://dxr.mozilla.org/mozilla-central/source/dom/security/test/mixedcontentblocker/file_frameNavigation_grandchild.html?from=file_frameNavigation_grandchild.html#46

The code here is called when navigation is blocked.  It sends a post message "navigating to insecure grandchild iframe blocked on insecure page", which doesn't match the msg in the ok() method above, and hence we get a failure with output taken from the second parameter of ok().  That output happens to match what was sent in the post message, but doesn't need to match it since we don't need to compare the second parameter of ok().



The test should succeed and we should go through the codepath in 2), since we should be able to navigate from an HTTPS location to an HTTP location on a frame inside an insecure/HTTP document.
Attachment #8711538 - Attachment is obsolete: true
Attachment #8716834 - Flags: review?(tanvi)
Comment on attachment 8716834 [details] [diff] [review]
Increase timeout for test to prevent false-positives

Since the test is synchronous, lets increase the max count instead of the timeout interval.
Attachment #8716834 - Flags: review?(tanvi) → review-
Attachment #8716834 - Attachment is obsolete: true
Attachment #8717262 - Flags: review?(tanvi)
Comment on attachment 8717262 [details] [diff] [review]
Increase timeout for test to prevent false-positives

   // For tests that require setTimeout, set the maximum polling time to 50 x 100ms = 5 seconds.
-  var MAX_COUNT = 50;
+  var MAX_COUNT = 100;
Please also update the comment.
Attachment #8717262 - Flags: review?(tanvi) → review+
https://hg.mozilla.org/mozilla-central/rev/86c439c29ae3
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Alas, doesn't seem to be actually fixed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I caught this in rr today, will look at it tomorrow.
Whiteboard: [rr-chaos]
What I caught in rr appears to be the networking threads being starved past the (updated) timeout, so idk that it's that helpful.
This missed the 45 boat (though hopefully something can still be done for esr45 given the high frequency).
This bug isn't going anywhere. We should skip the test if nobody has time to look into the failures.
Flags: needinfo?(kmckinley)
Disabled for now as it's a highly-frequent failure. Happy to re-enable it once the underlying issue is fixed.
https://hg.mozilla.org/integration/mozilla-inbound/rev/1b13d13b8cc5
Flags: needinfo?(kmckinley)
Whiteboard: [rr-chaos] → [rr-chaos][test disabled on OSX and Windows]
Target Milestone: mozilla47 → ---
Assign Jonathan to take this bug.
Assignee: kmckinley → jhao
Mentor: kmckinley
Whiteboard: [rr-chaos][test disabled on OSX and Windows] → [rr-chaos][test disabled on OSX and Windows] [tpe-seceng]
I tried to increase the timeout on my Macbook Pro. The reproduce rate did drop at first, but never fell to zero, no matter how long the timeout was. I'm still trying to figure out what's wrong.
Per https://bugzilla.mozilla.org/show_bug.cgi?id=1241634#c33 the problem doesn't seem to be the test itself, but a networking threat that is paused. It might be best to use rr-chaos to to find a reproduction, and see what the thread is waiting on.
Whiteboard: [rr-chaos][test disabled on OSX and Windows] [tpe-seceng] → [rr-chaos][test disabled on OSX and Windows] [tpe-seceng], [domsecurity-intermittent]
This is not even a WIP patch. I just use this to comment out other tests and narrow down possible causes.
According to Tanvi's comment (https://bugzilla.mozilla.org/show_bug.cgi?id=840388#c42), this test does the following things:

> 1) Navigate a child https page to an http page when it is embedded in an http page.
> 2) Navigate a grandchild https page to an http page when it is embedded in http pages.
> 3) Navigate a child https page to an http page when it is embedded in an https page.
> 4) Navigate a grandchild https page to an http page when it is embedded in an https parent and has an https grandparent.

> These 4 tests are run with security.mixed_content.block_active_content set to true, and then run again when the pref is set to false.

When the test fails, it always happens in case 2) in the second round when the security.mixed_content.block_active_content is set to false.

I reproduced it with rr chaos mode, but still have no clue where to start looking for bugs. So, I tried to narrow down the problem by commenting out 1), 3), and 4) and only run once with security.mixed_content.block_active_content is false, as in the patch above. Then, I run the modified test on my MacBook Pro.

Here are some observations on the outcome:
1. If I run this test many times individually (not by --run-until-failure), I can't reproduce the failure.
2. If I run this test by "--run-until-failure", I can always reproduce the failure on either the second, third, fourth tries, but never on the first try.

Note that the preference "security.mixed_content.block_active_content" is turned off the whole time, and the test still fails, so the bug probably has little to do with the mixed content blocking code.

I think that the reason that the test fails is because we reuse the same iframe more than once, which could explain the above observations and why the failure always happens in the second round. Nonetheless, I still have no idea why reusing a grandchild iframe will make it blocked, which is what I will try to find out later.
I tried using two different iframes and also tried creating and deleting the iframe in each tests, but still can't work around the problem.

I'll have to look deeper to find out what's wrong.
I discussed this with Henry, Ethan, and Xidorn. It's still unclear why the failure happens, but we found a way to avoid it, by setting a 0-millisecond timeout before clicking.

Hi Kate,

Do you think this fix is enough, for now, to resolve this bug, so that we can re-enable this test? We can file a follow-up bug about the real issue behind.
Attachment #8734232 - Flags: feedback?(kmckinley)
(In reply to Jonathan Hao [:jhao] from comment #62)
> Created attachment 8734232 [details] [diff] [review]
> Set timeout before clicking in mixcontentblocker test
> 
> I discussed this with Henry, Ethan, and Xidorn. It's still unclear why the
> failure happens, but we found a way to avoid it, by setting a 0-millisecond
> timeout before clicking.
> 
> Hi Kate,
> 
> Do you think this fix is enough, for now, to resolve this bug, so that we
> can re-enable this test? We can file a follow-up bug about the real issue
> behind.

Strange.  How does this avoid the failure?  

If try looks good then we can use this to re-enable the test, and file a followup to continue digging for the real issue.
I forgot to enable the test in the earlier try push. Trying again now.
Attachment #8734232 - Attachment is obsolete: true
Attachment #8734232 - Flags: feedback?(kmckinley)
Comment on attachment 8734292 [details] [diff] [review]
Set timeout before clicking in mixcontentblocker test

(In reply to Tanvi Vyas - please needinfo [:tanvi] from comment #64)
> Strange.  How does this avoid the failure?  
> 
> If try looks good then we can use this to re-enable the test, and file a
> followup to continue digging for the real issue.

Thanks, Tanvi.

Try does look good. May I ask you or Kate to review this patch?
Attachment #8734292 - Flags: review?(tanvi)
Attachment #8734292 - Flags: review?(kmckinley)
(In reply to Tanvi Vyas - please needinfo [:tanvi] from comment #64)
> > I discussed this with Henry, Ethan, and Xidorn. It's still unclear why the
> > failure happens, but we found a way to avoid it, by setting a 0-millisecond
> > timeout before clicking.
> Strange.  How does this avoid the failure?  

It's good to hear we found a way to circumvent the test failure.

But, Jonathan, can you explain more detail on why this work-around works?
It will be helpful if you add comments in the code as well.
Flags: needinfo?(jhao)
I'm not sure right now why it works. I guess there's something wrong with setting innerHTML.

http://stackoverflow.com/questions/16776606/innerhtml-cant-be-trusted-does-not-always-execute-synchronously

Perhaps I should try writing the test with appendChild, which may seem less hacky than using setTimeout.
Flags: needinfo?(jhao)
See Also: → 1259715
I tried using appendChild instead of innerHTML. The results are the same.
Added comments.

MozReview-Commit-ID: 5rbeuVjaw0B
Attachment #8734696 - Flags: review?(tanvi)
Attachment #8734696 - Flags: review?(kmckinley)
Attachment #8734292 - Attachment is obsolete: true
Attachment #8734292 - Flags: review?(tanvi)
Attachment #8734292 - Flags: review?(kmckinley)
Comment on attachment 8734696 [details] [diff] [review]
Set 0-ms timeout before clicking in mixedcontentblocker test

Review of attachment 8734696 [details] [diff] [review]:
-----------------------------------------------------------------

I wonder why we don't see intermittent failure for the other link clicks in this test.

Can you reduce the timeout back to 5 seconds?
Attachment #8734696 - Flags: review?(kmckinley) → review+
Ehsan, do you have any thoughts about comment 62 and onwards about what might be going on here that a setTimeout is wallpapering over?
Flags: needinfo?(ehsan)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #73)
> Ehsan, do you have any thoughts about comment 62 and onwards about what
> might be going on here that a setTimeout is wallpapering over?

I don't know, because I haven't debugged this, but it should be very easy to debug and see what's going on by adding printf statements in places such as nsContentUtils::TriggerLink() and nsDocShell::OnLinkClick() to see what's causing an early bailout.  This has a great chance of being an actual bug, so I don't think we should work around it as this patch does without getting to the bottom of it.
Flags: needinfo?(ehsan)
Comment on attachment 8734696 [details] [diff] [review]
Set 0-ms timeout before clicking in mixedcontentblocker test

r- based on comment 74.
Attachment #8734696 - Flags: review?(tanvi) → review-
(In reply to :Ehsan Akhgari from comment #74)
> (In reply to Ryan VanderMeulen [:RyanVM] from comment #73)
> > Ehsan, do you have any thoughts about comment 62 and onwards about what
> > might be going on here that a setTimeout is wallpapering over?
> 
> I don't know, because I haven't debugged this, but it should be very easy to
> debug and see what's going on by adding printf statements in places such as
> nsContentUtils::TriggerLink() and nsDocShell::OnLinkClick() to see what's
> causing an early bailout.  This has a great chance of being an actual bug,
> so I don't think we should work around it as this patch does without getting
> to the bottom of it.

Thanks, Ehsan. I added logs in TriggerLink() and OnLinkClick() as you suggested. I found that when failure happens, the TriggerLink() is not called at all. It seems the bug occurs at an earlier stage.
Henry and I investigate further. We found that when the click isn't handled, CheckHandleEventForAnchorsPreconditions() returns false because mPresContext is null[1]. The reason is that in nsGenericHTMLElement::Click(), the doc->GetShell() returns null[2]. Sometimes the initialization of mPresShell happens after we click.

We asked our colleagues in the layout team. They told us to add a getBoundingClientRect() call before clicking, and it works, too. They said maybe the click fails because the document hasn't been reflowed.

Hi, Ehsan. Do you think it's a bug? Is it OK to accept that a reflow is necessary?

Thanks.

[1] https://dxr.mozilla.org/mozilla-central/source/dom/html/nsGenericHTMLElement.cpp#658
[2] https://dxr.mozilla.org/mozilla-central/source/dom/html/nsGenericHTMLElement.cpp#652
Flags: needinfo?(ehsan)
(In reply to Jonathan Hao [:jhao] from comment #78)
> Henry and I investigate further. We found that when the click isn't handled,
> CheckHandleEventForAnchorsPreconditions() returns false because mPresContext
> is null[1]. The reason is that in nsGenericHTMLElement::Click(), the
> doc->GetShell() returns null[2]. Sometimes the initialization of mPresShell
> happens after we click.

OK, that makes sense.  Thanks for the investigation.

> We asked our colleagues in the layout team. They told us to add a
> getBoundingClientRect() call before clicking, and it works, too. They said
> maybe the click fails because the document hasn't been reflowed.
> 
> Hi, Ehsan. Do you think it's a bug? Is it OK to accept that a reflow is
> necessary?

Hmm, that would make the test work, but I think Click() can still be broken here since we're using stale layout information.  And this is a bug that a web page can also run into, so I think it's better to consider flushing inside Click() before doing anything else.  But I'd like to get a second opinion from Boris.

Boris, do you think that makes sense?
Flags: needinfo?(ehsan) → needinfo?(bzbarsky)
Flushing in Click() to get updated layout info doesn't make too much: if event listeners then go and rearrange things you can have "stale" layout information anyway.

If we're not going to move ownership of the ESM over to the document (such that CheckHandleEventForAnchorsPreconditions would have no prescontext dependency; see also discussion in bug 919297), then it might make sense to FlushFrames in Click().  Olli, do you see a problem with that?
Flags: needinfo?(bzbarsky) → needinfo?(bugs)
Yes, we should move ESM to be bound to documents which have browsing context or something and not depend on layout.

But before that we could flush frames in Click() yes. Though, that is a hack, given that
dispatchEvent(new MouseEvent("click")); is supposed to have the same behavior.

I wonder if we just want to fix the test for now, use requestAnimationFrame + setTimeout( ..., 0)
to trigger click();
Flags: needinfo?(bugs)
(In reply to Olli Pettay [:smaug] from comment #81)
> Yes, we should move ESM to be bound to documents which have browsing context
> or something and not depend on layout.
> 
> But before that we could flush frames in Click() yes. Though, that is a
> hack, given that
> dispatchEvent(new MouseEvent("click")); is supposed to have the same
> behavior.
> 
> I wonder if we just want to fix the test for now, use requestAnimationFrame
> + setTimeout( ..., 0)
> to trigger click();

Thanks, Olli. I want to make sure I get what you mean because I'm not very familiar with requestAnimationFrame. Do you mean something like this?

windows.requestAnimationFrame(() => setTimeout(() => link.click(), 0))
Flags: needinfo?(bugs)
Ehsan, given the above information by Boris and Olli, do you think it's OK that we fix the test for now?
Flags: needinfo?(ehsan)
I mean requestAnimationFrame(function() { setTimeout(function() {link.click() } , 0); });
...which I guess is the same what you said :)

frames and layout are flushed right after animation frame callbacks have been called, so calling
click() asap after that is close to best we can do here. (and no need to explicitly force a layout flush).

(I wish the platform provided better APIs here, something like HTML spec bug https://www.w3.org/Bugs/Public/show_bug.cgi?id=28644)
Flags: needinfo?(bugs)
(In reply to Jonathan Hao [:jhao] from comment #83)
> Ehsan, given the above information by Boris and Olli, do you think it's OK
> that we fix the test for now?

No objections from me!
Flags: needinfo?(ehsan)
Thanks to Ehsan, Boris, and Olli. Then I'll fix the test as Olli suggested.

Hi Kate, since Tanvi's on PTO, would you review again for me? Thank you.
Attachment #8734696 - Attachment is obsolete: true
Attachment #8738382 - Flags: review?(kmckinley)
Comment on attachment 8738382 [details]
MozReview Request: Bug 1241634 - Reflow before clicking in mixedcontentblocker test r?kmckinley

https://reviewboard.mozilla.org/r/44479/#review41751

Looks good
Comment on attachment 8738382 [details]
MozReview Request: Bug 1241634 - Reflow before clicking in mixedcontentblocker test r?kmckinley

https://reviewboard.mozilla.org/r/44479/#review41753
Attachment #8738382 - Flags: review+
Keywords: checkin-needed
Thanks, Kate :)
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: