Closed Bug 1155014 Opened 9 years ago Closed 5 years ago

PageMetaData returns stale data for pages using history.pushState

Categories

(Toolkit :: General, defect, P3)

defect

Tracking

()

RESOLVED INVALID

People

(Reporter: markh, Unassigned)

References

Details

Attachments

(2 files, 1 obsolete file)

See bug 1154638 for a STR, but the short story is that some pages manage to use history.pushState() to trick the browser into thinking it is on a different page than it actually is - ie, the majority of the DOM reflects a previous page.

PageMegaData.jsm will return incorrect data in this case (eg, in the dependent bugs, the metadata for the incorrect youtube/vimeo video is grabbed.

This impacts all consumers of this module, including social, loop and readinglist.

It's possible nsDocument::GetStateObject can tell us if that is the case, although what to do about it is less clear (eg, readinglist might decline to trust any metadata and fetch it in the background, but it's not clear social and loop will be able to live with that latency). Reloading the document would also work, but that's going to suck for users actually watching the video for example.
Seems like an important fix for a class of bugs, but since this already shipped and affects Hello/Social, feels like it can't be a P2.
Priority: -- → P3
In the youtube case, document.documentURI (which PageMetadata uses) is the original url, and window.location.href is the new url.  Would that always be consistent?  If so, we could do something like:

if documentURI == location.href then
  gather other page meta data
else
  only return href and title

In the second case, a service would be required to fetch in order to get anything beyond the url.

any [browser] history buff we could ask about the documentURI/location stuff?
Yahoo pages also share this issue, having two examples I think we could do this:

if windows.history.state then
  return window.location.href and title
else
  do the larger data sweep of the page and return that data

opinions?
Flags: needinfo?(markh)
Flags: needinfo?(dolske)
That doesn't work in my vimeo STR from bug 1154638. Specifically, after switching to a new video:

> document.documentURI
> "https://vimeo.com/channels/staffpicks/122437773"

> document.location.href
> "https://vimeo.com/channels/staffpicks/122437773"

So the above are showing the "new" URL as shown in the address bar.

> window.history.state
> null

window.history.state is null

> document.querySelector('head > link[rel="canonical"]').getAttribute("href")
"/118972727"

But the DOM still reflects the previous page (/118972727 is where I started, /122437773 is the new video I switched to)
Flags: needinfo?(markh)
Flags: needinfo?(dolske)
(In reply to Shane Caraveo (:mixedpuppy) from comment #3)
> if windows.history.state then
>   return window.location.href and title
> else
>   do the larger data sweep of the page and return that data

Also, assuming we can get the above condition right, ISTM the bigger problem is that "larger sweep" actually means - I believe it will mean a full load of the page, probably via something like the background thumbnail service. This is probably a fair bit of work and introduces enough latency that I wonder if it will still leave some of social effectively broken from the user's POV (ie, would we prevent the user from posting the page info to a provider until it came back? Or would we let them submit without that metadata?)

But I don't mean to sound like a downer - we need to do *something* :)
(In reply to Mark Hammond [:markh] from comment #5)
> (In reply to Shane Caraveo (:mixedpuppy) from comment #3)
> > if windows.history.state then
> >   return window.location.href and title
> > else
> >   do the larger data sweep of the page and return that data
> 
> Also, assuming we can get the above condition right, ISTM the bigger problem
> is that "larger sweep" actually means - I believe it will mean a full load
> of the page, probably via something like the background thumbnail service.

no, that's backwards.  The "larger sweep" is what happens now.  I'm suggesting we short cut that if state is not null, only returning items we know are in a valid state.  Most if not all services are doing their own stuff on the backend.
(In reply to Shane Caraveo (:mixedpuppy) from comment #6)
> no, that's backwards.  The "larger sweep" is what happens now.  I'm
> suggesting we short cut that if state is not null, only returning items we
> know are in a valid state.  Most if not all services are doing their own
> stuff on the backend.

Oh, right, sorry. For social that sounds reasonable if we can work out the condition to check for - ie, social would just decline to use PageMetaData.jsm when it detects that case.

OTOH though, if that's true, why not unconditionally return just the url and the title now?
(In reply to Mark Hammond [:markh] from comment #7)
> (In reply to Shane Caraveo (:mixedpuppy) from comment #6)
> > no, that's backwards.  The "larger sweep" is what happens now.  I'm
> > suggesting we short cut that if state is not null, only returning items we
> > know are in a valid state.  Most if not all services are doing their own
> > stuff on the backend.
> 
> Oh, right, sorry. For social that sounds reasonable if we can work out the
> condition to check for - ie, social would just decline to use
> PageMetaData.jsm when it detects that case.
> 
> OTOH though, if that's true, why not unconditionally return just the url and
> the title now?

There are other projects that will the metadata, just not the major social networks.
Was working on a patch that I thought would fix the problem, and discovered this may be a bigger issue than just PageMetadata.jsm.  The meta data scanning is done in content.js.  What I have found is, document.documentURI inside of content.js is the original url, whereas in chrome, gBrowser.currentURI is the correct new url, and in the web console document.documentURI has the currect url.

STR 

visit a youtube video:

1. https://www.youtube.com/watch?v=uAdDUD6L1nc
2. click on any other youtube video linked to in the page

- In the web console, document.documentURI will be correct
- In the browser console, gBrowser.currentURI will be correct
- In content.js, content.document.documentURI has the first url loaded 
   I had to add a log statement at the beginning of PageMetadata:GetPageData to verify this
   console.log("c.doc.documentURI: ",content.document.documentURI);


ISTM that content.document is not updated in frame scripts whereas other locations do get updated.

I tried this in a non-e10s window to verify it is not an e10s issue.
Attached file browser_pushstate.js (obsolete) —
This test fails, illustrating the problem (runs in browser/base/content/test/general)
Attached patch pushstateSplinter Review
a full patch is probably easier to try
Attachment #8616264 - Attachment is obsolete: true
Attached patch fix pagedata.urlSplinter Review
Assuming that the problem illustrated by the previous patch is addressed, this patch would fix the rest of the problem in PageMetadata.jsm by making "url" always be documentURI rather than overwriting with some metadata.  Note that og:url does get saved, the switch statement there is a special case handler.
Attachment #8616270 - Flags: feedback?(markh)
I don't understand how this will help with things extracted from things other than the document URL (eg, for things we walk the result of querySelector etc)
Nomming this because it looks like the page URI's can get out of sync between the parent process and content scripts.

mixedpuppy has a failing testcase in attachment 8616266 [details] [diff] [review].
tracking-e10s: --- → ?
De-nomming - I assumed this was only reproducible with e10s, and that's not the case.
tracking-e10s: ? → ---
Hey smaug, any idea where we might be going out of sync here? See the testcase in attachment 8616266 [details] [diff] [review] - the last is will fail.
Flags: needinfo?(bugs)
Hey Mike/Smaug,
 To be clear - the described behaviour for most of this bug isn't actually a bug - it's a totally expected side-effect of the pages using the history API. This bug was really about the fact that we have a module used to scrape page metadata, but that module doesn't take the possibility of pages using the history API into account and thus will return metadata for a completely different page.

Comment 9 is something that could do with an explanation, but that alone isn't enough to "fix" this bug.  I'm not even sure we *can* fix this bug other than to detect when a page is in a strange state due to use of the history API and refuse to return stale data. When Social is a consumer, not returning stale data might be good enough (as the rest of the metadata can be considered optional). However, for other consumers (such as the defunct readinglist integration), such partial data may not have been good enough and we may have been forced to perform a background load of the URL to fetch accurate data.

In short, I think we need a separate bug for the issue described in comment 9.
it is not now clear to me if ni? is still valid.
Flags: needinfo?(bugs)
Smaug, ni for the problem reported in comment 9 would be valuable.
Flags: needinfo?(bugs)
Is there some minimal example of the issue? It isn't clear to me what 
PageMetadata:GetPageData does, or when it is triggered - in particular, when does the parent send the relevant message?
It is obviously called async, so that adds one error point.
Flags: needinfo?(bugs)
Shane, I think the best way forward is probably to open a new bug that's unrelated to PageMetaData.jsm but which demonstrates comment 9 (and which still blocks this bug). You might be able to use attachment 8592688 [details] as a template for a content script you can execute directly from a scratchpad to demonstrate the problem without requiring changes to any source files.
Flags: needinfo?(mixedpuppy)
(In reply to Olli Pettay [:smaug] (for generic DOM reviews, you may want to look for other reviewers too ;)) from comment #20)
> Is there some minimal example of the issue? It isn't clear to me what 
> PageMetadata:GetPageData does, or when it is triggered - in particular, when
> does the parent send the relevant message?
> It is obviously called async, so that adds one error point.

The stuff that we wanted your input on was actually the minimal testcase in attachment 8616266 [details] [diff] [review].

The last "is" is failing - it appears that in content scripts, the content.document.documentURI.spec does not match the gBrowser.currentURI.spec after changing the URI via content.history.pushState.
Flags: needinfo?(bugs)
Depends on: 1173215
I've created bug 1173215 for the framescript problem
Flags: needinfo?(mixedpuppy)
(gBrowser.currentURI refers to the URI docshell knows about and content.document.documentURI is about, well, the document URI.)

Anyhow, what is content.document.documentURI right after content.history.pushState({}, "2", "2.html"); ?

Does the testcase lead us to start loading "http://example.com", but before the load is ready,
pushState() is called, and then the page loading finishes, and document.documentURI is set back to
"http://example.com"

(I don't know how BrowserTestUtils.openNewForegroundTab behaves.)
Flags: needinfo?(bugs)
a web page with the following code works as expected showing code inside the web page has the correct url.  (note this cannot run as a file: url)

window.onload = function() {
    alert(document.documentURI); // original url
    history.pushState({}, "2", "2.html");
    alert(document.documentURI); // new url from pushState
}
(In reply to Mark Hammond [:markh] from comment #13)
> I don't understand how this will help with things extracted from things
> other than the document URL (eg, for things we walk the result of
> querySelector etc)

Now that bug 1173215 has moved forward, want to revisit this.  

There's a lot of *things* to unwrap in that comment. :)

The problem is that we don't have a solid way to know, even if history state exists, what the correct thing to do here is.  Even without history state, og:url, canonicalUrl and documentURI could all be different, so I cannot rely on comparing any of those.  With history state, we don't know if they updated those dom elements, and I don't know there is something I can look at to know the url changed.  If there were a solid reliable way to know how to detect incorrect information, I would skip the scan.  Otherwise, this is the smallest change necessary to properly use documentURI in the case where pushState has changed the url.

Input welcome, want to land a fix this week so I can uplift to aurora (assuming bug 1173215 gets uplifted)
Depends on: 1175306
Comment on attachment 8616270 [details] [diff] [review]
fix pagedata.url

The basic fix (getting correct url) is being handled in bug 1175306
Attachment #8616270 - Flags: feedback?(markh)
Blocks: 1141782

Removed entirely in bug 1503674.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: