Bugzilla

Comment 2

•

9 years ago

In the youtube case, document.documentURI (which PageMetadata uses) is the original url, and window.location.href is the new url.  Would that always be consistent?  If so, we could do something like:

if documentURI == location.href then
  gather other page meta data
else
  only return href and title

In the second case, a service would be required to fetch in order to get anything beyond the url.

any [browser] history buff we could ask about the documentURI/location stuff?

Comment 3

•

9 years ago

Yahoo pages also share this issue, having two examples I think we could do this:

if windows.history.state then
  return window.location.href and title
else
  do the larger data sweep of the page and return that data

opinions?

Flags: needinfo?(markh)

Flags: needinfo?(dolske)

Reporter

Comment 4

•

9 years ago

That doesn't work in my vimeo STR from bug 1154638. Specifically, after switching to a new video:

> document.documentURI
> "https://vimeo.com/channels/staffpicks/122437773"

> document.location.href
> "https://vimeo.com/channels/staffpicks/122437773"

So the above are showing the "new" URL as shown in the address bar.

> window.history.state
> null

window.history.state is null

> document.querySelector('head > link[rel="canonical"]').getAttribute("href")
"/118972727"

But the DOM still reflects the previous page (/118972727 is where I started, /122437773 is the new video I switched to)

Flags: needinfo?(markh)

Flags: needinfo?(dolske)

Reporter

Comment 5

•

9 years ago

(In reply to Shane Caraveo (:mixedpuppy) from comment #3)
> if windows.history.state then
>   return window.location.href and title
> else
>   do the larger data sweep of the page and return that data

Also, assuming we can get the above condition right, ISTM the bigger problem is that "larger sweep" actually means - I believe it will mean a full load of the page, probably via something like the background thumbnail service. This is probably a fair bit of work and introduces enough latency that I wonder if it will still leave some of social effectively broken from the user's POV (ie, would we prevent the user from posting the page info to a provider until it came back? Or would we let them submit without that metadata?)

But I don't mean to sound like a downer - we need to do *something* :)

Comment 6

•

9 years ago

(In reply to Mark Hammond [:markh] from comment #5)
> (In reply to Shane Caraveo (:mixedpuppy) from comment #3)
> > if windows.history.state then
> >   return window.location.href and title
> > else
> >   do the larger data sweep of the page and return that data
> 
> Also, assuming we can get the above condition right, ISTM the bigger problem
> is that "larger sweep" actually means - I believe it will mean a full load
> of the page, probably via something like the background thumbnail service.

no, that's backwards.  The "larger sweep" is what happens now.  I'm suggesting we short cut that if state is not null, only returning items we know are in a valid state.  Most if not all services are doing their own stuff on the backend.

Reporter

Comment 7

•

9 years ago

(In reply to Shane Caraveo (:mixedpuppy) from comment #6)
> no, that's backwards.  The "larger sweep" is what happens now.  I'm
> suggesting we short cut that if state is not null, only returning items we
> know are in a valid state.  Most if not all services are doing their own
> stuff on the backend.

Oh, right, sorry. For social that sounds reasonable if we can work out the condition to check for - ie, social would just decline to use PageMetaData.jsm when it detects that case.

OTOH though, if that's true, why not unconditionally return just the url and the title now?

Comment 8

•

9 years ago

(In reply to Mark Hammond [:markh] from comment #7)
> (In reply to Shane Caraveo (:mixedpuppy) from comment #6)
> > no, that's backwards.  The "larger sweep" is what happens now.  I'm
> > suggesting we short cut that if state is not null, only returning items we
> > know are in a valid state.  Most if not all services are doing their own
> > stuff on the backend.
> 
> Oh, right, sorry. For social that sounds reasonable if we can work out the
> condition to check for - ie, social would just decline to use
> PageMetaData.jsm when it detects that case.
> 
> OTOH though, if that's true, why not unconditionally return just the url and
> the title now?

There are other projects that will the metadata, just not the major social networks.

Comment 9

•

9 years ago

Was working on a patch that I thought would fix the problem, and discovered this may be a bigger issue than just PageMetadata.jsm.  The meta data scanning is done in content.js.  What I have found is, document.documentURI inside of content.js is the original url, whereas in chrome, gBrowser.currentURI is the correct new url, and in the web console document.documentURI has the currect url.

STR 

visit a youtube video:

1. https://www.youtube.com/watch?v=uAdDUD6L1nc
2. click on any other youtube video linked to in the page

- In the web console, document.documentURI will be correct
- In the browser console, gBrowser.currentURI will be correct
- In content.js, content.document.documentURI has the first url loaded 
   I had to add a log statement at the beginning of PageMetadata:GetPageData to verify this
   console.log("c.doc.documentURI: ",content.document.documentURI);


ISTM that content.document is not updated in frame scripts whereas other locations do get updated.

I tried this in a non-e10s window to verify it is not an e10s issue.

Comment 10

•

9 years ago

Attached file browser_pushstate.js (obsolete) — Details

This test fails, illustrating the problem (runs in browser/base/content/test/general)

Comment 11

•

9 years ago

Attached patch pushstate — Details — Splinter Review

a full patch is probably easier to try

Attachment #8616264 - Attachment is obsolete: true

Comment 12

•

9 years ago

Attached patch fix pagedata.url — Details — Splinter Review

Assuming that the problem illustrated by the previous patch is addressed, this patch would fix the rest of the problem in PageMetadata.jsm by making "url" always be documentURI rather than overwriting with some metadata.  Note that og:url does get saved, the switch statement there is a special case handler.

Attachment #8616270 - Flags: feedback?(markh)

Reporter

Comment 13

•

9 years ago

I don't understand how this will help with things extracted from things other than the document URL (eg, for things we walk the result of querySelector etc)

Comment 14

•

9 years ago

Nomming this because it looks like the page URI's can get out of sync between the parent process and content scripts.

mixedpuppy has a failing testcase in attachment 8616266 [details] [diff] [review].

tracking-e10s: --- → ?

Comment 15

•

9 years ago

De-nomming - I assumed this was only reproducible with e10s, and that's not the case.

tracking-e10s: ? → ---

Comment 16

•

9 years ago

Hey smaug, any idea where we might be going out of sync here? See the testcase in attachment 8616266 [details] [diff] [review] - the last is will fail.

Flags: needinfo?(bugs)

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 17

•

9 years ago

Hey Mike/Smaug,
 To be clear - the described behaviour for most of this bug isn't actually a bug - it's a totally expected side-effect of the pages using the history API. This bug was really about the fact that we have a module used to scrape page metadata, but that module doesn't take the possibility of pages using the history API into account and thus will return metadata for a completely different page.

Comment 9 is something that could do with an explanation, but that alone isn't enough to "fix" this bug.  I'm not even sure we *can* fix this bug other than to detect when a page is in a strange state due to use of the history API and refuse to return stale data. When Social is a consumer, not returning stale data might be good enough (as the rest of the metadata can be considered optional). However, for other consumers (such as the defunct readinglist integration), such partial data may not have been good enough and we may have been forced to perform a background load of the URL to fetch accurate data.

In short, I think we need a separate bug for the issue described in comment 9.

Comment 18

•

9 years ago

it is not now clear to me if ni? is still valid.

Flags: needinfo?(bugs)

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 19

•

9 years ago

Smaug, ni for the problem reported in comment 9 would be valuable.

Flags: needinfo?(bugs)

Comment 20

•

9 years ago

Is there some minimal example of the issue? It isn't clear to me what 
PageMetadata:GetPageData does, or when it is triggered - in particular, when does the parent send the relevant message?
It is obviously called async, so that adds one error point.

Flags: needinfo?(bugs)

Reporter

Comment 21

•

9 years ago

Shane, I think the best way forward is probably to open a new bug that's unrelated to PageMetaData.jsm but which demonstrates comment 9 (and which still blocks this bug). You might be able to use attachment 8592688 [details] as a template for a content script you can execute directly from a scratchpad to demonstrate the problem without requiring changes to any source files.

Flags: needinfo?(mixedpuppy)

Comment 22

•

9 years ago

(In reply to Olli Pettay [:smaug] (for generic DOM reviews, you may want to look for other reviewers too ;)) from comment #20)
> Is there some minimal example of the issue? It isn't clear to me what 
> PageMetadata:GetPageData does, or when it is triggered - in particular, when
> does the parent send the relevant message?
> It is obviously called async, so that adds one error point.

The stuff that we wanted your input on was actually the minimal testcase in attachment 8616266 [details] [diff] [review].

The last "is" is failing - it appears that in content scripts, the content.document.documentURI.spec does not match the gBrowser.currentURI.spec after changing the URI via content.history.pushState.

Flags: needinfo?(bugs)

Updated

•

9 years ago

Depends on: 1173215

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 23

•

9 years ago

I've created bug 1173215 for the framescript problem

Flags: needinfo?(mixedpuppy)

Comment 24

•

9 years ago

(gBrowser.currentURI refers to the URI docshell knows about and content.document.documentURI is about, well, the document URI.)

Anyhow, what is content.document.documentURI right after content.history.pushState({}, "2", "2.html"); ?

Does the testcase lead us to start loading "http://example.com", but before the load is ready,
pushState() is called, and then the page loading finishes, and document.documentURI is set back to
"http://example.com"

(I don't know how BrowserTestUtils.openNewForegroundTab behaves.)

Flags: needinfo?(bugs)

Comment 25

•

9 years ago

a web page with the following code works as expected showing code inside the web page has the correct url.  (note this cannot run as a file: url)

window.onload = function() {
    alert(document.documentURI); // original url
    history.pushState({}, "2", "2.html");
    alert(document.documentURI); // new url from pushState
}

Comment 26

•

9 years ago

(In reply to Mark Hammond [:markh] from comment #13)
> I don't understand how this will help with things extracted from things
> other than the document URL (eg, for things we walk the result of
> querySelector etc)

Now that bug 1173215 has moved forward, want to revisit this.  

There's a lot of *things* to unwrap in that comment. :)

The problem is that we don't have a solid way to know, even if history state exists, what the correct thing to do here is.  Even without history state, og:url, canonicalUrl and documentURI could all be different, so I cannot rely on comparing any of those.  With history state, we don't know if they updated those dom elements, and I don't know there is something I can look at to know the url changed.  If there were a solid reliable way to know how to detect incorrect information, I would skip the scan.  Otherwise, this is the smallest change necessary to properly use documentURI in the case where pushState has changed the url.

Input welcome, want to land a fix this week so I can uplift to aurora (assuming bug 1173215 gets uplifted)

Updated

•

9 years ago

Depends on: 1175306