Closed Bug 1337783 Opened 4 years ago Closed 7 months ago

Assertion failure: !mTrackingScripts.Contains(aURL), at nsDocument.cpp:3035

Categories

(Core :: DOM: Core & HTML, defect, P2)

defect

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: cbook, Unassigned)

References

()

Details

(Keywords: assertion)

Crash Data

Attachments

(2 files)

found via bughunter and reproduce with latest m-c tinderbox debug build on windows 7 

Steps to reproduce:
-> Load 	https://www.privesc.eu/
--> Assertion failure 

Assertion failure: !mTrackingScripts.Contains(aURL), at c:/builds/moz2_slave/m-cen-w32-d-000000000000000000/build/src/dom/base/nsDocument.cpp:3035
This was added in bug 1321868.
Blocks: 1321868
Flags: needinfo?(ehsan)
Does this happen reliably?  Any chance you could elaborate how you reproduce this?  I tried to reproduce unsuccessfully.
Flags: needinfo?(ehsan) → needinfo?(cbook)
working on this and will report back
Flags: needinfo?(cbook)
(In reply to Carsten Book [:Tomcat] from comment #3)
> working on this and will report back

Please let us know when you've had a chance to reproduce. Thanks.
Flags: needinfo?(cbook)
fyi, I submitted this url to bughunter twice and could not reproduce. I also couldn't find this in production bughunter at all. Tomcat: was this on your local instance?
yeah so far didn't crash anymore, closing as incomplete!
Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(cbook)
Resolution: --- → INCOMPLETE
Nightly 57 from today, Fedora 26:
https://kobieta.onet.pl/brigitte-macron-w-cygaretkach-oraz-bluzce-z-odslonietymi-ramionami-podczas-oficjalnej/10gjvy3

The first time I loaded the url, it took a while to assert. The second time I had to reload several times. Bughunter has 57 urls where this has been reproduced in the last month.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Attached file Debug Log
This is from my local build which reproduced the assertion after several reloads.
Looks like we have a URL here, Ehsan.
Flags: needinfo?(ehsan)
Priority: -- → P2
I tried loading the URL in comment 7 tens of times without any chance to reproduce the crash on trunk.

Bob, you mentioned there are 57 URLs that trigger this assertion.  Any chance you can post them here please?  Thanks!
Flags: needinfo?(ehsan) → needinfo?(bob)
I would have to screen them first for content or pii. I assume that you are primarily looking for a reproducible test case. I retested the urls that produced this assertion going back to 7/19 and only had two instances where it reproduced. I am beginning to think this might be ad related which would account for its lack of reproducibility. 

One of the urls is the one in this bug. https://kobieta.onet.pl/brigitte-macron-w-cygaretkach-oraz-bluzce-z-odslonietymi-ramionami-podczas-oficjalnej/10gjvy3 reproduced on Windows 7 32bit. I manually reproduced it there a couple of times in quick succession but have been unable to do so again.

The other url https://career.aglasem.com/ssc-cpo-si-asi-admit-card/ was also reproduced on Windows 7 32bit. I have managed to reproduce it 3 times on Fedora 26 x86_64 with a fresh build from this afternoon. I reloaded the page if it did not reproduce on the first load.

If you really want the full list, just NI me again and I'll screen the urls.
Flags: needinfo?(bob)
(In reply to Bob Clary [:bc:] from comment #11)
> I would have to screen them first for content or pii. I assume that you are
> primarily looking for a reproducible test case. I retested the urls that
> produced this assertion going back to 7/19 and only had two instances where
> it reproduced. I am beginning to think this might be ad related which would
> account for its lack of reproducibility. 

Yes, I'm only interested in reproducible cases indeed.

And this is definitely going to depend on tracking scripts typically used by ads, so by definition we're gonna have a hard time reproducing it sadly...

> One of the urls is the one in this bug.
> https://kobieta.onet.pl/brigitte-macron-w-cygaretkach-oraz-bluzce-z-
> odslonietymi-ramionami-podczas-oficjalnej/10gjvy3 reproduced on Windows 7
> 32bit. I manually reproduced it there a couple of times in quick succession
> but have been unable to do so again.
> 
> The other url https://career.aglasem.com/ssc-cpo-si-asi-admit-card/ was also
> reproduced on Windows 7 32bit. I have managed to reproduce it 3 times on
> Fedora 26 x86_64 with a fresh build from this afternoon. I reloaded the page
> if it did not reproduce on the first load.

Well let me first try out those two URLs tomorrow and see if I hit the assertion with one of them, and if not I'll bother you again!  Thanks a lot for your help so far.
Attached file tracking-urls.txt
There where 59 total urls. I scrubbed two. I ran these on a debug build on Fedora using the marionette based runner and reproduced this assertion on

https://career.aglasem.com/ssc-cpo-si-asi-admit-card/ 

I repeated manually and also reproduced again. A saved version did not reproduce so get it while you can.
So what's happening here is that at the time the assertion occurs, the channel in the child process doesn't get the HttpChannelChild::ProcessNotifyTrackingResource() function called on it yet so we think the channels resource isn't a tracking resource.

I noticed that bug 1015466 and bug 1338493 have completely changed the way that these notifications are delivered in between the parent and child processes since I wrote this code.  We send this notification from the parent in <https://searchfox.org/mozilla-central/rev/d08b24e613cac8c9c5a4131452459241010701e0/netwerk/protocol/http/HttpChannelParent.cpp#1751> and expect to receive it in <https://searchfox.org/mozilla-central/rev/d08b24e613cac8c9c5a4131452459241010701e0/netwerk/protocol/http/HttpBackgroundChannelChild.cpp#404>.

SC, in the current setup, is there anything that guarantees that RecvNotifyTrackingResource() gets called before OnStopRequest()?  That seems to not be the case.  Can we enforce such a guarantee somehow?
Flags: needinfo?(schien)
After bug 1325054 is landed, RecvNotifyTrackingResource is always called before OnStartRequest. Therefore we can guarantee that RecvNotifyTrackingResource will get called before OnStopRequest.
Flags: needinfo?(schien)
(In reply to (PTO 9/21 - 9/24) Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment #16)
> After bug 1325054 is landed, RecvNotifyTrackingResource is always called
> before OnStartRequest. Therefore we can guarantee that
> RecvNotifyTrackingResource will get called before OnStopRequest.

Doesn't bug 1370211 comment 3 kind of contradict what you said here?  :-)

I have tried to debug this for several days, and have caught this under rr a couple of times.  Frustratingly, each time I haven't been able to replay the parent process inside rr so haven't been able to check what is happening on the parent side, but I have checked all of the logic on the child side and from what I can tell, the *only* way this can be happening is if RecvNotifyTrackingResource gets called too late in some situations.  This was happening with the Google Analytics script URL for me, so I'm 100% sure that this is a race condition based on that, since that script is classified as a tracking resource and there is no reason for RecvNotifyTrackingResource to not get called in the child process if a classification on that URL is triggered.

Is there any chance you can have a look at this please?  I'm not familiar with the Necko parts involved here, they're all new to me.  :-)  Thanks!
Flags: needinfo?(schien)
Yes I'm already digging to this bug, just too lazy to assign it to myself. ;)
Assignee: nobody → schien
Flags: needinfo?(schien)
SC and I found that this should have nothing to do with RecvNotifyTrackingResource being called too late.

From the log, I saw that the http channel at child side was intercepted by a service worker. AsyncOpen returned at [1]. So, nsChannelClassifier in parent process was not involved to check this URI. That's why this channel is not marked as tracking resource.

I think we should somehow make nsChannelClassifier work in child process, so the intercepted channel can by marked as tracking resource.

Ehsan, Francois, what do you think?

[1] https://searchfox.org/mozilla-central/rev/f54c1723befe6bcc7229f005217d5c681128fcad/netwerk/protocol/http/HttpChannelChild.cpp#2406
Flags: needinfo?(francois)
Flags: needinfo?(ehsan)
Thanks @kershaw for figuring out the root cause! Unassigned myself based on comment #20, since this might not be fixed in Necko.
Assignee: schien → nobody
Oh interesting, thanks for the investigation.

I guess we can use asyncClassifyLocalWithTables() from the content process in order to classify these URLs.  However we have been trying to avoid making an IPC roundtrip to the parent process for channels that are intercepted by a service worker so far.  The result of this classification needs to arrive before OnStopRequest, but still it would be nice if we could avoid the IPC if we can.  Francois, can you think of a way to do that?

Also, should we change the code in nsChannelClassifier to do this in addition to what it currently does, or just add some custom code for the case of intercepted channels?
Flags: needinfo?(ehsan)
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #22)
> The result of this classification
> needs to arrive before OnStopRequest, but still it would be nice if we could
> avoid the IPC if we can.  Francois, can you think of a way to do that?

Are we talking about the classification in the case where TP is in "passive" mode (not blocking anything) or when it's turned ON? In the case where it's turned on, the result needs to arrive before we even connect to the server.

I'm not sure if we can easily avoid going to the main thread for the classification. Updates need to be able to block local list lookups while they swap the old list with the new one.
Flags: needinfo?(francois)
(In reply to François Marier [:francois] from comment #23)
> (In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from
> comment #22)
> > The result of this classification
> > needs to arrive before OnStopRequest, but still it would be nice if we could
> > avoid the IPC if we can.  Francois, can you think of a way to do that?
> 
> Are we talking about the classification in the case where TP is in "passive"
> mode (not blocking anything) or when it's turned ON? In the case where it's
> turned on, the result needs to arrive before we even connect to the server.
> 

Yes, we are talking about the channel annotation. SetIsTrackingResource() is only called when TP is disabled.

> I'm not sure if we can easily avoid going to the main thread for the
> classification. Updates need to be able to block local list lookups while
> they swap the old list with the new one.

The thing in this bug is that there is a tracking script which is loaded multiple times on a site where there is a service worker registered. At first time that script is loaded from network, and the second time it is from the service worker.
When a http channel is intercepted by a service worker in child process, the child channel will not send a IPC to parent process to build the real http channel. Unfortunately, the channel classification is only happened in parent process now. So, the child channel will not be marked as a tracking resource.

I think maybe we can add a table in nsChannelClassifier to cache tracking urls. In this way, we can do the url classification for the intercepted channel in child process. In parent process, we can also avoid the redundant db check if there is a hit in the table.

Any thoughts?
Flags: needinfo?(francois)
Here is an alternative proposal:

Perhaps it would make sense to record in the DOM Cache whether a channel came from a tracking resource, and when synthesizing a response from a cache entry we can mark the channel as tracking if the cached content originally came from ma tracking channel?
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #25)
> Perhaps it would make sense to record in the DOM Cache whether a channel
> came from a tracking resource, and when synthesizing a response from a cache
> entry we can mark the channel as tracking if the cached content originally
> came from ma tracking channel?

If you want an invariant, then I don't think this is adequate.  Many sites with synthesize a new Response based on the fetch'd Response.  They do this to tweak the Response, etc.  This would lose your hidden flag.

Is this something that could wait until after we move interception to the parent?
Flags: needinfo?(francois)
(In reply to Ben Kelly [:bkelly] from comment #26)
> (In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from
> comment #25)
> > Perhaps it would make sense to record in the DOM Cache whether a channel
> > came from a tracking resource, and when synthesizing a response from a cache
> > entry we can mark the channel as tracking if the cached content originally
> > came from ma tracking channel?
> 
> If you want an invariant, then I don't think this is adequate.  Many sites
> with synthesize a new Response based on the fetch'd Response.  They do this
> to tweak the Response, etc.  This would lose your hidden flag.

Yes, good point.  That solution would only work for the simplest case of the Response being cached untouched, so we'd still have this problem for other cases.

> Is this something that could wait until after we move interception to the
> parent?

Yes I think so.  As things are, the implication of this bug is that we may think some tracking scripts are non-tracking and end up not throttling them which is unfortunate but not disastrous.

Do you mind setting up the correct bug dependencies so that we remember to get back to this once the work to move the interception to the parent process is finished please?  Thanks!
Component: DOM → DOM: Core & HTML
Duplicate of this bug: 1556448
Crash Signature: [@ mozilla::dom::Document::NoteScriptTrackingStatus mozilla::dom::ScriptLoader::FillCompileOptionsForRequest mozilla::dom::ScriptLoader::AttemptAsyncScriptCompile mozilla::dom::ScriptLoader::CompileOffThreadOrProcessRequest mozilla::dom::ScriptLoader::Pro…

Closing because no crashes reported for 12 weeks.

Status: REOPENED → RESOLVED
Closed: 4 years ago1 year ago
Resolution: --- → WORKSFORME

Bughunter can still reproduce this on a number of urls though they are not 100% reproducible.

I did just reproduce this assertion twice in a row on Fedora 30 Nightly debug with https://www.jeux-gratuits.com/jeux-de-plateau.html though it stopped reproducing after that.

Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---

Closing because no crashes reported for 12 weeks.

Status: REOPENED → RESOLVED
Closed: 1 year ago11 months ago
Resolution: --- → WORKSFORME

This is an assertion failure which would not appear in Socorro crash reports. It is also apparently not covered by any tests therefore sheriffs will not record failures. However, during the end of year retest of Bughunter crashing urls, I've reproduced this assertion on 118 urls in the last 5 days.

Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---

This is still happening in Bughunter and is one of the most common crashes I am seeing. If we aren't going to fix this, would it be ok to convert the fatal assertion to a NS_ASSERTION and move on? Otherwise this is just preventing me from find other crashes.

I think ehsan might be the right person to answer to that.

Flags: needinfo?(ehsan)

I'm not actually, Kershaw is probably the right person.

Now that we have parent process interception for service workers riding the trains in 73, do you have any plans to continue the work where we left it off (comment 24)? Thanks!

Flags: needinfo?(ehsan)
Flags: needinfo?(kershaw)

(In reply to :ehsan akhgari from comment #36)

I'm not actually, Kershaw is probably the right person.

Now that we have parent process interception for service workers riding the trains in 73, do you have any plans to continue the work where we left it off (comment 24)? Thanks!

Sorry that I think I am not the right one to fix this right now.
Maybe Dimi can help?

Flags: needinfo?(kershaw) → needinfo?(dlee)

(In reply to Kershaw Chang [:kershaw] from comment #37)

Sorry that I think I am not the right one to fix this right now.
Maybe Dimi can help?

Talked to Kershaw yesterday, a lot of things have changed since the bug was filed, we need to figure out what the current situation is.
A lot of stuff on my plate right now so I won't be able to check it soon, if it is ok to wait for a while, then I can help :)

Flags: needinfo?(dlee)

Since ServiceWorker intercepts channels in the parent now, this bug can probably be solved by Bug 1437626

Depends on: 1437626

Closing because no crashes reported for 12 weeks.

Status: REOPENED → RESOLVED
Closed: 11 months ago7 months ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.