fatal assert in nsPrincipal::GetAppId()

RESOLVED FIXED in mozilla32

Status

()

RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: Dolske, Assigned: bzbarsky)

Tracking

(Depends on: 1 bug)

unspecified
mozilla32
x86
Mac OS X
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 1 obsolete attachment)

(Reporter)

Description

5 years ago
My patch in bug 839961 is a rather mundane refactoring of some login manager code. Most of it just moving some code from an xpcom component to a JSM, and adding a couple of event listeners in a frame script.

Unfortunately it is reliably causing a bizarre assert in test_printpreview_bug396024.xul, part of mochitest-chrome / M(oth), on Windows and Ubuntu debug builds. Oddly OS X seems to work fine (no print preview there, I guess?).

https://tbpl.mozilla.org/?tree=Try&rev=70157d046e35

From the log:

INFO -  Assertion failure: false, at ../../../caps/src/nsPrincipal.cpp:531

TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/layout/base/tests/chrome/test_printpreview_bug396024.xul | Exited with code 11 during test run

Here's the stack (edited for readability):

 -  Thread 0 
 0  libxul.so!nsPrincipal::GetAppId
 1  libxul.so!nsPermissionManager::CommonTestPermission
 2  libxul.so!nsComponentManagerImpl::GetServiceByContractID
 3  libplds4.so!PL_HashTableLookup [plhash.c:70157d046e35 : 333 + 0xc]
 4  libxul.so!nsContentUtils::IsSystemPrincipal
 5  libxul.so!nsPermissionManager::TestPermissionFromPrincipal
 6  libxul.so!TestSitePerm [nsContentUtils.cpp:70157d046e35 : 2895 + 0x18]
 7  libxul.so!nsPrincipal::GetDomain
 8  libxul.so!nsContentUtils::AllowXULXBLForPrincipal
 9  libxul.so!XPCWrappedNativeScope::XPCWrappedNativeScope
10  libxul.so!xpc::CreateGlobalObject
11  libxul.so!mozilla::BlockingResourceBase::PrintCycle
12  libxul.so!XPCAutoLock::~XPCAutoLock
13  libxul.so!XPCNativeScriptableInfo::Construct
14  libxul.so!XPCWrappedNative::WrapNewGlobal
15  libxul.so!nsLogAddRefCamera [CameraCommon.h:70157d046e35 : 110 + 0xe]
16  libxul.so!nsLogReleaseCamera [CameraCommon.h:70157d046e35 : 115 + 0xe]
17  libxul.so!nsDOMClassInfo::Release
18  libxul.so!xpcObjectHelper::GetScriptableFlags
19  libxul.so!nsXPConnect::InitClassesWithNewWrappedGlobal
20  libxul.so!pref_HashTableLookup
21  libxul.so!nsGlobalWindow::SetNewDocument
22  libxul.so + 0x1021cb1
23  libxul.so!nsCOMPtr<nsINode>::operator=
24  libxul.so!nsDocumentViewer::SetDocumentInternal
25  libxul.so!NS_TableDrivenQI
26  libxul.so!nsHTMLDocument::QueryInterface
27  libxul.so!nsDocumentViewer::SetDOMDocument
28  libxul.so!nsPrintObject::Init
29  libxul.so!GetBloatEntry [nsTraceRefcntImpl.cpp:70157d046e35 : 429 + 0x7]
30  libxul.so!nsPrintEngine::DoCommonPrint
31  libxul.so!NS_LogAddRef [nsTraceRefcntImpl.cpp:70157d046e35 : 251 + 0x7]
32  libnspr4.so!PR_Unlock [ptsynch.c:70157d046e35 : 205 + 0x7]
33  libxul.so!nsPrintEngine::CommonPrint
34  libxul.so!nsPrintEngine::PrintPreview
35  libxul.so!nsDocumentViewer::PrintPreview
36  libxul.so!NS_InvokeByIndex [xptcinvoke_x86_64_unix.cpp:70157d046e35 : 164 + 0x40]
37  libxul.so!CallMethodHelper::InitializeDispatchParams
38  libxul.so!XPCJSRuntime::SetPendingException
39  libxul.so!CallMethodHelper::Call
40  libnspr4.so!PR_Unlock [ptsynch.c:70157d046e35 : 205 + 0x7]
41  libxul.so!xptiInterfaceEntry::GetMethodInfo
42  libxul.so!XPCWrappedNative::CallMethod
43  libxul.so + 0x11ce955
44  libxul.so!NS_LogAddRef [nsTraceRefcntImpl.cpp:70157d046e35 : 251 + 0x7]
45  libnspr4.so!PR_Unlock [ptsynch.c:70157d046e35 : 205 + 0x7]
46  libxul.so!IS_WN_WRAPPER_OBJECT [xpcpublic.h:70157d046e35 : 110 + 0x6]
47  libxul.so!XPCCallContext::XPCCallContext
48  libxul.so!js::GCMarker::markDelayedChildren
49  libxul.so!XPC_WN_GetterSetter
50  libxul.so!XPC_WN_CallMethod
(Reporter)

Comment 1

5 years ago
Hmm, camera stuff around frame 16? Do we really muck around that in the middle of print preview? Or is some of this stack lies?
Where is the principal that's involved here created?
(Reporter)

Comment 3

5 years ago
A little more info:

This is easily reproducible on my Windows box. Apply the main patch from bug 839961, and run |mach mochitest-chrome layout/base/tests/chrome/test_printpreview_bug396024.xul|.

The stack above does have some lies in it. MSVC gives me http://pastebin.mozilla.org/2529483, which is basically a setTimeout running some JS, which calls into nsDocumentViewer::PrintPreview().

  *fatal assert*
  nsPrincipal::GetAppId
  nsPermissionManager::CommonTestPermission
  nsPermissionManager::TestPermissionFromPrincipal
  TestSitePerm
  nsContentUtils::IsSitePermAllow
  nsContentUtils::AllowXULXBLForPrincipal
  TreatAsRemoteXUL
  CreateNativeGlobalForInner
  nsGlobalWindow::SetNewDocument
  nsDocumentViewer::SetDocumentInternal
  nsDocumentViewer::SetDOMDocument
  nsPrintObject::Init
  nsPrintEngine::DoCommonPrint
  nsPrintEngine::CommonPrint
  nsPrintEngine::PrintPreview
  nsDocumentViewer::PrintPreview
  ...JS...
  ...Timer...

DumpJSStack() seems to be broken for me, but the URI being passed in to CreateNativeGlobalForInner() 8 frames from the top is "chrome://mochitests/content/chrome/layout/base/tests/chrome/printpreview_bug396024_helper.xul".

Some dump debugging pins this as happening at the end of http://mxr.mozilla.org/mozilla-central/source/layout/base/tests/chrome/printpreview_bug396024_helper.xul; we're in run5() calling printpreview() calling gWbp.printPreview().
(Reporter)

Comment 4

5 years ago
Created attachment 763191 [details]
principal creation

Ok, I set a breakpoint in nsPrincipal::Init(), and then clicked to start the test. There were a handful (< 10) initialized with aAppId == 0, and then one created with 4294967295 (-1). Continuing on from that then hit the fatal assert. Attached is the stack from that principal's creation.

Does this help? It all means nothing to me. :-)
(Reporter)

Comment 5

5 years ago
I'm utterly baffled at what my patch from bug 839961 is doing to break this...

If I revert the changes to browser/base/content/content.js (a frame script), the test still asserts and dies.

If I then remove the single LoginManangerContent call in nsLoginManager.js, it still asserts.

If I then remove nsLoginManager's lazyModuleGetter for this new JSM, and also comment out observer setup that the module does on first load, it... wait for it... still asserts.

At this point the only change from stock m-c is that I've deleted a bunch of code from nsLoginManager, and have a module which nothing calls into, should never be instantiated, and even if it did doesn't do anything but create a couple objects with methods on them.

I guess the deleted code from login manager was causing some subtle side effect that other code has come to incorrectly rely upon?
(Reporter)

Comment 6

5 years ago
(Oh, and I also reverted fully back to m-c and ran the test again -- works flawlessly.)
(Reporter)

Comment 7

5 years ago
Yeah, so. From a stock mozilla-central...

1) run test for sanity --> test ok
2) comment out the addProgressListener() call from the init() in
   nsLoginManager.js --> test asserts
3) restore addProgressListener() call, but put an immediate return in the
   listener's onStatusChange --> test ok

So it would appear something is relying on some side effect of having a web progress listener present, even if it doesn't actually do anything.

[Background/context: login manager does 3 basic things: provide an API for login manipulation, watch for form submissions to save passwords, and watch for page loads to autofill logins. Only the last of these is relevant here. It uses a web progress listener to watch for documents starting to load, upon which it adds a DOMContentLoaded listener, and once that fires for a page we try to autofill logins. One of the changes in bug 839961 is to remove the web progress listener, and just add the DOMContentLoaded listener straight from a frame script.]
Sounds vaguely related to bug 878886.
Oh, re-reading the stack in comment 0, it sounds like the same issue actually. The presence of a WPL probably affects the behavior of nsScriptSecurityManager::GetChannelPrincipal ?
(Reporter)

Comment 10

5 years ago
That bug and the stack in attachment 763191 [details] both go through nsScriptSecurityManager::GetChannelPrincipal(), so I guess it could be the same thing.
(Reporter)

Comment 11

5 years ago
I'm landing the fix to bug 839961 with a gross workaround (attachment 763783 [details] [diff] [review]) to avoid hitting this bug. Would still like to fix this to remove the gross hack, so login manager code doesn't have to be called for every document load.
Basically, when creating the channel you need to provide an nsILoadContext (which is retrived through the loadgroup) which returns NO_APP_IP from the appid getter.

Comment 13

4 years ago
In the attached stack trace, it seems that the issue occurs during the cloning made as part of creating the print preview document.

Any idea about why cloning the document is sometimes triggering the same issue as bug 878886? It seems related to timing or some relationship between the various objects being set lazily.
Created attachment 8418477 [details] [diff] [review]
patch?

I think I narrowed this down... The document in question has no docshell because its an empty "clone" doc created by nsHTMLDocument::Clone. Copying over the loadgroup state from the document being cloned fixes this locally for me.

I pushed that to try to confirm:
https://tbpl.mozilla.org/?tree=Try&rev=5269ec1c3a64

Not sure what other implications this might have, or whether a more specific fix would be better.
Attachment #8418477 - Flags: feedback?(jonas)
No longer blocks: 839961
(In reply to :Gavin Sharp (email gavin@gavinsharp.com) from comment #14)
> I pushed that to try to confirm:
> https://tbpl.mozilla.org/?tree=Try&rev=5269ec1c3a64

No luck :( Will need to try again tomorrow.
Attachment #8418477 - Flags: feedback?(jonas)
OK, I think I figured this out. Reproducing locally was not 100% reliable which made things annoying.

The test is causing us to clone a document for print preview. Cloning ends up calling Reset(), which calls GetChannelPrincipal to get a principal from the document's current channel, and eventually ends up setting it as the document's new principal (not sure what the point of this is...).

GetChannelPrincipal tries to get the relevant docShell by querying the channel's notificationCallbacks, but the channel (which is an nsBaseChannel for the document's initial about:blank load) has no notificationCallbacks anymore, because its mCallbacks is cleared in nsBaseChannel's OnStopRequest (http://hg.mozilla.org/mozilla-central/annotate/8bd82defae82/netwerk/base/src/nsBaseChannel.cpp#l741). So we fall into the GetCodebasePrincipalInternal(uri, UNKNOWN_APP_ID, false) case (http://hg.mozilla.org/mozilla-central/annotate/8bd82defae82/caps/src/nsScriptSecurityManager.cpp#l309), which means the document's principal ends up being set to an UNKNOWN_APP_ID principal, and so then comment 3 happens.

I think the only reason this isn't a perma-orange is because things work if the about:blank load hasn't hit OnStopRequest before we try cloning. The WPL hack from bug839961 happens to adjust timing to ensure that usually happens, apparently. Bug 910521 suggests there are other cases where it doesn't happen.

This seems to be a print-preview- (or document-cloning-) specific issue, but not sure how to fix it. Should Reset() do something different?

As a hackaround in the interim, we can probably adjust this test to avoid cloning a document whose about:blank load has finished.

Or we could just fix bug 878886 and remove this assertion, because I don't think it's getting us anything useful (we've created plenty of valid scenarios where this happens in practice).
Attachment #8418477 - Attachment is obsolete: true
(In reply to :Gavin Sharp (email gavin@gavinsharp.com) from comment #16)
> Or we could just fix bug 878886 and remove this assertion, because I don't
> think it's getting us anything useful (we've created plenty of valid
> scenarios where this happens in practice).

Removing the assertion and failing the permissions check as mcmanus suggests in bug 878886 comment 2 seems reasonable to me.

Sicking, what do you think?
Flags: needinfo?(jonas)
Cloning should use the principal of the document being cloned, I'd think.  And in fact, nsDocument::CloneDocHelper does the right SetPrincipal.  But then we go and call Reset()... why are we doing that?  We've already set the base URI and principal; the only other things it does are setting mChannel (which we could do directly) and calling ResetToURI.

Olli, can we nix that Reset call?
Flags: needinfo?(bugs)
There used to be some reason for the Reset, perhaps it was just mChannel.

But just calling ResetToURI and setting mChannel manually might be ok.
Flags: needinfo?(bugs)
Created attachment 8424095 [details] [diff] [review]
Make cloning static documents not reset the principal on the clone after setting it to the right thing

Justin, does this fix the issue you were running into?
Attachment #8424095 - Flags: feedback?(dolske)
Assignee: nobody → bzbarsky
Status: NEW → ASSIGNED
(In reply to Boris Zbarsky [:bz] from comment #20)
> Created attachment 8424095 [details] [diff] [review]
> Make cloning static documents not reset the principal on the clone after
> setting it to the right thing

Thanks bz!

I pushed a removal of the workaround to try, if this goes green we're good:
https://tbpl.mozilla.org/?tree=Try&rev=8c2ab86c309b

I also pushed without your fix just to confirm that it's still orange:
https://tbpl.mozilla.org/?tree=Try&rev=55507c4b01bc
Flags: needinfo?(jonas)
(In reply to :Gavin Sharp (email gavin@gavinsharp.com) from comment #21)
> I pushed a removal of the workaround to try, if this goes green we're good:
> https://tbpl.mozilla.org/?tree=Try&rev=8c2ab86c309b
> 
> I also pushed without your fix just to confirm that it's still orange:
> https://tbpl.mozilla.org/?tree=Try&rev=55507c4b01bc

We have a winner!
Attachment #8424095 - Flags: feedback?(dolske) → feedback+
Comment on attachment 8424095 [details] [diff] [review]
Make cloning static documents not reset the principal on the clone after setting it to the right thing

Awesome, thanks!  What's the workaround we should remove?
Attachment #8424095 - Flags: review?(bugs)
Flags: needinfo?(gavin.sharp)
Created attachment 8424529 [details] [diff] [review]
remove workaround

The workaround removal is https://hg.mozilla.org/try/rev/55507c4b01bc (attached here).
Flags: needinfo?(gavin.sharp)
Comment on attachment 8424095 [details] [diff] [review]
Make cloning static documents not reset the principal on the clone after setting it to the right thing

Hmm, should we use NS_GetFinalChannelURI, not GetURI, for consistency.
Attachment #8424095 - Flags: review?(bugs) → review+
> Hmm, should we use NS_GetFinalChannelURI, not GetURI, for consistency.

Yes!  Good catch!
https://hg.mozilla.org/mozilla-central/rev/8017511134f1
https://hg.mozilla.org/mozilla-central/rev/e79a7e494667
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
You need to log in before you can comment on or make changes to this bug.