Closed Bug 1127618 Opened 9 years ago Closed 9 years ago

PUSH_PROMISE is immediately rejected with a RST_STREAM

Categories

(Core :: Networking, defect)

38 Branch
x86_64
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla42
Tracking Status
firefox42 --- fixed

People

(Reporter: shawn.bissell, Assigned: u408661)

References

Details

(Whiteboard: [spdy])

Attachments

(3 files, 8 obsolete files)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.91 Safari/537.36

Steps to reproduce:

Using a nghttpd 0.7.3 server setup to push resources ie
nghttpd 8443 server.key cert.pem -p/=/resources/test.css,/resources/simple.js

Hit the site using firefox nightly 38.0a1 (2015-01-29)

I have setup a server here for easier testing
https://54.169.190.11:8443/



Actual results:

Inspecting traffic in wireshark you can see that each PUSH_PROMISE frame is rejected with a RST_STREAM
the firefox log has the following message for each reset
2015-01-30 00:42:50.980485 UTC - 272728064[1007418c0]: Http2Session::RecvPushPromise 1185eb000 ID 0x2 assoc ID 0xD paddingLength 0 padded 0
2015-01-30 00:42:50.980492 UTC - 272728064[1007418c0]: Http2Session::RecvPushPromise Push Recevied without loadgroup cache
2015-01-30 00:42:50.980497 UTC - 272728064[1007418c0]: Http2Session::GenerateRst 1185eb000 0x2 7
2015-01-30 00:42:50.980510 UTC - 272728064[1007418c0]: Http2Session::LogIO 1185eb000 stream=0 id=0x0 [Generate Reset]




Expected results:

The pushed streams should be adopted and used
Chrome Version 42.0.2289.0 canary correctly adopts these pushed streams using the same test server
Hardware: x86 → x86_64
Component: Untriaged → Networking
Product: Firefox → Core
Version: Firefox 38 → 38 Branch
Whiteboard: [spdy]
This looks like exactly the same thing I had reported to me via twitter, but I was never able to reproduce locally. I'll take a look at this to see if this provides a more easily reproduced test case.
So I (FINALLY!) saw this in practice, and it appears (at first glance) to be e10s-related - I was unable to reproduce with e10s disabled (which is my default state due to some addons I use being broken with e10s), but once I enabled e10s, I can make this happen (not sure why I didn't think to try that previously).

Shawn, thanks for the report. I'll start digging in on this, and circle back around if I need anything else from you.
Assignee: nobody → hurley
Status: UNCONFIRMED → NEW
Ever confirmed: true
OK, problem found (and it is, indeed, e10s-only). Description incoming for those playing along at home:

We hang our push cache off of the loadGroup (since that's our best proxy for "these channels belong to the same page") (really, it's the loadGroupCI). Problem is, channels in the parent process in e10s don't *have* a loadGroup (because it's never been necessary, and keeping loadGroup state between child and parent would be nasty and hard, according to jduell). So, in e10s-land, no loadGroup->no loadGroupCI->no push cache->no pushes allowed.

The solution jduell and I hacked together on irc over the course of about 5 minutes is to, when the channel's loadGroup gets assigned in the child, just send across an identifier (the one time sending a pointer actually makes sense) to the parent, and use that as a hash key to look up push caches in a new, shiny, nsIPushCacheService. Then we'll be able to accept h2 pushes in e10s-land.

Details will be more involved (aren't they always?) but that's the basic idea.
Attached patch patch (obsolete) — Splinter Review
Just looking for feedback on this one, as I need to poke a bit more at the lifetimes of the scheduling contexts to convince myself that I really do have it right, and this will work (fwiw, a simple local test works just fine - pushes are successfully retrieved from the cache), and to make sure a couple try failures aren't my fault https://treeherder.mozilla.org/#/jobs?repo=try&revision=29bcd2638d46
Attachment #8579636 - Flags: feedback?(mcmanus)
at least its a small patch :(
Comment on attachment 8579636 [details] [diff] [review]
patch

Review of attachment 8579636 [details] [diff] [review]:
-----------------------------------------------------------------

broadly, this is what I thought you were talking about and it sgtm

I was just joking about the size - in general I think this makes things less complex.. except for the bits about passing an id around, but that's rather unavoidable.

as a minor thing I do feel a little squeemish about memsetting and memcpying nsID.. do you think we could give it init() and use a copy constructor?
Attachment #8579636 - Flags: feedback?(mcmanus) → feedback+
I'm fine with giving nsID an init() and copy constructor - my squeamishness was torn between the memset/memcpy pairing and modifying code under xpcom/ pretty evenly, so I chose the route that would require fewer reviewers :) I'll make that change.
Attached patch patch (obsolete) — Splinter Review
So I only added a Clear() to nsID to zero everything out - everything else
that was compiler-provided works fine (much to my chagrin, since I tried to
outsmart the compiler in my previous patch). Other than that (and the
associated changes), this is the same patch as before.

@froydnj - I'm just looking for your review on the changes to nsID.{h,cpp}
Attachment #8579636 - Attachment is obsolete: true
Attachment #8584850 - Flags: review?(nfroyd)
Attachment #8584850 - Flags: review?(mcmanus)
Oh, and here's a try run showing I didn't break anything (yet) https://treeherder.mozilla.org/#/jobs?repo=try&revision=b9700f615104
Comment on attachment 8584850 [details] [diff] [review]
patch

Review of attachment 8584850 [details] [diff] [review]:
-----------------------------------------------------------------

r=me for the nsID bits.
Attachment #8584850 - Flags: review?(nfroyd) → review+
Comment on attachment 8584850 [details] [diff] [review]
patch

Review of attachment 8584850 [details] [diff] [review]:
-----------------------------------------------------------------

looks good for a big patch.. r+ if you agree to the comments

::: netwerk/base/SchedulingContextService.cpp
@@ +1,3 @@
> +/* This Source Code Form is subject to the terms of the Mozilla Public
> + * License, v. 2.0. If a copy of the MPL was not distributed with this
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */

2 space indents along with emacs and vi directives (this has a mix of 2 and 4)

@@ +59,5 @@
> +SchedulingContext::RemoveBlockingTransaction(uint32_t *_retval)
> +{
> +    NS_ENSURE_ARG_POINTER(_retval);
> +    mBlockingTransactionCount--;
> +    *_retval = mBlockingTransactionCount;

outval instead of retval

@@ +94,5 @@
> +NS_IMPL_ISUPPORTS(SchedulingContextService, nsISchedulingContextService, nsIObserver)
> +
> +SchedulingContextService::SchedulingContextService()
> +{
> +  MOZ_ASSERT(!sSelf, "multiple scs instances!");

anytime there is a single non atomic object like sSelf I like to see a thread assertion

::: netwerk/base/SchedulingContextService.h
@@ +34,5 @@
> +  virtual ~SchedulingContextService();
> +
> +  static SchedulingContextService *sSelf;
> +
> +  nsInterfaceHashtable<nsIDHashKey, nsIWeakReference> mTable;

so this is THREADSAFE_ISUPPORTS, how is the hashtable protected? minimally comments and thread assertions

::: netwerk/base/nsILoadGroup.idl
@@ +14,5 @@
>  
>  /**
>   * A load group maintains a collection of nsIRequest objects. 
>   */
>  [scriptable, uuid(afb57ac2-bce5-4ee3-bb34-385089a9ba5c)]

uuid bump

::: netwerk/base/nsLoadGroup.h
@@ +69,5 @@
>      uint32_t                        mDefaultLoadFlags;
>  
>      nsCOMPtr<nsILoadGroup>          mLoadGroup; // load groups can contain load groups
>      nsCOMPtr<nsIInterfaceRequestor> mCallbacks;
> +    nsCOMPtr<nsISchedulingContext>  mSCKungFuDeathGrip;

please rename mSchedulingContext

@@ +70,5 @@
>  
>      nsCOMPtr<nsILoadGroup>          mLoadGroup; // load groups can contain load groups
>      nsCOMPtr<nsIInterfaceRequestor> mCallbacks;
> +    nsCOMPtr<nsISchedulingContext>  mSCKungFuDeathGrip;
> +    nsID                            mSchedulingContextID;

as long as you have the strong reference to the nsISchedulingContext I don't think you need this ID as a member var

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +612,3 @@
>  
> +    nsresult rv;
> +    nsCOMPtr<nsISchedulingContextService> scsvc =

bonus points for having nsHttpHandler cache this service, so you can just get back a real pointer without ref counts and getService lookups

::: netwerk/protocol/http/nsIHttpChannel.idl
@@ +322,5 @@
> +
> +    /**
> +     * Identifies the scheduling context for this load.
> +     */
> +    [noscript] attribute nsID schedulingContextID;

uuid bump..
Attachment #8584850 - Flags: review?(mcmanus) → review+
(In reply to Patrick McManus [:mcmanus] from comment #11)
> Comment on attachment 8584850 [details] [diff] [review]
> patch
> 
> Review of attachment 8584850 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> looks good for a big patch.. r+ if you agree to the comments
> 
> ::: netwerk/base/SchedulingContextService.cpp
> @@ +1,3 @@
> > +/* This Source Code Form is subject to the terms of the Mozilla Public
> > + * License, v. 2.0. If a copy of the MPL was not distributed with this
> > + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> 
> 2 space indents along with emacs and vi directives (this has a mix of 2 and
> 4)

Ugh, this is what I get for trying out a new editor partway through :(

> @@ +94,5 @@
> > +NS_IMPL_ISUPPORTS(SchedulingContextService, nsISchedulingContextService, nsIObserver)
> > +
> > +SchedulingContextService::SchedulingContextService()
> > +{
> > +  MOZ_ASSERT(!sSelf, "multiple scs instances!");
> 
> anytime there is a single non atomic object like sSelf I like to see a
> thread assertion
> 
> ::: netwerk/base/SchedulingContextService.h
> @@ +34,5 @@
> > +  virtual ~SchedulingContextService();
> > +
> > +  static SchedulingContextService *sSelf;
> > +
> > +  nsInterfaceHashtable<nsIDHashKey, nsIWeakReference> mTable;
> 
> so this is THREADSAFE_ISUPPORTS, how is the hashtable protected? minimally
> comments and thread assertions

I *think* that will work (I think maybe the weakref required this to be threadsafe, even though it should be main thread-only? I dunno, it felt odd to me that I needed thread safety, and I can't remember why now.) I'll have to double check, and either assert or add a lock as appropriate.

All the other changes make sense. If the thread safety bit gets more involved, I'll re-r? you to sanity check me :)
Attached patch interdiff (obsolete) — Splinter Review
Here's an interdiff from the original patch to a new version with your review comments addressed. Still works as advertised :)
Attachment #8588731 - Flags: review?(mcmanus)
Attachment #8588731 - Flags: review?(mcmanus) → review+
Attached patch patch for checkin (obsolete) — Splinter Review
Final patch for checkin whenever inbound finally opens up (got tired of waiting). This is the combo of the r+'d patch and the r+'d interdiff.

Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bbd577f0563e
Attachment #8584850 - Attachment is obsolete: true
Attachment #8588731 - Attachment is obsolete: true
Attachment #8589893 - Flags: review+
Keywords: checkin-needed
Attached patch patch for checkin (obsolete) — Splinter Review
Forgot to make this a valid-looking patch last time :(
Attachment #8589893 - Attachment is obsolete: true
Attachment #8589899 - Flags: review+
Putting this here to hopefully get feedback from people who know weakptr better than I...
Attachment #8589899 - Attachment is obsolete: true
Same here...
Comment on attachment 8620560 [details] [diff] [review]
patch as originally landed (without logging)

Benjamin - I'm having some troubles with this patch (specifically around its use of weak references) that has caused it to bounce, that I can't for the life of me figure out (matters aren't helped that the failures only seem to occur on the b2g emulator). I'm hoping to get a pair of eyes that knows more than I do about our weak references to look particularly at SchedulingContextService::GetSchedulingContext (in, appropriately, SchedulingContextService.cpp) to see if I'm doing anything totally crazy.

What SEEMS to be happening is as follows: GetSchedulingContext is called (usually by nsHttpTransaction in the crashes I've seen), and hands back what it thinks is a valid strong reference, but in actuality the memory is poisoned (mRawPtr = 0xa5a5...). When I go to use that reference later (after null-checking it), crashiness ensues. I can't for the life of me figure out why I could be getting myself in that state, but it happens nonetheless. Like I said, though, it ONLY seems to happen on the emulator, so I'm wondering if there's maybe a possible race somewhere I'm not seeing? (The class I'm holding weak references to, SchedulingContext, is declared THREADSAFE_ISUPPORTS by necessity - is that not kosher with weak refs?)

Please forward this f? to someone else if you don't have time to handle this, or if you know of someone who would be better positioned to help out. Ping me on IRC (:nwgh) if you need clarification on my questions. Thanks.
Attachment #8620560 - Flags: feedback?(benjamin)
I don't think I can help, because I don't know anything about the threading model here.

Weak-references are single-threaded by design, so if you are perhaps trying to use a weak reference across multiple threads that is bound to cause this kind of behavior. In general I recommend trying to design the code so that it doesn't need weak references.
Yes, it's seeming more and more likely that redesigning the code to avoid weak refs might be the path in my future. One clarification about the single-threaded nature of weak refs, though - if the weak ref itself is only used on one thread (main thread, in the case of this patch), but strong refs of those objects with weak refs are used on multiple threads, does that still guarantee this kind of nasty behavior? (It would seem that's the case, given what I'm seeing, but I still find it odd that, at least in infra, this behavior is only being seen on the emulator.)
Yes, that would still cause the nasty behavior. The deletion notification/weakref-nulling isn't threadsafe.

Emulators are good at detecting race bugs, although valgrind race detection would probably also notice this.
Alright, thanks! Back to ye olde drawinge boarde for me, then...
Attachment #8620560 - Flags: feedback?(benjamin)
Attached patch patch in progress (obsolete) — Splinter Review
Attachment #8620560 - Attachment is obsolete: true
Attachment #8620561 - Attachment is obsolete: true
Attachment #8630693 - Flags: feedback?(jduell.mcbugs)
And once again, I put my intended comment in the wrong area of the git-bz editor, so here's the important bits:

Jason - here's the patch discussed during the necko meeting today. The bits
that are of interest for e10s purposes are the changes to nsLoadGroup
(specifically the area in the nsLoadGroup destructor that checks for
mSchedulingContext and later IsNeckoChild) plus the associated changes to
PNecko for sending that message across. If you see anything painfully
obviously wrong (or even conceivably wrong) let me know.
And here's the try run with all the glorious, glorious red... https://treeherder.mozilla.org/#/jobs?repo=try&revision=9270911b3da7
Comment on attachment 8630693 [details] [diff] [review]
patch in progress

Review of attachment 8630693 [details] [diff] [review]:
-----------------------------------------------------------------

Basic IPDL design for removing the scheduling context looks fine.

::: netwerk/base/nsLoadGroup.cpp
@@ +140,5 @@
> +            char scid_str[NSID_LENGTH];
> +            scid.ToProvidedString(scid_str);
> +
> +            nsCString scid_nscs;
> +            scid_nscs.AssignASCII(scid_str);

Is there some way to avoid copying the string?  nsDependentCString(scid_str), etc.  Not a big deal--one extra malloc and memcpy isn't going to ruin anyone's day.
Attachment #8630693 - Flags: feedback?(jduell.mcbugs) → feedback+
Bill - Jason said you might be able to provide some insight into the whole "corrupted actor state" thing happening in the try run liked in comment 28 (the patch causing the failures in question is in comment 26). I'm totally stumped as to what insanity I could be doing to cause this. Thanks!
Flags: needinfo?(wmccloskey)
I suspect it means the PNeckoChild actor has already been freed. There are only a couple valid values for the mState enumeration and the switch statement checks all of them. So, as far as I can tell, the only way we can get to the "corrupted actor state" assertion is if mState has some garbage value.

The stack I see looks like this:
14:57:01 INFO - [Child 1963] ###!!! ABORT: corrupted actor state: file /builds/slave/try-lx-d-000000000000000000000/build/src/obj-firefox/ipc/ipdl/PNecko.cpp, line 34
14:57:28 INFO - #01: mozilla::net::PNecko::Transition(mozilla::net::PNecko::State, mozilla::ipc::Trigger, mozilla::net::PNecko::State*) [obj-firefox/ipc/ipdl/PNecko.cpp:34]
14:57:28 INFO - #02: mozilla::net::PNeckoChild::SendRemoveSchedulingContext(nsCString const&) [obj-firefox/ipc/ipdl/PNeckoChild.cpp:1036]
14:57:28 INFO - #03: nsLoadGroup::~nsLoadGroup() [netwerk/base/nsLoadGroup.cpp:146]
14:57:28 INFO - #04: nsLoadGroup::~nsLoadGroup() [memory/mozalloc/mozalloc.h:210]
14:57:28 INFO - #05: nsLoadGroup::Internal::Release() [netwerk/base/nsLoadGroup.cpp:159]
14:57:28 INFO - #06: nsLoadGroup::Release() [netwerk/base/nsLoadGroup.cpp:159]
14:57:28 INFO - #07: nsCOMPtr<nsILoadGroup>::~nsCOMPtr() [xpcom/glue/nsCOMPtr.h:391]
14:57:28 INFO - #08: nsDocLoader::~nsDocLoader() [xpcom/glue/nsTArray.h:2044]
14:57:28 INFO - #09: nsDocLoader::~nsDocLoader() [memory/mozalloc/mozalloc.h:210]
14:57:28 INFO - #10: nsDocLoader::Release() [uriloader/base/nsDocLoader.cpp:173]
14:57:28 INFO - #11: nsCOMPtr_base::assign_assuming_AddRef(nsISupports*) [xpcom/glue/nsCOMPtr.h:336]
14:57:28 INFO - #12: nsCOMPtr_base::assign_with_AddRef(nsISupports*) [xpcom/glue/nsCOMPtr.cpp:52]
14:57:28 INFO - #13: FreeFactoryEntries [xpcom/components/nsComponentManager.cpp:1248]
14:57:28 INFO - #14: nsBaseHashtable<nsIDHashKey, nsFactoryEntry*, nsFactoryEntry*>::EnumerateRead(PLDHashOperator (*)(nsID const&, nsFactoryEntry*, void*), void*) const [xpcom/glue/nsBaseHashtable.h:175]
14:57:28 INFO - #15: nsComponentManagerImpl::FreeServices() [xpcom/components/nsComponentManager.cpp:1261]
14:57:28 INFO - #16: mozilla::ShutdownXPCOM(nsIServiceManager*) [xpcom/build/XPCOMInit.cpp:910] 

Since we're shutting down XPCOM, it doesn't seem too surprising that the NeckoChild object would already be dead.
Flags: needinfo?(wmccloskey)
Well... that just makes way too much sense :) Thanks, Bill.

Jason - do we have a reliable way of knowing if gNeckoChild actually points to a valid NeckoChild object? I checked NeckoChild.cpp, but NeckoChild::DestroyNeckoChild has a comment above it that says "Note: not actually called", and that's where gNeckoChild would be nulled out, so a null check won't work (and would explain how I was able to get the corrupted actor state instead of a null pointer deref)
Flags: needinfo?(jduell.mcbugs)
The "Not actually called" comment is from a long time ago, and I'm not sure it's really true--it'd be worth checking to see if the function get entered during PContent shutdown.  It's certainly the most obvious place to null out gNeckoChild.

I know at some point we were sharing multiple children (and thus multiple PNeckos, IIUC) within the same process.  If that's still true at times, then the "static bool alreadyDestroyed = false" may be a broken way to null out gNeckoChild, and we might need to make it a static member variable instead.

Bill, do you know if we're still sometimes mashing together multiple PContent's into a single child process?
Flags: needinfo?(jduell.mcbugs) → needinfo?(wmccloskey)
Well, DestroyNeckoChild is never called statically, so we might as well delete it.

IPDL uses a hierarchical lifetime model, so a PNecko will be destroyed when its parent PContent is destroyed. That's why we don't actually leak memory here. Perhaps we could null out gNeckoChild in ~NeckoChild?

> Bill, do you know if we're still sometimes mashing together multiple PContent's into a single
> child process?

A ContentChild is always unique in a child process. I assume the same is true of a NeckoChild, since it seems like we have one per ContentChild.
Flags: needinfo?(wmccloskey)
Alright, I've tried nulling gNeckoChild in ~NeckoChild... let's see what happens! https://treeherder.mozilla.org/#/jobs?repo=try&revision=a1e46d92203e
> DestroyNeckoChild is never called statically, so we might as well delete it.

+1.  It's confusing to leave in code that doesn't get called.  I assume we need to keep the function defined, but we should remove the current, unused body.
(In reply to Jason Duell [:jduell] (needinfo? me) from comment #36)
> +1.  It's confusing to leave in code that doesn't get called.  I assume we
> need to keep the function defined, but we should remove the current, unused
> body.

I think we can actually delete it. It doesn't have any significance for IPDL. I assume it had a caller somewhere that was deleted or something.
Well, the try run in comment 35 looks MUCH better! Same patch as before, but I deleted DestroyNeckoChild and null out gNeckoChild in ~NeckoChild - now everything seems happy!

I'll double-check to make sure my push tests still work (no reason they shouldn't) and then it looks like we'll actually be ready to go!
This is just a patch to clean up some compile errors I found in the latest rebased version of my patch - namely, some compile errors caused because some files moved into a new Unified_netwerk_base_#.cpp
Attachment #8638120 - Flags: review?(mcmanus)
Attached patch patchSplinter Review
And here's the patch that actually fixes push in e10s-land. Almost the same as the one Jason f+'d, but now doesn't cause shutdown crashes :)
Attachment #8630693 - Attachment is obsolete: true
Attachment #8638121 - Flags: review?(mcmanus)
Comment on attachment 8638121 [details] [diff] [review]
patch

Review of attachment 8638121 [details] [diff] [review]:
-----------------------------------------------------------------

thanks nick and jason
Attachment #8638121 - Flags: review?(mcmanus) → review+
Attachment #8638120 - Flags: review?(mcmanus) → review+
The push in comment 43 was to remove a debugging printf that got left hanging around
Depends on: 1198058
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: