Closed Bug 664163 Opened 8 years ago Closed 4 years ago

Fix Get(Local|Remote)(Address|Port) in HttpChannelChild

Categories

(Core :: Networking: HTTP, defect)

x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla14
Tracking Status
e10s m8+ ---
firefox46 --- fixed

People

(Reporter: jduell.mcbugs, Assigned: dragana)

References

Details

Attachments

(3 files, 4 obsolete files)

Attached patch Fix getlocal functions (obsolete) — Splinter Review
Minor followup. Details in bug 648878 comment 13 and later.  Opening this just so we stick to "one bug, one patch."  +r on patch copied from there.
Attachment #539226 - Flags: review+
Backed out in http://hg.mozilla.org/integration/mozilla-inbound/rev/d8941c23dcc7 due to causing a high frequency orange on Linux debug in test_traceable_channel_wrap.js.
Whiteboard: [inbound]
Problem was a one-liner--JS object that received observer notifications needed to be global so it didn't go out of scope by time observer tried to notify it.

https://hg.mozilla.org/integration/mozilla-inbound/rev/a2d511a124d7
OK, here's another go at it, with a couple possible fixes, and better info about what call may be failing.

  https://hg.mozilla.org/integration/mozilla-inbound/rev/fd30aea30b03

I hate to use inbound as my try server here, but I've tried running test_traceable_channel_wrap.js on regular tryserver 200 times, and haven't seen an error.  Any hypotheses about how the inbound buildbots may differ from try would be welcome--I'm fumbling in the dark here.  I did try building with python 2.5 vs 2.7, but that's not it.
Backed out yet again, but I have a better idea which tree to bark up--error happening only with local addrs/ports not remote ones.
Target Milestone: --- → mozilla14
  https://hg.mozilla.org/integration/mozilla-inbound/rev/9899522b6ad9

Yet another go-round on inbound, this time with better diagnostic info so we can see why this only seems to fail on inbound and not try (or my linux box).

When this gets backed out please mention if the log contains either/both of these messages:

  Parent: socket has multiple address families!
  Child: socket has multiple address families!
It was backed out, but the test timed out this time so there was no debug output at all.
philor backed this out:
 https://hg.mozilla.org/integration/mozilla-inbound/rev/e81b56331d42

Aside: He also rightly pointed out that, since you already knew this was going to need backing out (per your commit message "Yes this will get backed out"), you actually could have backed it out yourself right away (or rather, as soon as builds were scheduled), instead of knowingly letting it fester for someone else (someone volunteering his time, in this case) to catch later. :)  You still would have gotten a full run of tests, which it looks like you were looking for.

As for the logs -- as jdm said, they were all timeouts, sadly.  For reference, the log URLs are:
  https://tbpl.mozilla.org/php/getParsedLog.php?id=10052061&tree=Mozilla-Inbound
  https://tbpl.mozilla.org/php/getParsedLog.php?id=10052296&tree=Mozilla-Inbound
  https://tbpl.mozilla.org/php/getParsedLog.php?id=10052473&tree=Mozilla-Inbound
  https://tbpl.mozilla.org/php/getParsedLog.php?id=10050949&tree=Mozilla-Inbound
  https://tbpl.mozilla.org/php/getParsedLog.php?id=10051915&tree=Mozilla-Inbound
I'm punting on this bug for now: e10s isn't important enough at the moment to wade into debugging this.  For the record, I'm starting to suspect that the root cause here is the same as bug 661158, i.e. some kind of intermittent IPDL serialization error, except this time on our linux buildbots.

Here's a run-down of the issue:

- The symptom: in test_traceable_channel_wrap.js we occasionally can get the socket's remote host/port but not the local addr/port.  This should not be possible--if a load came from cache we'd see neither local/remote, but the cache is turned off for xpcshell (and the test would fail in that case too, yet it never does on non-e10s builds). 

- The only place HttpBaseChannel::mSelfAddr gets set on the parent is during nsHttpChannel::OnTransportStatus, and it always either sets both local/remote, or neither (if the request is from cache).  As discussed in bug 661158 comment 17, there's no chance of a race existing between when we set these and when we send them to the child in HttpChannelParent::SendOnStartRequest.

- One big difference between this and bug 661158: instead of the socket addr being set to family PR_AF_LOCAL (totally nonsensical), it's set to PR_AF_UNSPEC (aka, 0:  also the default value set in the constructor).  That's why this isn't a crash.  It may also mean that I'm just missing some way that the value might remain at the default, but I don't think so (we're setting it in HttpChannelChild::OnStartRequest immediately before we call the test's OnStartRequest that checks the value).

- I'm seeing this only in e10s, and only on our linux buildbots (I've modified test_traceable_channel.js so I can run the logic 1000's of times in one run, and haven't seen it on my own linux box, nor on the tryservers).

Meanwhle I have two ideas to further debug this:  1) make sure we don't ever see a set/unset local/remote addr pair in the parent, before we send to the child; and 2) reorder the parameters in SendOnStartRequest to send the addr params first, so we can isolate if some other parameter is causing the problem.
Orig fix with a couple updates (PRBool->bool).  Still runs into occasional crash on buildbots.
Attachment #539226 - Attachment is obsolete: true
Attachment #610620 - Flags: review+
Fixes JS observer to stay in scope throughout test.  Adds better checks to see which getLocal functions are failing.
Sanity check that local/remote socket families are the same at OnStartReq time.
Not working on this until e10s is more relevant again.
Assignee: jduell.mcbugs → nobody
...and cjones has let me know that B2G is using e10s, so this is relevant again.
Assignee: nobody → jduell.mcbugs
Blocks: 1217362
Dragana, could you take this bug?  For starters try just unbitrotting the patch and see if it still crashes occasionally on try.  IIRC it was quite mysterious as to why it was happening.

We need this for things like Beacons on e10s (see bug 1217362)
Flags: needinfo?(dd.mozilla)
I will take it.
Flags: needinfo?(dd.mozilla)
Assignee: jduell.mcbugs → dd.mozilla
Status: NEW → ASSIGNED
Attached patch bug_664163.patch (obsolete) — Splinter Review
Attachment #610620 - Attachment is obsolete: true
Attachment #8681198 - Flags: review?(jduell.mcbugs)
Attached patch bug_664163.patch (obsolete) — Splinter Review
Attachment #8681198 - Attachment is obsolete: true
Attachment #8681198 - Flags: review?(jduell.mcbugs)
Attachment #8681201 - Flags: review?(jduell.mcbugs)
Jason, what's the status of this patch?
Flags: needinfo?(jduell.mcbugs)
Attached patch v4Splinter Review
rebased and fixed a couple nits.  Seems fine on try:

   https://treeherder.mozilla.org/#/jobs?repo=try&revision=1f8cc452b675&selectedJob=14756362
Flags: needinfo?(jduell.mcbugs)
Attachment #8700336 - Flags: review+
Attachment #8681201 - Attachment is obsolete: true
Attachment #8681201 - Flags: review?(jduell.mcbugs)
https://hg.mozilla.org/mozilla-central/rev/4dfe96e8f602
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.