Closed Bug 1136484 Opened 9 years ago Closed 9 years ago

Instagram/YouTube pages incredibly slow to load

Categories

(Core :: Networking: HTTP, defect)

defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla39
Tracking Status
firefox36 --- unaffected
firefox37 + wontfix
firefox38 + fixed
firefox39 + fixed
firefox-esr31 - unaffected
relnote-firefox --- 39+

People

(Reporter: u408661, Assigned: u408661)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

The last few times I've tried to load an Instagram page for viewing (not logged in), it has taken seemingly forever to load - over a minute, sometimes into the 2+ minute range. A quick initial investigation shows that two head scripts, apparently served off Akamai, are causing this - both spend over a minute in "connecting" according to the devtools network timeline.

These 2 scripts are the first resources to be loaded from the hostname in question, and the host (instagramstatic-a.akamaihd.net) is speaking spdy/3.1. I'm currently trying to track down a good build to try to figure out when we regressed this (if it is, in fact, our issue). I will also get a pcap to see what we're doing on the wire in case that provides any clues.

For the record, I've seen this behavior both on my mac and on my nexus 5 (multiple networks in that case, one of which is the same as the network the mac is on). I do not see this behavior using chrome on either device.
Further investigation indicates this is ipv6-related - my pcap shows a ton of SYNs sent over ipv6 with no response. Eventually, after 90 seconds, we fall back to ipv4, and everything goes smoothly. I have no problem connecting to other ipv6 sites on this network. I can even reproduce the behavior using telnet:

telnet -6 instagramstatic-a.akamaihd.net 443 # times out
telnet -6 todesschaf.org 443 # connects instantly

So it's obvious that something is going wrong server-side that's causing IPv6 to simply not respond at all. We should talk to Instagram/Akamai about that. However, we should probably behave better in a situation like this (Chrome obviously does).

Honza, IIRC, you've done some work with our ipv6 stuff, do you have any ideas on where the issue could be on our end?
Flags: needinfo?(honzab.moz)
Can you find out where exactly the connections to instagramstatic-a.akamaihd.net are created?  I think that speculative connections don't use the "happy-eyeballs" thing we have.  See nsHttpConnectionMgr::nsHalfOpenSocket::SetupBackupTimer.  I think the !mTransaction->IsNullTransaction() check doesn't pass and we simply don't do the second backup connection.

Backup connections are ipv4-only and start 250ms after the regular attempt, which was ipv4/6 enabled.
Flags: needinfo?(honzab.moz)
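
A minimal, self-contained sketch of the guard described above (the Transaction and HalfOpenSocket types here are illustrative stand-ins, not the actual Necko classes): a half-open driven by a null (speculative) transaction never arms the ~250ms ipv4-only backup timer, so nothing races a hung ipv6 connect.

#include <iostream>

struct Transaction {
    bool isNull = false;                         // true for speculative "null" transactions
    bool IsNullTransaction() const { return isNull; }
};

struct HalfOpenSocket {
    Transaction* mTransaction = nullptr;
    bool mBackupTimerArmed = false;

    void SetupBackupTimer() {
        // Mirrors the shape of the check discussed above: a null transaction
        // skips the backup (ipv4-only) attempt entirely, so a hung ipv6
        // connect has no happy-eyeballs fallback racing it.
        if (mTransaction && !mTransaction->IsNullTransaction()) {
            mBackupTimerArmed = true;            // the real code arms a timer here
        }
    }
};

int main() {
    Transaction speculative;
    speculative.isNull = true;
    HalfOpenSocket halfOpen;
    halfOpen.mTransaction = &speculative;
    halfOpen.SetupBackupTimer();
    std::cout << std::boolalpha << halfOpen.mBackupTimerArmed << "\n";  // prints: false
    return 0;
}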
BTW, for me it's indeed different, since I'm located in the EU:

C:\>nslookup
> instagramstatic-a.akamaihd.net
Server:  UnKnown
Address:  fe80::1

Non-authoritative answer:
instagramstatic-a.akamaihd.net  canonical name = instagramstatic-a.akamaihd.net.edgesuite.net
> instagramstatic-a.akamaihd.net.edgesuite.net
Server:  UnKnown
Address:  fe80::1

Non-authoritative answer:
instagramstatic-a.akamaihd.net.edgesuite.net    canonical name = a1168.dsw4.akamai.net
> a1168.dsw4.akamai.net
Server:  UnKnown
Address:  fe80::1

Non-authoritative answer:
a1168.dsw4.akamai.net   AAAA IPv6 address = 2001:5001:100:1c::c31b:f273
a1168.dsw4.akamai.net   AAAA IPv6 address = 2001:5001:100:1c::c31b:f27b
a1168.dsw4.akamai.net   internet address = 23.62.237.86
a1168.dsw4.akamai.net   internet address = 23.62.237.94
>


C:\>ping 2001:5001:100:1c::c31b:f273

Pinging 2001:5001:100:1c::c31b:f273 with 32 bytes of data:
Reply from 2001:5001:100:1c::c31b:f273: time=163ms
Reply from 2001:5001:100:1c::c31b:f273: time=127ms
Reply from 2001:5001:100:1c::c31b:f273: time=121ms
Reply from 2001:5001:100:1c::c31b:f273: time=185ms

Ping statistics for 2001:5001:100:1c::c31b:f273:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 121ms, Maximum = 185ms, Average = 149ms

C:\>ping 2001:5001:100:1c::c31b:f27b

Pinging 2001:5001:100:1c::c31b:f27b with 32 bytes of data:
Reply from 2001:5001:100:1c::c31b:f27b: time=144ms
Reply from 2001:5001:100:1c::c31b:f27b: time=130ms
Reply from 2001:5001:100:1c::c31b:f27b: time=105ms
Reply from 2001:5001:100:1c::c31b:f27b: time=135ms

Ping statistics for 2001:5001:100:1c::c31b:f27b:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 105ms, Maximum = 144ms, Average = 128ms

C:\>ping 23.62.237.86

Pinging 23.62.237.86 with 32 bytes of data:
Reply from 23.62.237.86: bytes=32 time=9ms TTL=58
Reply from 23.62.237.86: bytes=32 time=10ms TTL=58
Reply from 23.62.237.86: bytes=32 time=7ms TTL=58
Reply from 23.62.237.86: bytes=32 time=7ms TTL=58

Ping statistics for 23.62.237.86:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 7ms, Maximum = 10ms, Average = 8ms


Teredo's off..
So these connections are definitely not coming from the predictor's speculative connection implementation - this behavior has been seen on totally fresh profiles, as well as my usual profile. I'll take a peek at the page's markup and a socket log on a fresh profile to see if there's any indication that we're hitting some other speculative connection path.

The v6 IPs I'm being handed here in CA definitely have some issues... I can't even ping them, unlike your successful attempts. So there's that.

We should still be behaving better in this case, though, whether or not these are speculative connections. We should probably make sure, at a minimum, that (1) we don't try to use a speculative connection that hasn't successfully connected (we may or may not do this already, I have to check), and perhaps (2) we time out speculative connection attempts after some small number of seconds (certainly less than the 90 it takes my machine to time out the failed IPv6 attempts).
(In reply to Nicholas Hurley [:hurley] from comment #4)
> We should probably make sure, at a
> minimum, that (1) we don't try to use a speculative connection that hasn't
> successfully connected (we may or may not do this already, I have to check),

I think we want the opposite, because the alternative is a new connection - and an in-progress connection is closer to being ready than a new connection.

If the speculative connection is missing something, we ought to fix that.

> and perhaps (2) we timeout speculative connection attempts after some small
> number of seconds (certainly less than the 90 it takes my machine to timeout
> the failed IPv6 attempts).

Established connections are timed out much faster than reused connections when waiting for a transaction - but that's a timer based on when the socket is ready. I don't see why we should give them less time to connect because of their speculative nature (though it sounds like they may have a bug to be fixed that makes them time out slower than traditional connects in an HE scenario).
(In reply to Patrick McManus [:mcmanus] from comment #5)

So yeah, I agree on the tactic - basically I was trying to say that we should have some mechanism to recover from speculative connections (assuming that's the issue here) that are taking way too long. I think we can all agree that waiting on a speculative connection for 90 seconds (that ultimately fails!) is the Wrong Thing. My details may not have been the best, but the end goal is the same.

I'll take this on, since I seem to be the only one (so far) who can reproduce :)
Assignee: nobody → hurley
OK, some further investigation reveals that this is indeed a speculative connection coming from the channel's "open a speculative connection while we wait on the cache" logic, and we don't set up the backup timer because that connection is being driven by a null transaction. This appears to be a regression introduced by bug 1003448, which uses a special case of a null transaction to validate altsvc advertisements. It seems the reasoning there is that, since we're just testing to see if an altsvc advertisement is reasonable, we shouldn't care if the first attempted connection fails. I'm not entirely sure that's valid reasoning (I should have caught this during review), given it could be the case that an altsvc advertisement resolves to multiple IPs, one of which happens to be down at the time we try to validate. However, it's not an especially harmful assumption in the altsvc case, either. This does, however, break our happy eyeballs implementation in quite a few possible cases.

In my view, we're left with two options - (1) remove the prohibition on backup timers from null transactions entirely, which includes allowing them to fire even for altsvc transactions, or (2) keep the prohibition on backup timers for altsvc transactions, but allow them for other variants of null transactions. My leaning is to go with (1), but Patrick may have some other info from writing the altsvc implementation that indicates that (2) is the way to go.

Patrick, does it feel like my inclination above is correct?
Flags: needinfo?(mcmanus)
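
To make the two options concrete, here is a toy sketch (TransactionKind and these helper names are hypothetical, not Necko API, and this is not the actual patch): option (1) arms the backup timer for every half-open, altsvc included; option (2) only keeps the prohibition for the altsvc-validation flavor of null transaction. Either one would arm the timer for the speculative connection in this bug.

#include <iostream>

enum class TransactionKind { Normal, SpeculativeNull, AltSvcValidationNull };

// Option (1): always allow the ~250ms ipv4 backup timer, altsvc included.
bool ShouldArmBackupTimerOption1(TransactionKind) { return true; }

// Option (2): keep skipping it only for altsvc-validation null transactions.
bool ShouldArmBackupTimerOption2(TransactionKind kind) {
    return kind != TransactionKind::AltSvcValidationNull;
}

int main() {
    const TransactionKind kind = TransactionKind::SpeculativeNull;  // the case in this bug
    std::cout << std::boolalpha
              << ShouldArmBackupTimerOption1(kind) << " "   // true
              << ShouldArmBackupTimerOption2(kind) << "\n"; // true
    return 0;
}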
Quick reference - it's not 100% as straightforward as just removing the "&& !mTransaction->IsNullTransaction()", because that causes a crash on the socket thread, but the page *does* load significantly quicker just before the crash occurs :)
Aaaand found the crash - I missed an assert that we weren't setting up a backup socket on the null transaction :)
(In reply to Nicholas Hurley [:hurley] from comment #9)
> Aaaand found the crash - I missed an assert that we weren't setting up a
> backup socket on the null transaction :)

I suspect there is a reason we don't do that (and therefore a reason for the assert)... any ideas? I need a chance to research it, and my hands are occupied with meetings for a while.

But on the big picture - that we want to remove the restriction - yes, that makes sense.
Flags: needinfo?(mcmanus)
In connectionmgr::onoutputstreamready you'll find a test on the null transaction where it dispatches mTransaction (the bit that talks about null transactions not being able to be put in the entry queue) - so the assumption there is that there is no backup connection that would cause the same mTransaction to be dispatched twice.

Seems like a bool in the half-open could fix it.
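
A minimal sketch of that suggestion, with a hypothetical flag name (the actual patch may differ): the half-open remembers whether mTransaction has already been dispatched, so when both the primary and the backup socket become writable, only the first one gets to dispatch it.

#include <iostream>

struct HalfOpenSocket {
    bool mDispatchedMTransaction = false;   // hypothetical name for the suggested bool

    // Called when either the primary or the backup socket becomes writable.
    void OnSocketWritable(const char* which) {
        if (!mDispatchedMTransaction) {
            mDispatchedMTransaction = true;
            std::cout << "dispatching mTransaction on the " << which << " socket\n";
        } else {
            // The other socket already took the transaction; this one can be
            // handed back to the pool (or discarded) instead of dispatching twice.
            std::cout << which << " socket ready, but mTransaction was already dispatched\n";
        }
    }
};

int main() {
    HalfOpenSocket halfOpen;
    halfOpen.OnSocketWritable("backup");    // the ipv4 backup wins the race
    halfOpen.OnSocketWritable("primary");   // the late ipv6 connect does not double-dispatch
    return 0;
}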
Attached patch patch — Splinter Review
I'm pretty sure this is what you were suggesting to do with the bool in the half-open, but I could be totally wrong there. It definitely solves my instagram problem, though (finally, I can view people's hipster-filtered pictures without waiting a minute and a half!)

Try is happy with it, nonetheless https://treeherder.mozilla.org/#/jobs?repo=try&revision=a4f76127e38f
Attachment #8573346 - Flags: review?(mcmanus)
Comment on attachment 8573346 [details] [diff] [review]
patch

Review of attachment 8573346 [details] [diff] [review]:
-----------------------------------------------------------------

I'm sorry that I forgot this was in my queue. lgtm.
Attachment #8573346 - Flags: review?(mcmanus) → review+
Blocks: MSE
Summary: Instagram pages incredibly slow to load → Instagram/YouTUbe pages incredibly slow to load
[Tracking Requested - why for this release]: This is a blocker for YouTube/MSE
Summary: Instagram/YouTUbe pages incredibly slow to load → Instagram/YouTube pages incredibly slow to load
https://hg.mozilla.org/mozilla-central/rev/d445a5fb748d
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
Given that this is a blocker for MSE, can you please request uplift for Beta and Aurora?
Flags: needinfo?(hurley)
can someone help me understand how this is related to MSE? (are we using different yt servers with v6 properties perhaps?)
I find myself wondering the same thing... how is this related to MSE? For example, on the exact same ipv6-enabled connection where I had the instagram issues that caused me to file (and fix) this bug, I have never seen any issues with MSE on youtube. Furthermore, I took a look at the original pcap in bug 1135078 and didn't see any obvious happy eyeballs issues (all the ipv6 connections I saw in that pcap succeeded just fine, with no delays). Others seem to have provided pcaps with happy eyeballs issues, though, so perhaps there are two bugs going on with youtube, one being this bug, and yet another being the original report over in bug 1135078?
Flags: needinfo?(hurley)
(In reply to Nicholas Hurley [:hurley] from comment #21)
> Others seem to have provided pcaps with happy eyeballs issues, though, so

what I'm trying to figure out is how any happy eyeball bug is related to MSE.. to youtube I can understand, but why MSE and not the current yt approach?
(In reply to Patrick McManus [:mcmanus] from comment #20)
> can someone help me understand how this is related to MSE? (are we using
> different yt servers with v6 properties perhaps?)

This is a question for Anthony.
Flags: needinfo?(ajones)
(In reply to Patrick McManus [:mcmanus] from comment #22)
> (In reply to Nicholas Hurley [:hurley] from comment #21)
> > Others seem to have provided pcaps with happy eyeballs issues, though, so
> 
> what I'm trying to figure out is how any happy eyeball bug is related to
> MSE.. to youtube I can understand, but why MSE and not the current yt
> approach?

Just for the record, I was agreeing with you on this (and also pointing out that there may, in fact, be two bugs at work here).
Dragana - why is bug 1135078 a duplicate of this?
Flags: needinfo?(dd.mozilla)
It is the same problem.
The log I received in https://bugzilla.mozilla.org/show_bug.cgi?id=1135078#c51 shows that, and so do the tcpdumps submitted in the bug.
The IPv6 address on the server does not work and we should fall back to ipv4 (happy eyeballs), but we do not.
Happy eyeballs is not started, so we wait for the connection to the ipv6 address to time out before trying again.
It's the same problem as in this bug.

The tcpdump for IE shows that it tries a second connection, to the ipv4 address, right after the first one (happy eyeballs), and therefore it works better in IE.
Flags: needinfo?(dd.mozilla)
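
For anyone unfamiliar with the term, here is a toy, self-contained model of the happy-eyeballs behavior being described (simulated connects only, no real sockets; the 250ms figure is the backup-timer delay mentioned earlier in this bug): the ipv6 attempt gets a short head start, and if it has not completed, an ipv4 backup is started in parallel, so a dead ipv6 path costs a fraction of a second instead of a full connect timeout.

#include <chrono>
#include <future>
#include <iostream>
#include <string>
#include <thread>

using namespace std::chrono_literals;

// Pretend connect(): returns the address family after a simulated delay.
std::string SimulatedConnect(const std::string& family, std::chrono::milliseconds delay) {
    std::this_thread::sleep_for(delay);
    return family;
}

int main() {
    // Primary attempt: pretend the ipv6 path is unresponsive for a long time.
    auto v6 = std::async(std::launch::async, SimulatedConnect, "ipv6", 2000ms);

    // Give the primary attempt a ~250ms head start, like the backup timer.
    if (v6.wait_for(250ms) == std::future_status::ready) {
        std::cout << "connected via " << v6.get() << "\n";
        return 0;
    }

    // Primary has not connected yet: start the ipv4 backup in parallel.
    // In this toy model the backup answers quickly and wins the race.
    auto v4 = std::async(std::launch::async, SimulatedConnect, "ipv4", 30ms);
    std::cout << "connected via " << v4.get() << "\n";

    v6.wait();  // let the simulated hung attempt finish before exiting
    return 0;
}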
Lawrence - the issue is that YouTube's IPv6 stuff is shonky but our fallback wasn't working well. This bug fixes our fallback path but it would still be nice for YouTube to fix their IPv6 stuff as well.
Flags: needinfo?(ajones)
(In reply to Dragana Damjanovic [:dragana] from comment #26)
> It is the same problem.
> The log I received in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1135078#c51 shows that, and so
> do the tcpdumps submitted in the bug.
> The IPv6 address on the server does not work and we should fall back to
> ipv4 (happy eyeballs), but we do not.
> Happy eyeballs is not started, so we wait for the connection to the ipv6
> address to time out before trying again.
> It's the same problem as in this bug.
> 
> The tcpdump for IE shows that it tries a second connection, to the ipv4
> address, right after the first one (happy eyeballs), and therefore it works
> better in IE.

This doesn't entirely jibe with my viewing of https://bugzilla.mozilla.org/attachment.cgi?id=8575425 (one of the pcaps originally attached to the bug) - there were plenty of ipv6 connections, to youtube addresses, that had zero problems connecting. To be sure, there were ipv6 connections in that pcap that had happy eyeballs problems, but none of them appear to be to google/youtube networks. This is why I'm wondering if there aren't two bugs at work here.
For Lawrence's benefit: it certainly seems like there is an HE issue sometimes with YT that this fixes. But I haven't seen any reason to suggest that it is tied to EME instead of YT in general. I'm making a point of this because I think there is a small general networking risk involved in backporting the patch.

(Oh, and it might not be a matter of YT fixing their v6 - it's likely a problem on the path somewhere, often at the reporter's house :))
At this point in the schedule I don't want to take this fix unless we think we absolutely have to. It's not clear to me from the recent comments whether that is the case. Sorry for the triple ni but I need some guidance here.
Flags: needinfo?(mcmanus)
Flags: needinfo?(hurley)
Flags: needinfo?(ajones)
Do we have a YT-related STR? If so, get the 2x2 matrix of patch and MSE and see if this really blocks MSE. I don't have v6 available to me, so I can't do that. It's not obvious to me why it would be linked to MSE, but like I said before there might be some info I don't have (e.g. MSE uses different servers, or makes more connections so it's more pronounced, or ...?) - but my inclination is that it was simply present in some of the MSE traces and would also have been present in non-MSE traces too.

I'm in a bit of an awkward place saying that there is some risk in the backport - because obviously I think the patch is correct or I wouldn't have r+'d it. But at the same time there is some complexity in the code that it is touching, and it's only going to be completely exercised by a combination of timing and network-error conditions no matter what testing we try (that's the nature of the code).

So if it is just fixing a long-standing problem that is independent of MSE, let's just backport it one channel instead of two. But if they are linked, then let's go for the full backport.
Flags: needinfo?(mcmanus)
mcmanus said it much better than I could have, so I'll just +1 his response :)
Flags: needinfo?(hurley)
At this point we're not going to take this fix in 37 unless someone tells me why it is a stop-ship bug. I have marked 37 as wontfix.

Once this has been on m-c for a few more days we can consider this for uplift to 38. If you can provide some guidance on what should be tested to ensure that this change has not introduced regressions, I can ask our QA team to perform some exploratory testing on the change.
Nicholas, if you think we should have this in 38, could you fill in the uplift request (aurora)? Thanks
Flags: needinfo?(hurley)
Flags: needinfo?(ajones)
Comment on attachment 8573346 [details] [diff] [review]
patch

Approval Request Comment
[Feature/regressing bug #]: bug 1003448
[User impact if declined]: requests that initially go to down/slow ipv6 servers will take approximately 90 seconds to fall back to ipv4 (and thus to get a response)
[Describe test coverage new/current, TreeHerder]: on m-c for about a week, manual testing against instagram
[Risks and why]: small risk of general networking issues due to the timing and network condition-related nature of this bug and fix
[String/UUID change made/needed]: none
Flags: needinfo?(hurley)
Attachment #8573346 - Flags: approval-mozilla-aurora?
Comment on attachment 8573346 [details] [diff] [review]
patch

Maybe we should write a non-regression test for this, if possible?
Attachment #8573346 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Release Note Request (optional, but appreciated)
[Why is this notable]: A nice improvement for users
[Suggested wording]: Improve performance of ipv6 fallback to ipv4
[Links (documentation, blog post, etc)]:

If you have a suggestion to improve the wording of the note, please comment and n-i me. "Makes Instagram not be slow" seemed inelegant.
We don't have IPv6 enabled for our network and it's not worth setting it up, especially since this was already verified by Nicholas. Removing the qe-verify flag.
Flags: qe-verify+