Closed Bug 1374510 Opened 7 years ago Closed 5 years ago

videos on twitter get stuck returning 404s

Categories

(Core :: Networking, defect, P2)

defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
platform-rel --- -

People

(Reporter: bkelly, Unassigned)

References

Details

(Whiteboard: [platform-rel-Twitter] [necko-next])

Attachments

(1 file)

Attached file log.txt-main.1716.gz
Many people have noticed lately on twitter that GIFs (converted to mp4 by twitter) start showing "Media could not be played" on firefox.  The same tweet will load and play just fine on chrome.  It will even load and play just fine in another instance of firefox.  No matter how many times the tweet is reloaded, though, it will not receive a different status code in the current profile instance.

Its unclear to me what is going on.  I'm putting this in necko because the request shows up in netmonitor and gives a 404.  I guess its possible its audio/video code somehow doing the caching as well.

I've attached an http log of the failed load.
platform-rel: --- → ?
Whiteboard: [platform-rel-Twitter]
In the log, I do not see anything strange. The server returns 404. We should contact twitter about this.
(In reply to Dragana Damjanovic [:dragana] from comment #1)
> In the log, I do not see anything strange. The server returns 404. We should
> contact twitter about this.

We have and they are looking at an issue in their CDN.

The reason I raised this, though, is that it seems the problem affects FF more than other browsers.  And once a particular browser instance hits it we seem to require a restart to fix it.

Could this be very long DNS caching keeping us on a bad CDN node while other browsers might re-evaluate DNS more frequently?
Blake - can we propagate the 404 to the UI to make it more obvious from the outside that it is a broken CDN?
Flags: needinfo?(bwu)
I'm getting this right now in my current nightly session.  Browser console shows:

"HTTP load failed with status 404. Load of media resource https://video.twimg.com/tweet_video/DDBfPlsW0AI_4lC.mp4 failed."

When trying to open:

https://twitter.com/anatudor/status/878306682509635584

If I load the twitter URL or the mp4 URL in a different container it loads fine.  If I load either one in the same container as my original twitter session I get a 404.

Any ideas why the container is effecting this?  I thought http cache was cross-origin.
Do we have another level caching in the media layer thats based on principal?
AFAICT that browser console message suggests its necko that is originating the 404 error code:

http://searchfox.org/mozilla-central/source/dom/html/HTMLMediaElement.cpp#535
Andrea, do you have any ideas why a necko channel in a different container would get a different response?  AFAICT there is no CacheStorage data for the page, so I don't think a service worker is changing things.  (Also hard refresh doesn't help.)

Is there some cache that keys on the container principal?
Flags: needinfo?(amarchesini)
(In reply to Anthony Jones (:kentuckyfriedtakahe, :k17e) from comment #3)
> Blake - can we propagate the 404 to the UI to make it more obvious from the
> outside that it is a broken CDN?
Yes. It should doable. I have created bug 1376004 to further check.
Flags: needinfo?(bwu)
from the log that i looked at, necko has sent a request to a server and 404 was the response. From the log, it also checks the cache and finds 404 in the cache and goes to the network.

Do we send different headers, do requests look different? are responses that are not 404 served from the cache or from the network?
¡Hola Ben!

I'm unable to make https://twitter.com/i/videos/tweet/878002514389606400 play on Mozilla/5.0 (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0 ID:20170623115718 CSet: 594cc32b632396a867ef1f98428968b224d82151 so commenting here before filing another bug.

This is what I see in the "Browser Console" (sorry about the es-MX):

[SW] on message <unavailable>  js/lib/push_worker.js:83
Content Security Policy: Ignorar ‘x-frame-options’ por la directiva ‘frame-ancestors’.  (desconocido)
Content Security Policy: La directiva 'frame-src' está desaprobada. Usa la directiva 'child-src' en su lugar.  (desconocido)
Content Security Policy: Las opciones de esta página han bloqueado la carga de un recurso en self ("script-src https://twitter.com http://localhost:* http://localhost.twitter.com:* https://*.twitter.com https://*.twimg.com https://vine.co https://*.vine.co https://*.periscope.tv https://*.pscp.tv"). Source: onfocusin attribute on DIV element.  878002514389606400
Content Security Policy: Las opciones de esta página han bloqueado la carga de un recurso en self ("script-src https://twitter.com http://localhost:* http://localhost.twitter.com:* https://*.twitter.com https://*.twimg.com https://vine.co https://*.vine.co https://*.periscope.tv https://*.pscp.tv"). Source: call to eval() or related function blocked by CSP.  build.min.js:148
[SW] on message <unavailable>  js/lib/push_worker.js:83

Am I seeing this bug or a completely different one?

¡Gracias!
Alex
Flags: needinfo?(bkelly)
I think those console messages are unrelated.  Look for a 404 in the network monitor for the MP4.
Flags: needinfo?(bkelly)
See Also: → 1376067
Ban, can you still reproduce this? do you know answers to the questions from comment 9.
Flags: needinfo?(bkelly)
Unfortunately I had restarted my session by the time I saw the questions.  If it happens again I will check.
Flags: needinfo?(bkelly)
Whiteboard: [platform-rel-Twitter] → [platform-rel-Twitter] [necko-next]
Ben, I think your DNS theory may be correct.  Unfortunately the log is too short, but I can confirm the 404 comes from the server on an h2 session, which is something we may keep literally forever during a browser session if it seems fit.  If the resources have been moved, we would still try to reach them on the same physical server thanks the long-lasting h2 session connection.  Hence, this is not something DNS related, IMO.

When you've opened a new container window we have created a new session to probably a different (updated IP address) because connections and h2 sessions are isolated by origin attributes (according my knowledge, but Patrick would know better how much we coalesce here.)  about:networking would say better!  Try to keep it mind the next time to look there in case of these problems ;)
What would I look for in about:networking to tell if its the h2 persistent connection issue?

Also, the last time this happened I managed to recover from it by hitting ctrl-shift-r for a hard reload.  Is that expected to start a new h2 connection?

Thanks!
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #15)
> What would I look for in about:networking to tell if its the h2 persistent
> connection issue?

First, you have to find out if the host is h2, it's in the HTTP section.

Second, you need to find out the host->IP as firefox sees it, under DNS.

Third, check there is a socket to that IP under Sockets to confirm we connect to that host via that IP.

Then you can compare what system nslookup call (uncached) returns for the DNS name.  If it differs and Ctrl-F5 helps, it confirms the dns change theory.

> 
> Also, the last time this happened I managed to recover from it by hitting
> ctrl-shift-r for a hard reload.  Is that expected to start a new h2
> connection?

Yes.

> 
> Thanks!
I just experienced this again on https://twitter.com/TPS_ParkingPal/status/882249990097260544. I pressed Ctrl-Shift-r, got the initial frame (or whatever) of the video, then got the black box with "The media could not be played." In the console I see "HTTP load failed with status 404. Load of media resource https://video.twimg.com/tweet_video/DD5hrtRXUAIZe30.mp4 failed."
(In reply to Andrew Overholt [:overholt] from comment #17)
> I just experienced this again on
> https://twitter.com/TPS_ParkingPal/status/882249990097260544. I pressed
> Ctrl-Shift-r, got the initial frame (or whatever) of the video, then got the
> black box with "The media could not be played." In the console I see "HTTP
> load failed with status 404. Load of media resource
> https://video.twimg.com/tweet_video/DD5hrtRXUAIZe30.mp4 failed."

And have you done anything I mentioned in comment 16 before you've pressed Ctrl-Shift-r?
I'm experiencing this problem intermittently as well.  In about:networking I inevitably see under "HTTP" that neither of the 2 "video.twimg.com" connections (I'm assuming one is credentialed and one is not?) are Active (AKA they have an "Active" value of "0").

While attempting to repro, images via pbs.twimg.com just broke the same way (no active connections) and I'm seeing suspicious things in the log that the "spdy" connection has decided that it can use its existing connection to video.twimg.com like it was a pbs.twimg.com connection:

(all of this is on the socket thread)
"""
V/nsHttp nsHttpConnectionMgr::OnMsgSpeculativeConnect [ci=.S.....pbs.twimg.com:443]
V/nsHttp FindCoalescableConnection .S.....pbs.twimg.com:443
...
V/nsHttp joinconnection [0x7f4331042000 .S.....video.twimg.com:443] pbs.twimg.com:.443 result=1 cache
V/nsHttp FindCoalescableConnectionByHashKey() found match conn=0x7f4334e81e50 key=104.244.46.103~.:443/[]viaDNS newCI=.S.....pbs.twimg.com:443 matchedCI=.S.....video.twimg.com:443 join ok
V/nsHttp FindCoalescableConnection(.S.....pbs.twimg.com:443) match conn 0x7f4334e81e50 on dns key 104.244.46.103~.:443/[]viaDNS
V/nsHttp GetSpdyActiveConn() request for ent 0x7f4348326470 .S.....pbs.twimg.com:443 found an active connection 0x7f4334e81e50 in the coalescing hashtable
V/nsHttp FindCoalescableConnection .S.....pbs.twimg.com:443
V/nsHttp nsHttpConnection::SetupSSL 0x7f4334e81e50 caps=0x29 .S.....video.twimg.com:443
D/nsSocketTransport JIMB: ReleaseFD_Locked: mFDref = 2
I/nsHttp JoinConnection 0x7f4331042000 no origin frame check used.
V/nsHttp joinconnection [0x7f4331042000 .S.....video.twimg.com:443] pbs.twimg.com:.443 result=1 cache
V/nsHttp FindCoalescableConnectionByHashKey() found match conn=0x7f4334e81e50 key=104.244.46.103~.:443/[]viaDNS newCI=.S.....pbs.twimg.com:443 matchedCI=.S.....video.twimg.com:443 join ok
V/nsHttp FindCoalescableConnection(.S.....pbs.twimg.com:443) match conn 0x7f4334e81e50 on dns key 104.244.46.103~.:443/[]viaDNS
V/nsHttp GetSpdyActiveConn() request for ent 0x7f4348326470 .S.....pbs.twimg.com:443 found an active connection 0x7f4334e81e50 in the coalescing hashtable
V/nsHttp nsHttpConnectionMgr::RestrictConnections 0x7f4348326470 .S.....pbs.twimg.com:443 restricted due to active >=h2
V/nsHttp OnMsgSpeculativeConnect Transport not created due to existing connection count
...
...
V/nsHttp Spdy Dispatch Transaction via Activate(). Transaction host = pbs.twimg.com, Connection host = video.twimg.com
"""

I have logs from the general time interval, but this is on my main profile.  If the above isn't enough of a smoking gun, I can provide log excerpts here or provide the logs in their entirety privately.
Here's a trace on a GET of https://pbs.twimg.com/media/DEALffQVwAA4ecM.jpg that failed from request to response, hopefully only boring things elided:

(Main Thread)
V/nsHttp nsHttpTransaction::Init [this=0x7f431c52e000 caps=29]
I/nsHttp http request [
I/nsHttp   GET /media/DEALffQVwAA4ecM.jpg HTTP/1.1
I/nsHttp   Host: pbs.twimg.com
I/nsHttp   User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0
I/nsHttp   Accept: */*
I/nsHttp   Accept-Language: en-US,en;q=0.5
I/nsHttp   Accept-Encoding: gzip, deflate, br
I/nsHttp   Referer: https://twitter.com/
I/nsHttp   Connection: keep-alive
I/nsHttp ]

(Socket Thread)
V/nsHttp nsHttpConnectionMgr::TryDispatchTransaction without conn [trans=0x7f431c52e000 halfOpen=(nil) conn=(nil) ci=0x7f4348798d40 ci=.S.....pbs.twimg.com:443 caps=29 tunnelprovider=(nil) onlyreused=0 active=0 idle=0]
V/nsHttp FindCoalescableConnection .S.....pbs.twimg.com:443
V/nsHttp nsHttpConnection::SetupSSL 0x7f4334e81e50 caps=0x29 .S.....video.twimg.com:443
D/nsSocketTransport JIMB: ReleaseFD_Locked: mFDref = 2
I/nsHttp JoinConnection 0x7f4331042000 no origin frame check used.
V/nsHttp joinconnection [0x7f4331042000 .S.....video.twimg.com:443] pbs.twimg.com:.443 result=1 cache
V/nsHttp FindCoalescableConnectionByHashKey() found match conn=0x7f4334e81e50 key=104.244.46.103~.:443/[]viaDNS newCI=.S.....pbs.twimg.com:443 matchedCI=.S.....video.twimg.com:443 join ok
V/nsHttp FindCoalescableConnection(.S.....pbs.twimg.com:443) match conn 0x7f4334e81e50 on dns key 104.244.46.103~.:443/[]viaDNS
V/nsHttp GetSpdyActiveConn() request for ent 0x7f4348326470 .S.....pbs.twimg.com:443 found an active connection 0x7f4334e81e50 in the coalescing hashtable
V/nsHttp    dispatch to spdy: [conn=0x7f4334e81e50]
V/nsHttp nsHttpConnectionMgr::DispatchTransaction [ent-ci=.S.....pbs.twimg.com:443 0x7f4348326470 trans=0x7f431c52e000 caps=29 conn=0x7f4334e81e50 priority=10]
V/nsHttp Spdy Dispatch Transaction via Activate(). Transaction host = pbs.twimg.com, Connection host = video.twimg.com
V/nsHttp nsHttpConnection::Activate [this=0x7f4334e81e50 trans=0x7f431c52e000 caps=29]
V/nsHttp nsHttpConnection::AddTransaction for SPDY
V/nsHttp nsHttpConnectionMgr::AddActiveTransaction    t=0x7f431c52e000 tabid=1800003dd(1) thr=0
V/nsHttp Active transactions +[5,0,3,0]
V/nsHttp nsHttpConnectionMgr::EnsureThrottleTickerIfNeeded
V/nsHttp nsHttpConnectionMgr::IsThrottleTickerNeeded
V/nsHttp   there are unthrottled transactions for both active and bck
I/nsHttp Http2Stream::Http2Stream 0x7f431fb46470
I/nsHttp Http2Session::AddStream session=0x7f4331042000 stream=0x7f431fb46470 serial=3574 NextID=0x21 (tentative)

(Socket Thread)
V/nsHttp nsHttpTransaction::WriteSegments 0x7f431c52e000
I/nsHttp Http2Stream::OnWriteSegment 0x7f431fb46470 count=32768 state=4 0x23
I/nsHttp Http2Session::ResetDownstreamState() 0x7f4331042000
I/nsHttp Http2Session::ChangeDownstreamState() 0x7f4331042000 from 7 to 1
V/nsHttp nsHttpTransaction::ProcessData [this=0x7f431c52e000 count=145]
V/nsHttp nsHttpTransaction::ParseHead [count=145]
V/nsHttp nsHttpTransaction:: Identified HTTP/2.0 treating as 1.x
V/nsHttp nsHttpTransaction::ParseLine [HTTP/2.0 404]
V/nsHttp nsHttpResponseHead::ParseVersion [version=HTTP/2.0 404]
V/nsHttp response status line needs default reason phrase
V/nsHttp Have status line [version=20 status=404 statusText=Not Found]
V/nsHttp nsHttpTransaction::ParseLine [content-type: text/html]
V/nsHttp ParseContentType [type=text/html]
V/nsHttp nsHttpTransaction::ParseLine [date: Fri, 07 Jul 2017 00:10:42 GMT]
V/nsHttp nsHttpTransaction::ParseLine [server: ECAcc (nya/7897)]
V/nsHttp nsHttpTransaction::ParseLine [content-length: 345]
V/nsHttp nsHttpTransaction::ParseLine [X-Firefox-Spdy: h2]
V/nsHttp nsHttpTransaction::HandleContent [this=0x7f431c52e000 count=0]
V/nsHttp nsHttpTransaction::HandleContentStart [this=0x7f431c52e000]
I/nsHttp http response [
I/nsHttp   HTTP/2.0 404 Not Found
I/nsHttp   content-type: text/html
I/nsHttp   date: Fri, 07 Jul 2017 00:10:42 GMT
I/nsHttp   server: ECAcc (nya/7897)
I/nsHttp   content-length: 345
I/nsHttp   X-Firefox-Spdy: h2
I/nsHttp     OriginalHeaders
I/nsHttp   content-type: text/html
I/nsHttp   date: Fri, 07 Jul 2017 00:10:42 GMT
I/nsHttp   server: ECAcc (nya/7897)
I/nsHttp   content-length: 345
I/nsHttp   X-Firefox-Spdy: h2
I/nsHttp ]
V/nsHttp nsHttpTransaction::CheckForStickyAuthScheme this=0x7f431c52e000
V/nsHttp nsHttpConnection::OnHeadersAvailable [this=0x7f4334e81e50 trans=0x7f431c52e000 response-head=0x7f431bc47140]
V/nsHttp Connection can be reused [this=0x7f4334e81e50 idle-timeout=170sec]
V/nsHttp nsHttpTransaction::HandleContent [this=0x7f431c52e000 count=0 read=0 mContentRead=0 mContentLength=345]
Patrick, can you read anything from the last two comments?  Andrew unfortunately still didn't confirm the DNS theory by looking at the socket address and system nslookup result, tho.
Flags: needinfo?(mcmanus)
I have tried to access that image with pbs.twimg.com and video.twimg.com and both return a valid image. Then I have changed my host dns table to resolve pbs.twimg.com and video.twimg.com into 104.244.46.103 and test it again. Firefox do resolve this 2 addresses into that IP address and I can still get that image from both.
I don't see anything in that bug that confirms that persistency is the issue. Both H1 and H2 close after a few minutes of idle. (they differ a little bit, but not by much). The workarounds - hard reloads and container swaps - all involve creating new connections however.

I'm going to speculate (but informed speculation) that twitter is having some trouble with rfc 7540 coalescing on their CDN. Their are several posts on the matter, but when you have two origins that have shared a resolved IP address and have overlapping certificates one connection can be reused for the other origin. We do this a little more aggressively that chrome, though within spec and in a way that is a performance win for twitter users.

The server can deal with this in a few ways

a] instead of returning 404 (which is an authoritative failure - its says the resource does not exist), return 421 (Misdirected Request). 421 tells a coalescing client to silently retry without coalescing. A non coalescing client (or one who simply wasn't coalescing) treats it as a 4xx the same way as a 404.

b] use dns in such a way that hosts which serve different sets of origins do not have overlapping dns sets

c] do not use certificates valid for origins for which the server does not have the content. Returning 404 for a origin for which you are using a valid certificate is an authoritative response that the content does not exist.

[a] is generally the easiest thing for a quick fix.
Flags: needinfo?(mcmanus)
(In reply to Honza Bambas (:mayhemer) from comment #21)
> Patrick, can you read anything from the last two comments?  Andrew
> unfortunately still didn't confirm the DNS theory by looking at the socket
> address and system nslookup result, tho.

Apologies, I started out pursuing the comment 2 DNS theory with the comment 16 advice but found nothing to support it and should have been more specific about that.  :mcmanus' comment 23 seems consistent with what I've observed 

I run Firefox Nightly with 2 concurrent profiles (main and "okay if it breaks").  Whenever the problem occurred in my main profile, I inevitably found that the "HTTP" about:networking list showed that neither "video.twimg.com" connection was active, there would be a DNS listing (usually with 1 IPv4 address and one IPv6 entry).  I would then open that specific tweet page in the other profile and it would successfully load the gif-y mp4.  About 50% of the time, the alternate profile would end up connecting to the same IPv4 address from the main profile's DNS entry.  It experienced no problems.  I would then control-r in the main profile and either (usually) reproduce the 404 or sometimes the mp4 would load and then about:networking "HTTP" would show at least one of the connections was now "Active" where previously it was not.  (Note that I would be refreshing the specific tweet page, not the timeline.)  control-shift-r in the same "main" profile experiencing breakage reliably fixed things, resulting in an active connection with an active socket *with the same DNS entry as the failed case* as far as I could see.

I also did things where I waited for both profiles to let their DNS resolutions time out then attempt a reload in both so that they got the same DNS (which was consistent with my nslookup and dig checks at roughly the same time).  I additionally "pinned" video.twimg.com in my /etc/hosts for some time and was still able to reproduce the problem.  Directly loading the .mp4 in the URL bar in my main profile when experiencing the problem also showed the 404 (although that may have been out of cache).  Note that while DNS lookups would fluctuate a bit, I seemed to generally end up being served from a small set of servers.  At my most diligent logging and reproducing between the two profiles, I got this set of dig results which were perceived by both profiles:

;; ANSWER SECTION:
video.twimg.com.	261	IN	CNAME	cs296.wpc.edgecastcdn.net.
cs296.wpc.edgecastcdn.net. 2416	IN	CNAME	cs2-wpc.apr-8315.edgecastdns.net.
cs2-wpc.apr-8315.edgecastdns.net. 261 IN CNAME	cs2-wpc-us.8315.ecdns.net.
cs2-wpc-us.8315.ecdns.net. 261	IN	CNAME	cs531.wpc.edgecastcdn.net.
cs531.wpc.edgecastcdn.net. 3596	IN	A	192.229.210.163


And this was what about:networking's HTTP and DNS tab returned in a bad case in my main profile:
video.twimg.com	443	true	true	0	0
video.twimg.com	443	true	true	0	0

video.twimg.com	ipv4	192.229.210.163
2606:2800:21f:dc2:1fe1:23fc:954:1461
	65

(no socket entry)


Then after a shift-control-r, same profile:
video.twimg.com	443	true	true	0	0
video.twimg.com	443	true	true	0	0

video.twimg.com	ipv4	192.229.210.163
2606:2800:21f:dc2:1fe1:23fc:954:1461
	114


And in the alternate profile after a soft-refresh resulting in DNS convergence:
video.twimg.com	ipv4	192.229.210.163
	110
video.twimg.com	ipv4	192.229.210.163
2606:2800:21f:dc2:1fe1:23fc:954:1461
	113

socket:
192.229.210.163	443	true	true	548	647329
It's also worth noting that when ctrl-r would fix me in the same profile, it had likely been on the order of minutes that had passed because I was futzing around with about:networking in my main profile and attempting to reproduce in the alternate profile instance, etc.  So it's very possible the coalesced connection itself had been dropped as well.
(In reply to Andrew Sutherland [:asuth] from comment #24)
> Then after a shift-control-r, same profile:
> video.twimg.com	443	true	true	0	0
> video.twimg.com	443	true	true	0	0
> 
> video.twimg.com	ipv4	192.229.210.163
> 2606:2800:21f:dc2:1fe1:23fc:954:1461
> 	114

Shoot, apologies, I just realized this bit is confusing here.  I think this was a case where things were still broken after a shift-control-r, but my notes are ambiguous.  It definitely was the case that all profiles were using the same IP, however.
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P2
platform-rel: ? → -
Honza - can you comment on whether we need more evidence, and what kind in order to close this issue out?
Flags: needinfo?(honzab.moz)
As there are no "me too" or duplicates reported, I believe this has been fixed as an issue on the twitter side.  Patrick's comment 23 and Andrew's comment 24 IMO confirm that it was a coalescing problem (which we are more aggressive at than other browsers - hence Firefox broken, others work) while twitter's CDNs were not using 421 properly.
Flags: needinfo?(honzab.moz)
Resolving WORKSFORME based on comment #30.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME
Flags: needinfo?(amarchesini)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: