Closed Bug 1499307 Opened 6 years ago Closed 6 years ago

Some images are not requested/loaded for pages with a high number of images

Categories

(Core :: Networking, defect, P2)

62 Branch
x86_64
Windows 10
defect

Tracking


RESOLVED DUPLICATE of bug 1486046

People

(Reporter: michael.kroell, Assigned: u408661)

Details

(Whiteboard: [necko-triaged])

Attachments

(4 files)

Attached image 16_09.35.33.png
User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.9 Safari/537.36

Steps to reproduce:

(1) Browse to
https://geizhals.at/?cat=nb&hloc=at&hloc=de

(2) Increase the number of products shown per page to 1000 in the select box (right next to "Blättern:")

OS: Windows 10 Pro, Version 1803, Build 17134.345
Firefox 62.0.3 64-Bit 

But we've seen that behaviour also with 
OS: Debian Linux (buster/sid)
Firefox: 60.2.2esr (64-bit)



Actual results:

Some images fail to load. On each reload a different set of images fails to load, and the alt-text is shown instead. The network console shows that those images are never requested from the server at all.

In our tests it made no difference whether "Disable Cache" was ticked or not in the developer console.


Expected results:

All images should be loaded and displayed every time.
OS: Unspecified → Windows 10
Hardware: Unspecified → x86_64
Component: Untriaged → Networking
Product: Firefox → Core
Please try to reproduce with Firefox Nightly first.  I can't reproduce it locally with the Nightly channel.

If this still persists even with Nightly, I will ask for a few more details:
- by "fail to load" you mean that the page load indicator (the throbber) has already stopped spinning while some (product) images were still missing?
- I will ask you for a log according to [1]; please add the following modules to MOZ_LOG:

imgRequest:5

then upload here and also list URL of images that failed to load.

thanks.

[1] https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging#Start_logging_using_command_line_arguments
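For reference, the logging setup in [1] amounts to setting two environment variables before launching Firefox from a terminal. A minimal sketch, using the default module list plus imgRequest:5 as asked above; the log file path is a placeholder:

```shell
# MOZ_LOG selects the log modules and verbosity; MOZ_LOG_FILE sets the
# output path (rotated logs are written with suffixes next to this path).
export MOZ_LOG=timestamp,rotate:200,nsHttp:5,cache2:5,nsSocketTransport:5,nsHostResolver:5,imgRequest:5
export MOZ_LOG_FILE=/tmp/ff-log
firefox
```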
Flags: needinfo?(michael.kroell)
I can also reproduce the issue on Nightly 64.0a1 (20181016100106) on Linux Mint. (VMPlayer, Guest OS: Linux Mint 19 64-bit, Host OS: Windows 10 Ver 1803 64-bit)

I cannot reproduce the issue if I set MOZ_LOG to timestamp,rotate:200,nsHttp:5,cache2:5,nsSocketTransport:5,nsHostResolver:5,imgRequest:5 (I tried twice but could not reproduce it).

But I can reproduce the issue if I set MOZ_LOG to only imgRequest:5.

(The test was executed after clearing history and cache each time)
(In reply to Alice0775 White from comment #2)
> But, I can reproduce the issue if only set MOZ_LOG to imgRequest:5

Thanks!

Can you please at least do nsHttp:5,imgRequest:5?  And please add image URLs that have failed for you.  Thanks.
Flags: needinfo?(alice0775)
(In reply to Honza Bambas (:mayhemer) from comment #4)
> (In reply to Alice0775 White from comment #2)
> > But, I can reproduce the issue if only set MOZ_LOG to imgRequest:5
> 
> Thanks!
> 
> Can you please at least do nsHttp:5,imgRequest:5?  And please add image URLs
> that have failed for you.  Thanks.

Umm...
I can reproduce the issue if MOZ_LOG is set to imgRequest:5, but cannot reproduce it when set to nsHttp:5,imgRequest:5.

I tried 3 times each; the former reproduced 3 of 3, the latter 0 of 3...
Flags: needinfo?(alice0775)
And with only nsHttp:5, plus the list of URLs that failed?
Flags: needinfo?(alice0775)
Attached file log1of2.tar.bz2
Thanks.  What exactly do you mean by "first 5" and "last 5"?
Flags: needinfo?(alice0775)
(In reply to Honza Bambas (:mayhemer) from comment #9)
> Thanks.  What exactly do you mean by "first 5" and "last 5"?

When I reproduced the issue, many images were missing.
So I picked 5 missing image URLs from the top of the page and 5 missing image URLs from the bottom.

Page top
...
...
missing1
missing2
...
missing3
missing4
missing5
...
...
...
...
missing5
...
missing4
missing3
missing2
missing1
...
...
Page bottom
Flags: needinfo?(alice0775)
Hmm... the URLs you provided are href links to the product pages, not the image URLs that failed to load.

Never mind, I was able to repro locally with the same set of modules you have provided as working for you.
This is an h2 issue or a server issue.  

Nick, please download the log at https://www.janbambas.cz/moz/logs/1499307-hb-1.tar.xz (at the moment still slowly uploading, 12300322 bytes)

Then go to logan (Ctrl-F5 - there were some recent bug fixes!) at https://janbambas.cz/moz/logan/#{"show"%3A[{"name"%3A"nsHttpChannel"%2C"on"%3A2519%2C"clr"%3A"%23ffffb3"}%2C{"name"%3A"nsHttpTransaction"%2C"on"%3A2434%2C"clr"%3A"%23bebada"}%2C{"name"%3A"Http2Stream"%2C"on"%3A2374%2C"clr"%3A"%23fb8072"}]}

And load the log in it.

The stream is closed with 80004004 because we receive a TCP RST.  If you then open the session and scroll up, you'll get to a line `nsHttpConnection::DontReuse 0000027F93A44000 spdysession=0000027F91F14800` because of 804b0014 (net-reset) on the socket.

The question is whether we did something wrong.  If so, this bug should get a patch.  If not, this is another duplicate of bug 1486046.

I am slowly tending toward reopening it, since this is really frustrating...
Assignee: nobody → hurley
Flags: needinfo?(michael.kroell)
Priority: -- → P2
Whiteboard: [necko-triaged]
I prepared a simpler document that can also trigger the problem Michael Kroell initially described; you can find it at https://gzhls.at/imgdemo.html (make sure the UA's cache is cleared after each attempt to render it).

It includes 1500 small images (all with different URIs) that are served by the same HTTPS server the original problem was observed with.

I had a peek at the server's logs while the associated requests were processed. Interestingly, we register exactly 1000 (one thousand, while I would expect exactly 1500; maybe some obscure and artificial limit is at work?) requests with the appropriate "Referer" header set, 100% of which our frontend delivers, apparently successfully, with status code 200.

Neither Varnish (the origin for these cached responses, i.e. the images) nor nginx (the HTTPS termination point) complains about or logs anything abnormal while delivering, while Firefox's developer console shows a large number of requests for image URIs for which no responses were logged whatsoever.
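A test page of the kind described above (many `<img>` tags with distinct URIs) can be generated with a short script. This is a sketch; the `/img/<i>.png` paths and the output filename are placeholders, not the actual image URLs from the reporter's server:

```python
# Generate a test page with 1500 <img> tags, each with a distinct URI,
# to force many requests over a single h2 connection.
# The /img/<i>.png paths are placeholders for real image URLs.
N = 1500
tags = "\n".join(f'<img src="/img/{i}.png" alt="img {i}">' for i in range(N))
page = f"<!doctype html>\n<html><body>\n{tags}\n</body></html>"

with open("imgdemo.html", "w") as f:
    f.write(page)
```

Serving such a page over HTTPS from an nginx instance with the default http2_max_requests limit should exercise the same connection-teardown path.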
Thanks for the log, Honza, as well as the pointers to stuff in logan. Made this much easier to make sense of!

This is, indeed, a duplicate of bug 1486046. To make things even more convoluted, this exact issue with nginx has been discussed in yet another bug (that was temporarily hijacked from its original purpose) - take a look at the few comments starting at https://bugzilla.mozilla.org/show_bug.cgi?id=1050329#c9

I'll give a quick summary here, though: the way nginx shuts down h2 connections that have reached the request limit (which is, by default, 1000) is fundamentally broken and results in data loss because of the RST that gets sent. Since we receive a RST without a GOAWAY (after sending all requests, including those past the 1000th), there is no way for us to know whether the server did anything with our requests, and RFC 7540 says (or at least strongly implies) not to retry in that scenario (see my comments in bug 1486046 for more details).

Nginx has a bug on file about the whole broken close thing: https://trac.nginx.org/nginx/ticket/1250#comment:4

Reporter, probably the best thing to do is to significantly increase your http2_max_requests setting in nginx (Firefox, for example, won't roll over a connection until it has created something on the order of 50,000,000 streams; I'm not sure we ever reach that limit in practice). Why they set the default at 1000 is beyond me; it doesn't make sense for a protocol where connections are designed to be long-lived and carry lots of transactions precisely to avoid the cost of setting up new connections.
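The workaround above maps to a single directive in the nginx configuration (ngx_http_v2_module). A sketch; 100000 is an arbitrary illustrative value, not a tuned recommendation, and in more recent nginx releases this directive was reportedly superseded by keepalive_requests:

```nginx
# Inside the http {} or server {} block.
# Default is 1000; raise it so long-lived h2 connections are not
# torn down (with nginx's broken RST-based close) after 1000 requests.
http2_max_requests 100000;
```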
Status: UNCONFIRMED → RESOLVED
Closed: 6 years ago
Keywords: regression
Resolution: --- → DUPLICATE
I would still be interested in why this can't be reproduced with the default log modules.  It seems logging causes a significant timing difference that prevents us from reaching the limit.
I would imagine that our HTTP stack gets slowed down enough on the send side that we are able to read the GOAWAY (which nginx does send) out of the kernel buffer before we send the data that causes the server to send a RST (because of nginx's broken TCP stream teardown). Since the GOAWAY gets read before the RST clears the kernel buffer, we notice its existence and can safely retry the appropriate streams on a new session, so the behaviour doesn't manifest.
> Why they set the default at 1000 is beyond me

I could just add that I've probed nginx about this. The only answer I received was basically "because a customer requested it".

That is of course no explanation of why there is such a default, why it is so low, or why the documentation doesn't help nginx users make an informed decision about what to set the value to. I've seen nginx servers in the wild with a lower max-requests value than the maximum allowed concurrent streams!
For the record: After considerably increasing nginx's http2_max_requests setting, the problem's symptoms went away for us.