Closed Bug 61007 Opened 24 years ago Closed 24 years ago

Getting newsgroups is much slower than other programs.

Categories

(MailNews Core :: Networking: NNTP, defect, P3)

x86
Linux
defect

Tracking

(Not tracked)

VERIFIED FIXED
mozilla0.9

People

(Reporter: jhayryne, Assigned: sspitzer)

Details

(Keywords: perf, Whiteboard: fix in hand)

Attachments

(4 files)

(build 2000112208)

i'm on a 640k/7mbit adsl line; my isp has about 35k groups on its news server,
and the group file is a little over 1.6 megabytes.

with "tin", it takes less than 5 seconds to fetch the groups, network meter
shows speeds around 800kbytes/s when getting the groups.

with mozilla, the groups come in at about 20 kbytes/s and it takes several (2-3)
minutes to refresh the group list. during the refresh, cpu load is near 0%. why
is it so slow, when cpu speed and network speed are clearly not the issue?
It would be interesting to see a trace of the commands sent to the news
server. Is it possible for you to use tcpdump or something similar to log
what's happening? Maybe tin uses some other (more efficient) method than
mozilla.
Wow. That was quick. Is it possible to get the output as ascii chars too? I
think I have seen it nicely formatted somewhere, but maybe that was a program
other than tcpdump. I find the hex dump a little hard to read.

Anyway, one big difference is that tin's TCP packets are all 1448 bytes, while
mozilla's vary in size all the time, mostly around 900-1000 bytes. It's clear
that tin's transmission looks more efficient, but what causes the packet sizes
to vary is beyond my knowledge. I will cc gagan@netscape.com, who might know
why that is. I assume that news uses necko for the network communication?

Taking that as a target for this bug, I will confirm it.
Status: UNCONFIRMED → NEW
Ever confirmed: true
http://xxx.ath.cx/tcpdump-tin-ascii
http://xxx.ath.cx/tcpdump-mozilla-ascii

i hope they're of use, i don't know what to make of them :)))
Changing QA contact to Daniel Bratell, thanks!
QA Contact: esther → bratell
For purpose of being able to see this bug in a special query, I am re-assigning
QA to myself, but Janne, please feel free to keep on this...
QA Contact: bratell → stephend
If I read the traces correctly, tin does:
MODE READER\r\n
XOVER\r\n     (server answered "Not in a newsgroup")
LIST OVERVIEW\r\n (server answers with some data)
LIST\r\n

While for mozilla only a
LIST\r\n

is shown. I don't know if the trace is truncated, but there do seem to be some
differences. What's XOVER? What's MODE READER? I don't know if this affects the
performance, but it would be interesting to know what would happen if mozilla
sent the same commands before the LIST, if it doesn't already.

Janne, could you try to start the trace for mozilla earlier, before the
connection is made (I don't know exactly when that happens), so that we can
see everything mozilla sends? Alternatively, maybe sspitzer can tell us.
did new asciidumps, see urls above..

but one thing should be noted: while mozilla is fetching the groups list, the
tcp receive queue from the nntp server is constantly full. so there is data to
be read, but for some reason we read it slowly. therefore it might not be so
much a protocol issue as something related to the way mozilla reads and handles
data from the socket.
I see in your last trace that we don't do anything else but LIST. I wonder what 
MODE READER and XOVER do. I will try to remember to look them up. 

That the network buffer is full could very well explain what we see, with 
varying packet sizes. So, who's to blame? Are we not checking the buffers 
often enough? Are they too small? Can we increase their size and test? 

*changed summary*

Btw, how/where did you see that the buffers are getting full?
Summary: getting newsgroups is agonizingly slow. → Getting newsgroups is much slower than other programs (too small network buffer?).
netstat
the Recv-Q column in netstat shows bytes in the kernel's TCP buffer that have
not yet been read by the program that owns the socket. mine constantly shows
over 55 kbytes in the buffer, so we're just not reading data from the socket
as fast as we could.
removing "too small network buffer?" from the subject since it's kind of
irrelevant and not in mozilla's control. the problem is not that the buffer is
too small, but that we don't read it fast enough.

one thing to add to my previous comment: with tin, the Recv-Q in netstat shows
either 0 or at most 2 kilobytes waiting in the buffer when you check it
continuously during the fetch, which means the kernel can keep accepting data
into the buffer from the other host the whole time.
Summary: Getting newsgroups is much slower than other programs (too small network buffer?). → Getting newsgroups is much slower than other programs.
Adding self to CC to keep track of bug.
I've done some investigation and it seems like we can't add groups quicker than
40000 groups/second, and that's assuming no code overhead at all and a very
exact timer.

nsNNTPProtocol::ReadNewsList is controlled by a timer and won't take more
than 20 groups at a time. After 20 groups it starts a new timer at 50
microseconds (20 groups * 2000 fired timers = 40000 groups). Since there are
delays, I think there might be only 10-100 timers fired every second at most,
which might be one big reason for the long update time compared to, for
instance, tin.

Also, the NET_IS_SPACE macro in that function uses an extraordinary amount
of time (49% of the total according to Quantify). I wonder if the isspace
function (which the macro uses) might be locale sensitive and therefore much
slower than checking for ' ' or '\t' directly. I will try that change too.
I changed READ_NEWS_LIST_COUNT_MAX from 20 to 100 and replaced the isspace
call with a check for ' ' or '\t'. That reduced the time to read the group
list from 70 to 25 seconds, and I didn't see any unresponsiveness after the
change. My computer is a Celeron running at 375 MHz and I have 512 kbps to
the news server.

I will continue to experiment.

jhayryne, do you do your own builds and can check if you get the same result?
Keywords: perf
Assignee: sspitzer → bratell
I tried some other values of READ_NEWS_LIST_COUNT_MAX. Remember that 20-25
seconds is the theoretical best case with my net speed:

READ_NEWS_LIST_COUNT_MAX    Time to refresh
============================================
   20                          69.7 s
   50                          28.8 s
  100                          25.5 s
  200                          23.0 s
  500                          26.8 s
 1000                          23.4 s

In none of these did I notice any problems with responsiveness, but the
problems should be larger with a faster network.

I will attach a proposed patch and take the bug (please tell me if I shouldn't,
but since no one seems to have found the time I hope no one will object).
The patch changes NET_IS_SPACE to check for ' ' and '\t' instead of doing a
check for ASCII (< 0x7f) followed by a call to isspace. RFC 977 (NNTP) says
that SPACE or TAB should be used for whitespace, so there is no need for fancy
whitespace checks. The patch also changes READ_NEWS_LIST_COUNT_MAX to 200, which
seems to be a reasonable number. If anyone can check this patch with a really
fast connection to a fast news server, please comment in this bug on any
unresponsiveness in Mozilla during the newsgroup fetch. I think this throttle
was more important before sspitzer made the other operations so quick.

The patch also changes the code to compare against '.' only once, which might
make the code several nanoseconds quicker, and corrects a comment.

sspitzer, can you review the patch? This will, if the gods are willing, make the
subscribe dialog another ten times faster if the net is fast enough.
Attached patch Proposed patchSplinter Review
Status: NEW → ASSIGNED
Whiteboard: fix in hand
Target Milestone: --- → mozilla0.9
did some tests with different READ_NEWS_LIST_COUNT_MAX values and here's what
i got:

 200    19 s
 500    11 s
1000     8 s

didn't see any problems. perhaps a value bigger than 200 would scale a bit
better on faster links though. looks good.
How fast is your computer? Slower computers are worse off when the batch size
increases. Did you try to surf the web or move windows while fetching newsgroups?

If you did that without trouble and have a slow computer, it looks like we
should go to 1000 or even more.

8 seconds sounds close enough to tin to close this bug, or what do you think?
Btw, did you happen to see if the network buffer problem is gone?
well, you really can't surf the web while fetching newsgroups since the
subscribe window is modal, as was mentioned in the bloat bug. the subscribe
window's modality issue is bug 57266

my machine is 600mhz/256mb. i tried running my test build remotely on a
p90/48mb but for some reason it wouldn't start.. i'll try again later since
it'd make a good comparison.
It's only modal to the mail/news window. You can still work with any browser
windows you have open. I just checked this again.
sorry for the delay in commenting, I've been on vacation.

I'll go review and try the patch now.

ok.  Daniel is right to up the MAX to 200 from 20.

I'm not convinced it is safe to change the NET_IS_SPACE macro.

isspace() returns true for more than space and tab.  among other things, it
returns true for \n and \r.

since NET_IS_SPACE is called in other parts of news code, i'm going to need to
verify that switching to Daniel's faster implementation won't break anything.

the good news is that by simply changing the MAX, downloading is much faster.

I've got changes to properly reflect the download rate in the subscribe dialog
(it currently says "?.? Kb / sec").

I'll go measure the download rates with MAX at 20 and MAX at 200 when using
subscribe to news.mcom.com and news.supernews.com over my ISDN line.
over my ISDN talking to news.supernews.com, here's what I get:

max == 20, rate is 15K / sec.
max == 200, rate is 38K / sec.

with max at 500, I didn't see any gain, but my ISDN is pretty slow.  since I
also didn't see any responsiveness problems, I'm ok with setting it at 500 to
benefit users who have faster connections, assuming this doesn't hurt the
responsiveness on the target machine over a modem connection.  (the target
machine has 64 MB ram and about a 200-300 Mhz CPU, I think)
I'm not 100% sure either. I read the spec and it's enough to check for SPACE
and TAB while listing newsgroups, and all the places it's used (three places
altogether) only handle one line at a time, so it feels safe.

I do know that the isspace method turned out to be really slow on Windows, 
though. As I said somewhere, 49% of the computation time during group 
retrieval was spent in that function, so I think we can get somewhere around a 
30% speedup by switching to a raw check for SPACE and TAB. It's difficult to 
make really good measurements since they depend on the network so much. 

It is the constant that causes the big change, though, at least for people
with fast nets. I had somewhere around 15 KB/s before and 50 KB/s afterwards,
and I have no idea where the ceiling is.

For the modem users, I thought that when the input stream is empty and 
NNTP_PAUSE_FOR_READ is set, the UI thread would regain control until more 
data arrives? If that's the case, the problem will only affect people who 
never run out of data in the input queue while taking too long to process a 
group batch.

For the value of the constant, 500 feels safe. Maybe we could go even higher 
to benefit fast nets even more, but that feels risky until someone with a 
low-end computer tests it.
you are right about PAUSE_FOR_READ, we should be safe on slower connections
since we will not be starving the UI thread.

I'll attach the patch I plan to land after I get it super reviewed.

note, it includes the UI changes to display the download rate.
adding patch keyword.
Keywords: patch
cc'ing putterman.

a while ago putterman and I were discussing why downloading the list of
newsgroups from the server was so slow over a fast connection.

Daniel and Janne figured out the problem and provided a patch.

thanks again to daniel and janne for the excellent work.
The patch looks fine (even without my nanosecond optimization of dot reading. 
:-) ). Only one small thing: isn't KB the common abbreviation for kilobytes? 
(I've not fallen for KiB yet):

-bytesReceived=%S K read (at %SK/sec) 
+bytesReceived=%SK read (at %SK/sec) 

Apart from that, I'm satisfied with the patch. Therefore giving bug back to 
sspitzer. :-)
Assignee: bratell → sspitzer
Status: ASSIGNED → NEW
I used K because that is what we did in 4.x

let's take the K -> KB discussion to bug #61776.

I'll land this patch after I get it super reviewed.

cc'ing bienvenu for super review.
Status: NEW → ASSIGNED
sr=bienvenu (but you might want to double check that this builds on the mac
since it has a funny way with 64 bit #'s)
adding syd the cc list.  he also did a review and gave me some suggestions.

I'll attach the final patch before I mark this fixed.
fixed.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
note, this builds fine on the mac.  sadly, I have a mac at home.
This bug is VERIFIED FIXED on Mac 12-05-00, NT 2000120504, and Linux 2000120508.
Status: RESOLVED → VERIFIED
Product: MailNews → Core
Product: Core → MailNews Core