423009 - b.m.o requests get lost / timeouts / connection resets

Reporter

Description

•

16 years ago

STEPS TO REPRODUCE
1. query b.m.o for a specific bug, eg
https://bugzilla.mozilla.org/show_bug.cgi?id=414039
-or-
1. submit a search request using
https://bugzilla.mozilla.org/query.cgi
-or-
just about any other b.m.o request, including submitting bugs, comments...

2. wait for something to happen

ACTUAL RESULTS
Status bar says "waiting for bugzilla.mozilla.org..."
Still no result after waiting a minute.
Making a new request for whatever I was doing usually gives me
the result right away.  It's as if some requests just get lost somehow...

This happens to me several times a day and is becoming quite frustrating.
I'm considering giving up bug triage altogether because b.m.o is
simply not usable for tasks that require reasonable response times.

I'm pretty sure this problem started when the move to a cluster
was done a few months back.  It worked fine before then.

Dave Miller [:justdave]

Updated

•

16 years ago

Assignee: justdave → server-ops

Component: Bugzilla: Other b.m.o Issues → Server Operations

QA Contact: reed → justin

Justin Fitzhugh

Comment 1

•

16 years ago

this may be due to your connectivity, as we use bmo from here all day long, with no issues.  not a server side issue - perhaps attach a traceroute to bmo?

Severity: critical → normal

Reed Loden [:reed]

Comment 2

•

16 years ago

I'm not seeing it as bad as what Mats is seeing, but I've definitely seen problems since the move to the cluster, including random "Connection Interrupted" and "Connection Reset" messages.

Dave Miller [:justdave]

Comment 3

•

16 years ago

I've been getting a lot of "Connection Reset" ones myself the last few days in particular.

Mats Palmgren (inactive)

Reporter

Comment 4

•

16 years ago

This bug is critical for me to get my work done.

(In reply to comment #1)
> this may be due to your connectivity, as 

I have an excellent 100Mb/s ethernet link to my ISP.  It works flawlessly
with any other site except b.m.o.

> we use bmo from here all day long with no issues.

then perhaps you should try using it from outside the firewall
or whatever, so you can see what I'm talking about?

Or are problems with accessing b.m.o. from outside the Mozilla Hq.
not important for you to look at?

> not a server side issue - perhaps attach a traceroute to bmo?

Please take this bug seriously.

Severity: normal → critical

:Gavin Sharp [email: gavin@gavinsharp.com]

Comment 5

•

16 years ago

I've seen "Connection reset" errors too, from both home and the office. Hard to tell when it started; it may have been some time after the move to the cluster, though it seems more recent than that. I've also seen several other people report similar problems on IRC. Hard to tell whether it's an actual bmo problem, though... could be something on the client (e.g. bug 421566 - possible fallout from enabling pipelining for SSL in trunk builds).

Severity: critical → normal

Mats Palmgren (inactive)

Reporter

Comment 6

•

16 years ago

(In reply to comment #5)
> could be something on the client (e.g. bug 421566 - possible
> fallout from enabling pipelining for SSL in trunk builds).

I think I've seen this problem since before beginning of February though.
It's been more frequent lately, so I guess there could be two
different problems... the original one being some b.m.o. cluster
problem and the other from pipelining.
I would expect more bug reports if it were a client problem though.

I'll try setting network.http.pipelining.ssl=false to see if
that helps...

Justin Fitzhugh

Comment 7

•

16 years ago

(In reply to comment #4)
> This bug is critical for me to get my work done.

Yes, I agree, but not something I am going to have people paged about.  That's why I dropped the sev.

> I have an excellent 100Mb/s ethernet link to my ISP.  It works flawlessly
> with any other site except b.m.o.

The speed of your link means nothing - perhaps you do have great connectivity, but I'd like to rule that out and simply asked for a traceroute.

> then perhaps you should try using it from outside the firewall
> or whatever, so you can see what I'm talking about?

I do, every night when I use bugzilla from home.

> Or are problems with accessing b.m.o. from outside the Mozilla Hq.
> not important for you to look at?

Not at all, I have tested from there too.
 
> Please take this bug seriously. 

I do, but it's still not a critical bug worth paging someone about constantly.  If I didn't, I would have closed it INVALID.

Mats Palmgren (inactive)

Reporter

Comment 8

•

16 years ago

(In reply to comment #7)
Ok, I misread your tone then, sorry.

matthew zeier [:mrz]

Comment 9

•

16 years ago

(In reply to comment #6)
> I'll try setting network.http.pipelining.ssl=false to see if
> that helps...

Curious - did it?

Mats Palmgren (inactive)

Reporter

Comment 10

•

16 years ago

It seems much better now, yes.  The problem still occurs though, perhaps
2-3 times the last few days.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 11

•

16 years ago

I've been seeing this as well for the past few months, including from the MoCo office (Building S); occasionally a show_bug.cgi load will show a "Connection reset" error page instead of the page, and occasionally a bug will load without its style sheet.

I'd been meaning to bug someone about it, but hadn't gotten around to it...

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 12

•

16 years ago

For what it's worth, I just saw this using wget from the wired network in Building S, so it's neither Mats's connection nor the pipelining stuff.  (There was a bit of a pause before the "Read error", and then the second connection went pretty fast.)

$ wget 'https://bugzilla.mozilla.org/attachment.cgi?id=308998' -O p
--18:49:40--  https://bugzilla.mozilla.org/attachment.cgi?id=308998
           => `p'
Resolving bugzilla.mozilla.org... 63.245.209.72
Connecting to bugzilla.mozilla.org|63.245.209.72|:443... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

--18:49:42--  https://bugzilla.mozilla.org/attachment.cgi?id=308998
  (try: 2) => `p'
Connecting to bugzilla.mozilla.org|63.245.209.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14,554 (14K) [text/plain]

100%[====================================>] 14,554        --.--K/s             

18:49:52 (14.22 MB/s) - `p' saved [14554/14554]
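
For anyone who wants to measure how often this happens rather than wait for it to show up in a browser, here is a minimal Python sketch of the same wget experiment run in a loop. The names `fetch_once`, `classify` and `probe` are hypothetical helpers made up for this illustration, and the 60-second timeout is an assumption, not anything taken from the b.m.o setup.

```python
import socket
import urllib.error
import urllib.request
from collections import Counter


def classify(exc):
    """Map an exception from a failed fetch to a short outcome label."""
    # URLError wraps the underlying socket error in its .reason attribute.
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, ConnectionResetError):
        return "reset"
    if isinstance(reason, socket.timeout):
        return "timeout"
    return "other error"


def fetch_once(url, timeout=60):
    """Fetch url once and return an outcome label such as 'http 200' or 'reset'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            return f"http {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a success status.
        return f"http {exc.code}"
    except OSError as exc:
        # Covers URLError, connection resets and socket timeouts.
        return classify(exc)


def probe(url, attempts, fetch=fetch_once):
    """Fetch url `attempts` times and tally the outcomes."""
    return Counter(fetch(url) for _ in range(attempts))
```

Calling `probe(url, 100)` against one of the URLs in this bug would return a tally of outcome labels, which is easier to compare between networks or before and after a server-side change than individual anecdotes.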

Dave Miller [:justdave]

Comment 13

•

16 years ago

I got it once again today in Minefield, and I had my ssl pipelining disabled after the earlier recommendations to try that, so that's another vote for that not being the issue.

Dave Miller [:justdave]

Updated

•

16 years ago

Summary: b.m.o requests get lost → b.m.o requests get lost / timeouts / connection resets

Mats Palmgren (inactive)

Reporter

Comment 14

•

16 years ago

Having used Bugzilla a bit more the last few days I can say that it
occurs several times a day for me.  A few observations:
1. a request that hangs never times out
2. the progress indicator in the status bar usually stops about half way
3. CTRL+U in a page that hangs shows all HTML
4. DOMI shows that (at least some) stylesheets are loaded (b.m.o/skins/...)
5. File->Work Offline, twice, makes all the hung pages render
6. when a request hangs, opening a new tab and making more b.m.o requests
   until one succeeds, appears to make the first request succeed too
(I have SSL pipelining turned off)
Let me know if there is anything else you want me to try...

Justin Fitzhugh

Comment 15

•

16 years ago

think this is resolved with a few changes on the server side to speed up responses.  can you verify?

Reed Loden [:reed]

Comment 16

•

16 years ago

(In reply to comment #15)
> think this is resolved with a few changes on the server side to speed up
> responses.  can you verify?

I just got "Connection Interrupted" loading show_activity.cgi.

Justin Fitzhugh

Comment 17

•

16 years ago

could be because of the mx window during which dave is fixing a few things - if we still get them tomorrow, I'll be concerned, but tx for the report...

Mats Palmgren (inactive)

Reporter

Comment 18

•

16 years ago

Sorry, but I still see this daily (with SSL pipelining off).

Jeff Walden [:Waldo]

Comment 19

•

16 years ago

I've seen this reasonably often lately with Firefox 2; I don't think this bug is the result of an issue in Firefox 3.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 20

•

16 years ago

I might even be seeing it more often than normal in the past day or so.

Re comment 19:  I've seen it with wget; see comment 12.

Wayne Mery (:wsmwk)

Comment 21

•

16 years ago

seen several times this morning (and bug 421566), on at least two PCs, even though you'd think BMO activity would be very light.

Dave Miller [:justdave]

Comment 22

•

16 years ago

I've gotten it three or four more times in the last two days, too.

Mark Smith [:xb95]

Assignee

Comment 23

•

16 years ago

Discussed at today's IT triage.  Have a plan of attack for trying to narrow this down...

Assignee: server-ops → mark

Brian Crowder

Comment 24

•

16 years ago

Are you guys seeing events being recorded in the server-logs that could yield timeouts/resets such as are being reported in this and bug 421566?

Mark Smith [:xb95]

Assignee

Comment 25

•

16 years ago

No, server logs (both netscaler level and Apache level) don't show the resets.  (We've had a few we had temporal data on and scoured every log, but couldn't find the reset/failure anywhere...)

Anyway, Dave had a theory, and we've put it to the test.  Are people still seeing the resets?  I haven't heard of anybody here in IT seeing them in the past day or so, but it'd be good to cast a wider net.

The theory is: right now Apache is set to have a size limit on the processes.  This limit is set pretty small right now, so Apache ends up killing itself after every request.  Since it's doing keep-alive, we think the netscaler is sending a second request along (either pipelining or just sending it quickly after the response from the first one), then the Apache process terminates, causing the second connection to drop.

This would be exacerbated in high load situations.  The fix we're trying now is that we've set the netscaler to only send one request per backend connection.  This is inefficient of course, but is an easy way to test the theory out.
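
The theory above can be sketched on loopback with plain sockets. This is only an illustration of the suspected mechanism, not the actual Apache or NetScaler configuration: `one_shot_worker` is a hypothetical stand-in for a worker process that advertises keep-alive and then exits after a single request, and the client plays the part of a proxy reusing the backend connection.

```python
import socket
import threading

RESPONSE = (b"HTTP/1.1 200 OK\r\n"
            b"Connection: keep-alive\r\n"
            b"Content-Length: 2\r\n\r\nok")


def one_shot_worker(listener):
    """Answer one request, then exit like a worker that hit its size limit."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(4096)         # read the first request
        conn.sendall(RESPONSE)  # promise keep-alive in the response
    # Leaving the with-block closes the socket despite that promise.


def send_request(sock):
    """Send one request on an existing connection and report what came back."""
    try:
        sock.sendall(b"GET / HTTP/1.1\r\nHost: example\r\n\r\n")
        data = sock.recv(4096)
    except OSError:
        return "dropped"  # reset, broken pipe or timeout
    return "ok" if data.startswith(b"HTTP/1.1 200") else "dropped"


def demo():
    """Reuse one connection for two requests against a one-shot worker."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    worker = threading.Thread(target=one_shot_worker, args=(listener,))
    worker.start()
    client = socket.create_connection(listener.getsockname(), timeout=5)
    first = send_request(client)
    worker.join()                  # the worker has now gone away
    second = send_request(client)  # reuse the "kept alive" connection
    client.close()
    listener.close()
    return first, second
```

Here the first request succeeds and the second is dropped, because nothing is left on the other end to answer it. Limiting the proxy to one request per backend connection avoids ever sending that second request.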

So yeah - anybody seeing resets still?  Thanks.

Status: NEW → ASSIGNED

Frédéric Buclin

Comment 26

•

16 years ago

(In reply to comment #25)
> So yeah - anybody seeing resets still?  Thanks.

Starting from when? This happened to me again 4 hours ago. Did you make your changes after this time?

Mats Palmgren (inactive)

Reporter

Comment 27

•

16 years ago

(In reply to comment #25)
> Are people still seeing the resets?

Definitely.  Today was probably the worst day so far, b.m.o. failed
about 80% of my requests for a 2 hour period, starting about 5 hours ago.
Seems to work fine now though.

Frédéric Buclin

Comment 28

•

16 years ago

Trying to view a bug right now...
first attempt: timeout
second attempt: CSS files not loaded
third attempt: I finally have the page correctly loaded

my reaction: let's complain to xb95 in bug 423009

first, second, third, fourth and fifth attempts: all timeouts!!
sixth attempt: here I am to complain *heavily*!

b.m.o looks as slow as when mod_perl was not yet supported, or even worse.

Dave Miller [:justdave]

Comment 29

•

16 years ago

For the record, this morning was a DoS attack.  The IP was blocked and everything recovered.

Dave Miller [:justdave]

Comment 30

•

16 years ago

Someone also added mrapp53 into the bugzilla service group at some point, and that server isn't supposed to be serving Bugzilla.  Don't know how much that's responsible for (it didn't get the apache tweaks that 51 and 52 did).  It's been removed again, so if we could start over again on the pipelining theory without that server in the mix...

Wayne Mery (:wsmwk)

Comment 31

•

16 years ago

nothing but resets attempting search of "top 10 stack frames" for 3 month period on crash-stats

Wil Clouser [:clouserw]

Comment 32

•

16 years ago

In a triage meeting right now and seeing this pretty consistently between all our computers (geographically distant)

Mike Beltzner [:beltzner, not reading bugmail]

Comment 33

•

16 years ago

We used to be able to blame this on the MoCoTo internet, but ever since that got resolved, we're starting to see how frequently this hits us. I'd figure at least 6 times an hour.

Mark Smith [:xb95]

Assignee

Comment 34

•

16 years ago

I am currently doing some packet logging, but it's on a rotating basis so at any point I will have the most recent ~hour of logs.

If anybody starts seeing resets, please find me on IRC (I am xb95)!  You will need to give me the URL you were trying to access, as well as the URL you previously accessed, and your IP address.  This will allow me to find your connection in the millions of packets.  :)

I have been looking at the traffic overall, and I still think the theory is valid; it just wasn't fixed by setting the NetScaler to only send a single request on a connection.  I am seeing hard RST packets in the streams, which is what I would expect if the NetScaler sends traffic to Apache and it closes the connection.  (If the receive queue has content on a close, typically the kernel sends a RST, IIRC...)
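
The kernel behaviour mentioned in the parenthesis can be observed on loopback with a small Python sketch. This is an illustration under assumptions, not a capture from the b.m.o servers: `close_with_unread_data` and `observe_close` are hypothetical names, and whether the peer sees a reset or a plain end-of-stream depends on the operating system.

```python
import socket
import threading


def close_with_unread_data(listener):
    """Accept a connection and close it while request bytes are still queued."""
    conn, _ = listener.accept()
    # MSG_PEEK waits until data has arrived without removing it from the queue.
    conn.recv(1, socket.MSG_PEEK)
    conn.close()  # the unread bytes are still in the receive queue here


def observe_close():
    """Report what the sending side sees after such a close."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    server = threading.Thread(target=close_with_unread_data, args=(listener,))
    server.start()
    client = socket.create_connection(listener.getsockname(), timeout=5)
    try:
        client.sendall(b"GET / HTTP/1.1\r\nHost: example\r\n\r\n")
        server.join()  # make sure the close has happened
        data = client.recv(4096)
        return "eof" if data == b"" else "data"
    except ConnectionResetError:
        return "reset"
    finally:
        client.close()
        listener.close()
```

On Linux this typically reports a reset rather than a clean end-of-stream, which would match hard RST packets showing up in a trace when the backend closes a connection that still has an unread request on it.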

Anyway, still trying to debug this or reproduce this.  Just hit me on IRC when you see this, with the above information, and if I'm around I'll dive in.  Thanks!

Mark Smith [:xb95]

Assignee

Comment 35

•

16 years ago

Haven't had a single person take me up on my offer.  Can't really do much to diagnose this if I can't reproduce it myself, so I have to rely on you all!

Again, if you see a reset, please find me on IRC.  I just need a little bit of information from you.  Thanks!

Mats Palmgren (inactive)

Reporter

Comment 36

•

16 years ago

I haven't seen this problem for a while, so I consider this bug fixed
(with SSL pipelining off anyway).  Thanks!

Wayne Mery (:wsmwk)

Comment 37

•

16 years ago

Is this bug different from getting connection resets with crash-stats and talkback?

Mark, I pinged you on Wednesday or Thursday but I guess you were away.

Mark Smith [:xb95]

Assignee

Comment 38

•

16 years ago

Given we don't know for a fact what the underlying problem is, I can't say if it is different or not.  It might be the same root cause, but I just don't know.

I haven't heard anybody reporting this since last week, either.  (But that doesn't mean it's fixed...)

If I hear nothing by next week, I'm going to assume that a combination of isolating Bugzilla to its own webservers and the tweaks to the NetScaler have fixed the problem and close out the bug.

Wayne Mery (:wsmwk)

Comment 39

•

16 years ago

Mark, I no longer see this with bugzilla.

However, crash-stats is a similar problem  - bug 422908

Mark Smith [:xb95]

Assignee

Comment 40

•

16 years ago

I believe this is resolved.  Between the work mrz and justdave did on the NetScaler and application servers themselves, it seems we've ironed out the monkey that caused this.

If this should resurface, we can resurrect this ticket.

Status: ASSIGNED → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

9 years ago

Product: mozilla.org → mozilla.org Graveyard