Closed Bug 423009 Opened 16 years ago Closed 16 years ago

b.m.o requests get lost / timeouts / connection resets

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: MatsPalmgren_bugz, Assigned: xb95)

Details

(Keywords: dogfood, regression)

STEPS TO REPRODUCE
1. query b.m.o for a specific bug, eg
https://bugzilla.mozilla.org/show_bug.cgi?id=414039
-or-
1. submit a search request using
https://bugzilla.mozilla.org/query.cgi
-or-
just about any other b.m.o request, including submitting bugs, comments...

2. wait for something to happen

ACTUAL RESULTS
Status bar says "waiting for bugzilla.mozilla.org..."
Still no result after waiting a minute.
Making a new request for whatever I was doing usually gives me
the result right away.  It's as if some requests just get lost somehow...

This happens to me several times a day and is becoming quite frustrating.
I'm considering giving up bug triage altogether because b.m.o is
simply not usable for tasks that require reasonable response times.

I'm pretty sure this problem started when the move to a cluster
was done a few months back.  It worked fine before then.
Assignee: justdave → server-ops
Component: Bugzilla: Other b.m.o Issues → Server Operations
QA Contact: reed → justin
this may be due to your connectivity, as we use bmo from here all day long, with no issues.  not a server side issue - perhaps attach a traceroute to bmo?
Severity: critical → normal
I'm not seeing it as bad as what Mats is seeing, but I've definitely seen problems since the move to the cluster, including random "Connection Interrupted" and "Connection Reset" messages.
I've been getting a lot of "Connection Reset" ones myself the last few days in particular.
This bug is critical for me to get my work done.

(In reply to comment #1)
> this may be due to your connectivity, as 

I have an excellent 100Mb/s ethernet link to my ISP.  It works flawlessly
with any other site except b.m.o.

> we use bmo from here all day long with no issues.

then perhaps you should try using it from outside the firewall
or whatever, so you can see what I'm talking about?

Or are problems with accessing b.m.o. from outside the Mozilla Hq.
not important for you to look at?

> not a server side issue - perhaps attach a traceroute to bmo?

Please take this bug seriously.
Severity: normal → critical
I've seen "Connection reset" errors too, from both home and the office. Hard to tell when it started; it may have been some time after the move to the cluster, though it seems more recent than that. I've also seen several other people report similar problems on IRC. Hard to tell whether it's an actual bmo problem, though... could be something on the client (e.g. bug 421566 - possible fallout from enabling pipelining for SSL in trunk builds).
Severity: critical → normal
(In reply to comment #5)
> could be something on the client (e.g. bug 421566 - possible
> fallout from enabling pipelining for SSL in trunk builds).

I think I've seen this problem since before beginning of February though.
It's been more frequent lately, so I guess there could be two
different problems... the original one being some b.m.o. cluster
problem and the other from pipelining.
I would expect more bug reports if it were a client problem though.

I'll try setting network.http.pipelining.ssl=false to see if
that helps...
(In reply to comment #4)
> This bug is critical for me to get my work done.

Yes, I agree, but not something I am going to have people paged about.  That's why I dropped the sev.

> I have an excellent 100Mb/s ethernet link to my ISP.  It works flawlessly
> with any other site except b.m.o.

The speed of your link means nothing - perhaps you do have great connectivity, but I'd like to rule that out and simply asked for a traceroute.

> then perhaps you should try using it from outside the firewall
> or whatever, so you can see what I'm talking about?

I do, every night when I use bugzilla from home.

> Or are problems with accessing b.m.o. from outside the Mozilla Hq.
> not important for you to look at?

Not at all, I have tested from there too.
 
> Please take this bug seriously. 

I do, but it's still not a critical bug worth paging someone about constantly.  If I didn't, I would have closed it INVALID.
(In reply to comment #7)
Ok, I misread your tone then, sorry.
(In reply to comment #6)
> I'll try setting network.http.pipelining.ssl=false to see if
> that helps...

Curious - did it?
It seems much better now, yes.  The problem still occurs though, perhaps
2-3 times the last few days.
I've been seeing this as well for the past few months, including from the MoCo office (Building S); occasionally a show_bug.cgi load will show a "Connection reset" error page instead of the page, and occasionally a bug will load without its style sheet.

I'd been meaning to bug someone about it, but hadn't gotten around to it...
For what it's worth, I just saw this using wget from the wired network in Building S, so it's neither Mats's connection nor the pipelining stuff.  (There was a bit of a pause before the "Read error", and then the second connection went pretty fast.)

$ wget 'https://bugzilla.mozilla.org/attachment.cgi?id=308998' -O p
--18:49:40--  https://bugzilla.mozilla.org/attachment.cgi?id=308998
           => `p'
Resolving bugzilla.mozilla.org... 63.245.209.72
Connecting to bugzilla.mozilla.org|63.245.209.72|:443... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

--18:49:42--  https://bugzilla.mozilla.org/attachment.cgi?id=308998
  (try: 2) => `p'
Connecting to bugzilla.mozilla.org|63.245.209.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14,554 (14K) [text/plain]

100%[====================================>] 14,554        --.--K/s             

18:49:52 (14.22 MB/s) - `p' saved [14554/14554]
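The wget run above recovered by simply retrying after the reset. A minimal sketch of that retry-on-reset behaviour, for anyone scripting a reproduction: `fetch_with_retry` and its `fetch` callable are hypothetical names introduced here, and the sketch assumes whatever `fetch` you pass in raises `ConnectionResetError` when the peer resets the connection.

```python
import time


def fetch_with_retry(fetch, url, tries=3, delay=0.0):
    """Call fetch(url), retrying when the connection is reset.

    `fetch` is any caller-supplied function (an assumption of this
    sketch) that returns the response body or raises
    ConnectionResetError. Returns (body, attempts_used).
    """
    for attempt in range(1, tries + 1):
        try:
            return fetch(url), attempt
        except ConnectionResetError:
            if attempt == tries:
                raise  # out of retries: let the caller see the reset
            time.sleep(delay)
```

Logging `attempts_used` for each URL gives a rough count of how often the first try was reset, which is the number this bug is about.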
I got it once again today in Minefield, and I had my ssl pipelining disabled after the earlier recommendations to try that, so that's another vote for that not being the issue.
Summary: b.m.o requests get lost → b.m.o requests get lost / timeouts / connection resets
Having used Bugzilla a bit more the last few days I can say that it
occurs several times a day for me.  A few observations:
1. a request that hangs never times out
2. the progress indicator in the status bar usually stops about half way
3. CTRL+U in a page that hangs shows all HTML
4. DOMI shows that (at least some) stylesheets are loaded (b.m.o/skins/...)
5. File->Work Offline, twice, makes all the hung pages render
6. when a request hangs, opening a new tab and making more b.m.o requests
   until one succeeds, appears to make the first request succeed too
(I have SSL pipelining turned off)
Let me know if there is anything else you want me to try...
think this is resolved with a few changes on the server side to speed up responses.  can you verify?
(In reply to comment #15)
> think this is resolved with a few changes on the server side to speed up
> responses.  can you verify?

I just got "Connection Interrupted" loading show_activity.cgi.
Could be because of the maintenance window, during which Dave is fixing a few things. If we still get them tomorrow, I'll be concerned, but thanks for the report...
Sorry, but I still see this daily (with SSL pipelining off).
I've seen this reasonably often lately with Firefox 2; I don't think this bug is the result of an issue in Firefox 3.
I might even be seeing it more often than normal in the past day or so.

Re comment 19:  I've seen it with wget; see comment 12.
seen several times this morning (and bug 421566), on at least two PCs, even though you'd think BMO activity would be very light. 
I've gotten it three or four more times in the last two days, too.
Discussed at today's IT triage.  Have a plan of attack for trying to narrow this down...
Assignee: server-ops → mark
Are you guys seeing events being recorded in the server-logs that could yield timeouts/resets such as are being reported in this and bug 421566?
No, server logs (both netscaler level and Apache level) don't show the resets.  (We've had a few we had temporal data on and scoured every log, but couldn't find the reset/failure anywhere...)

Anyway, Dave had a theory, and we've put it to the test.  Are people still seeing the resets?  I haven't heard of anybody here in IT seeing them in the past day or so, but it'd be good to cast a wider net.

The theory is: right now Apache is set to have a size limit on the processes.  This limit is set pretty small right now, so Apache ends up killing itself after every request.  Since it's doing keep-alive, we think the netscaler is sending a second request along (either pipelining or just sending it quickly after the response from the first one), then the Apache process terminates, causing the second connection to drop.

This would be exacerbated in high load situations.  The fix we're trying now is that we've set the netscaler to only send one request per backend connection.  This is inefficient of course, but is an easy way to test the theory out.

So yeah - anybody seeing resets still?  Thanks.
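The theory above can be modelled locally. The following is only a sketch under stated assumptions, not the actual Apache or NetScaler setup: `one_shot_worker` is a hypothetical stand-in for a backend process that exits after every request despite advertising keep-alive, and `fetch_twice` plays the role of a front end that either reuses the backend connection or opens a new one per request.

```python
import http.client
import socket
import threading

RESPONSE = (b"HTTP/1.1 200 OK\r\n"
            b"Content-Length: 2\r\n"
            b"Connection: keep-alive\r\n\r\nok")


def one_shot_worker(listener, connections):
    """Answer exactly one request per connection, then close it."""
    for _ in range(connections):
        conn, _addr = listener.accept()
        with conn:
            data = b""
            while b"\r\n\r\n" not in data:  # read one request's headers
                chunk = conn.recv(4096)
                if not chunk:
                    break
                data += chunk
            conn.sendall(RESPONSE)
        # leaving the with-block closes the socket: the "worker" is gone


def fetch_twice(reuse_connection):
    """Send two GETs, over one shared connection or one connection each."""
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    expected = 1 if reuse_connection else 2
    worker = threading.Thread(target=one_shot_worker,
                              args=(listener, expected), daemon=True)
    worker.start()
    results = []
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    for _ in range(2):
        try:
            conn.request("GET", "/")
            results.append(conn.getresponse().read().decode())
        except ConnectionError:
            # covers reset, broken pipe, and "remote end closed connection"
            results.append("dropped")
        if not reuse_connection:
            conn.close()  # the next request() reconnects automatically
    conn.close()
    worker.join(timeout=5)
    listener.close()
    return results
```

With `fetch_twice(True)` the second request lands on a connection whose worker has already gone away and is dropped; with `fetch_twice(False)`, the analogue of one request per backend connection, both requests succeed at the cost of an extra connection setup.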
Status: NEW → ASSIGNED
(In reply to comment #25)
> So yeah - anybody seeing resets still?  Thanks.

Starting from when? This happened to me again 4 hours ago. Did you make your changes after this time?
(In reply to comment #25)
> Are people still seeing the resets?

Definitely.  Today was probably the worst day so far, b.m.o. failed
about 80% of my requests for a 2 hour period, starting about 5 hours ago.
Seems to work fine now though.
Trying to view a bug right now...
first attempt: timeout
second attempt: CSS files not loaded
third attempt: I finally have the page correctly loaded

my reaction: let's complain to xb95 in bug 423009

first, second, third, fourth and fifth attempts: all timeouts!!
sixth attempt: here I am to complain *heavily*!

b.m.o looks as slow as when mod_perl was not yet supported, or even worse.
For the record, this morning was a DoS attack.  The IP was blocked and everything recovered.
Someone also added mrapp53 into the bugzilla service group at some point, and that server isn't supposed to be serving Bugzilla.  Don't know how much that's responsible for (it didn't get the apache tweaks that 51 and 52 did).  It's been removed again, so if we could start over again on the pipelining theory without that server in the mix...
nothing but resets attempting search of "top 10 stack frames" for 3 month period on crash-stats 
In a triage meeting right now and seeing this pretty consistently between all our computers (geographically distant)
We used to be able to blame this on the MoCoTo internet, but ever since that got resolved, we're starting to see how frequently this hits us. I'd figure at least 6 times an hour.
I am currently doing some packet logging, but it's on a rotating basis so at any point I will have the most recent ~hour of logs.

If anybody starts seeing resets, please find me on IRC (I am xb95)!  You will need to give me the URL you were trying to access, as well as the URL you previously accessed, and your IP address.  This will allow me to find your connection in the millions of packets.  :)

I have been looking at the traffic overall, and I still think the theory is valid; it just wasn't fixed by setting the NetScaler to only send a single request on a connection.  I am seeing hard RST packets in the streams, which is what I would expect if the NetScaler sends traffic to Apache and Apache closes the connection.  (If the receive queue has content on a close, typically the kernel sends a RST, IIRC...)
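The RST-on-close behaviour mentioned in the parenthetical can be checked on a loopback socket. This is a minimal sketch, assuming a Linux-like TCP stack (other stacks may behave differently); `close_with_pending_data` is a hypothetical helper, and the sleeps are only there to let the packets settle.

```python
import socket
import time


def close_with_pending_data(drain_first):
    """Close a server-side socket with or without unread data queued.

    Returns "reset" if the client then sees a connection reset, "eof"
    if it sees an orderly close, or "data" if it unexpectedly reads bytes.
    """
    listener = socket.create_server(("127.0.0.1", 0))
    client = socket.create_connection(listener.getsockname())
    client.settimeout(5)
    server_side, _addr = listener.accept()

    client.sendall(b"GET /second HTTP/1.1\r\n\r\n")
    time.sleep(0.2)  # let the bytes reach the server's receive queue
    if drain_first:
        server_side.recv(4096)  # empty the receive queue before closing
    server_side.close()
    time.sleep(0.2)  # let the FIN or RST reach the client

    try:
        data = client.recv(4096)
        outcome = "eof" if data == b"" else "data"
    except ConnectionResetError:
        outcome = "reset"
    finally:
        client.close()
        listener.close()
    return outcome
```

Calling it with `drain_first=False` corresponds to the suspected case here: a request is already queued on the backend socket when the process closes it, so the peer gets a reset instead of an orderly close.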

Anyway, still trying to debug this or reproduce this.  Just hit me on IRC when you see this, with the above information, and if I'm around I'll dive in.  Thanks!
Haven't had a single person take me up on my offer.  Can't really do much to diagnose this if I can't reproduce it myself, so I have to rely on you all!

Again, if you see a reset, please find me on IRC.  I just need a little bit of information from you.  Thanks!
I haven't seen this problem for a while, so I consider this bug fixed
(with SSL pipelining off anyway).  Thanks!
Is this bug different from getting connection resets with crash-stats and talkback?

Mark, I pinged you on Wednesday or Thursday but I guess you were away.
Given we don't know for a fact what the underlying problem is, I can't say if it is different or not.  It might be the same root cause, but I just don't know.

I haven't heard anybody reporting this since last week, either.  (But that doesn't mean it's fixed...)

If I hear nothing by next week, I'm going to assume that a combination of isolating Bugzilla to its own webservers and the tweaks to the NetScaler have fixed the problem and close out the bug.
Mark, I no longer see this with bugzilla.

However, crash-stats has a similar problem - bug 422908
I believe this is resolved.  Between the work mrz and justdave did on the NetScaler and application servers themselves, it seems we've ironed out the monkey that caused this.

If this should resurface, we can resurrect this ticket.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard