495308 - All connections to dm-chat01 time out

Assignee

Description

•

16 years ago

For the past week or so, all connections to dm-chat01 have been timing out at random times during the day, for about a minute each time. All packets sent to the server during that minute appear to be dropped or delayed, interrupting connections and/or causing long delays before messages show up. We need to figure out whether this is a problem in Openfire, a problem with the VM/server itself, or a firewall issue. These timeouts tend to occur 2-3 times per day while live chat is open. (There have been no recent code changes on production; the next push is scheduled for Tuesday, June 2.)

Matthew Middleton (:zzxc)

Assignee

Comment 1

•

16 years ago

This happened today, between 08:50 and 08:57. No messages were delivered during this time.

Matthew Middleton (:zzxc)

Assignee

Comment 2

•

16 years ago

This happened again starting at 09:32, lasting until 09:35.

Severity: major → critical

Aravind Gottipati [:aravind]

Comment 3

•

16 years ago

Both times you mentioned, there were corresponding Nagios alerts saying the application was down:

09:35 <@nagios> [15] dm-chat01:SMTPS cert - port 9091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds

Aravind Gottipati [:aravind]

Updated

•

16 years ago

Assignee: server-ops → aravind

Aravind Gottipati [:aravind]

Comment 4

•

16 years ago

Could this be caused by having too many users in the app? Have you noticed any trends or any patterns when the app goes down?

Matthew Middleton (:zzxc)

Assignee

Comment 5

•

16 years ago

This seems to happen when there is activity, such as when a new chat is accepted. The Openfire logs in /opt/openfire might contain errors relating to this, if it's caused by the application rather than the server or firewall.

Aravind Gottipati [:aravind]

Comment 6

•

16 years ago

We have a bunch of applications on VMs and we haven't noticed behavior like this on any of them. This leads me to suspect problems in the application itself. During this time, other checks on the box, like disk, load, and ping, seem to be fine. It's only the app-related checks like jabber/smtps that fail. I looked through the app logs, but I wasn't able to find anything conclusive. Are there any other log settings you could enable on the application?

Matthew Middleton (:zzxc)

Assignee

Comment 7

•

16 years ago

All four logging options are currently enabled in Openfire - errors, information, debug, and a packet log. It may help to see the last message received before the hang, since it could be caused by a plugin taking too long to process a message. Would it be possible to get the Openfire logs automatically copied to dm-sumotools01 along with the database dumps? No code changes have occurred since this problem started, so it's possible that this will be corrected by restarting the server. I already filed another bug for a restart coinciding with this week's plugin upgrade, bug 495743.
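One way to find the last message received before a hang is to scan a log for long silences. The following is a minimal sketch, not something taken from Openfire: the class name `LogGapFinder` is hypothetical, and it assumes each entry begins with a timestamp of the form `yyyy.MM.dd HH:mm:ss`, which should be checked against the real packet log. Lines without a parseable timestamp (such as stack-trace continuations) are skipped.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;

final class LogGapFinder {
    // ASSUMPTION: entries start with e.g. "2009.05.28 08:49:59"; adjust to the real format.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyy.MM.dd HH:mm:ss");
    private static final int STAMP_LENGTH = 19;

    /** Returns each timestamped line that is followed by a silence of at least minGap. */
    static List<String> lastLinesBeforeGaps(List<String> lines, Duration minGap) {
        List<String> result = new ArrayList<>();
        String previousLine = null;
        LocalDateTime previousTime = null;
        for (String line : lines) {
            LocalDateTime time = parseStamp(line);
            if (time == null) {
                continue; // no timestamp: treat as a continuation of the previous entry
            }
            if (previousTime != null
                    && Duration.between(previousTime, time).compareTo(minGap) >= 0) {
                result.add(previousLine);
            }
            previousLine = line;
            previousTime = time;
        }
        return result;
    }

    // Parses the leading timestamp, or returns null if the line does not start with one.
    private static LocalDateTime parseStamp(String line) {
        if (line.length() < STAMP_LENGTH) {
            return null;
        }
        try {
            return LocalDateTime.parse(line.substring(0, STAMP_LENGTH), STAMP);
        } catch (DateTimeParseException e) {
            return null;
        }
    }
}
```

This only helps for a log that goes quiet during the hang; a log that a busy thread keeps writing to will show no gap.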

Tom Ellins [:TMZ]

Comment 8

•

16 years ago

This just happened and is still occurring. It started at 15:53 and is ongoing.

Tom Ellins [:TMZ]

Comment 9

•

16 years ago

It has now ended and service is resumed. This needs attention.

Aravind Gottipati [:aravind]

Comment 10

•

16 years ago

Sent zzxc the logs from around 8:00 AM. @:TMZ: I am not sure what else we could be doing about this. What I am seeing on the server points to application problems. I have sent the app logs to the developer. If you have any further suggestions, please let me know.

Matthew Middleton (:zzxc)

Assignee

Comment 11

•

16 years ago

From the debug log, this is being caused by an infinite loop in org.jivesoftware.xmpp.workgroup.dispatcher. The dispatching thread is going into an infinite loop for up to several minutes, which is stopping the rest of the server. Re-assigning to myself.
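The diagnosis above came from the debug log. As a complementary check, a spinning thread can also be identified by sampling per-thread CPU time from inside the JVM. The sketch below is illustrative only: `BusyThreadSampler` is a hypothetical name, it would have to run inside the Openfire process (for example from a plugin), and it returns null when the JVM does not support or has disabled thread CPU timing.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

final class BusyThreadSampler {
    /** Returns the name of the thread that used the most CPU during the sample, or null. */
    static String busiestThread(long sampleMillis) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return null;
        }
        // First sample: CPU nanoseconds consumed so far by every live thread.
        Map<Long, Long> before = new HashMap<>();
        for (long id : threads.getAllThreadIds()) {
            before.put(id, threads.getThreadCpuTime(id));
        }
        try {
            Thread.sleep(sampleMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
        // Second sample: pick the thread with the largest increase.
        long bestId = -1;
        long bestDelta = 0;
        for (long id : threads.getAllThreadIds()) {
            long now = threads.getThreadCpuTime(id); // -1 if the thread has died
            Long then = before.get(id);
            if (now < 0 || then == null || then < 0) {
                continue; // thread started or ended during the sample
            }
            if (now - then > bestDelta) {
                bestDelta = now - then;
                bestId = id;
            }
        }
        ThreadInfo info = bestId < 0 ? null : threads.getThreadInfo(bestId);
        return info == null ? null : info.getThreadName();
    }
}
```

A thread that consumes close to the whole sample interval in CPU time while the server is unresponsive is a candidate for the kind of loop described here.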

Assignee: aravind → bugs

Component: Server Operations → Chat

Product: mozilla.org → support.mozilla.com

QA Contact: mrz → chat

Target Milestone: --- → 1.2

Version: other → unspecified

Matthew Middleton (:zzxc)

Assignee

Comment 12

•

16 years ago

Attached patch: Temporary patch

This is a temporary patch to fix the infinite loop by calling Thread.sleep() every time an offer fails to send. This will prevent the thread from hanging the process with an infinite loop. This dispatch class needs to be entirely rewritten - it shouldn't be using a thread for each chat request, and it should be offering older chats first. This temporary fix should solve the stability problem in the meantime. (Remaining issues will be fixed in bug 468182)
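The attached patch itself is not reproduced in this thread. As a rough sketch of the idea it describes, the loop below pauses with Thread.sleep() after each failed offer so a retrying thread cannot monopolize the CPU. The names `OfferRetry` and `offerWithPause`, and the attempt limit, are assumptions added for illustration and are not part of the original patch.

```java
import java.util.function.BooleanSupplier;

final class OfferRetry {
    /**
     * Calls sendOffer until it returns true or maxAttempts is reached,
     * sleeping pauseMillis after each failed attempt except the last.
     */
    static boolean offerWithPause(BooleanSupplier sendOffer, int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (sendOffer.getAsBoolean()) {
                return true;
            }
            if (attempt == maxAttempts) {
                break; // no point sleeping after the final failure
            }
            try {
                Thread.sleep(pauseMillis); // yield the CPU instead of spinning
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return false;
    }
}
```

The pause bounds how often a failing offer is retried, but it does not remove the underlying cost of one thread per chat request.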

Attachment #381190 - Flags: review?(laura)

Laura Thomson :laura

Updated

•

16 years ago

Summary: All connections to dm-chat01 time out → All connections to dm-chat01 time out [upstream]

Target Milestone: 1.2 → 1.3

Laura Thomson :laura

Updated

•

16 years ago

Target Milestone: 1.3 → 1.2

Laura Thomson :laura

Updated

•

16 years ago

Summary: All connections to dm-chat01 time out [upstream] → All connections to dm-chat01 time out

Laura Thomson :laura

Comment 13

•

16 years ago

Comment on attachment 381190: Temporary patch

duct_tape++ When do you want to do the rewrite?

Attachment #381190 - Flags: review?(laura) → review+

Matthew Middleton (:zzxc)

Assignee

Comment 14

•

16 years ago

Checked in as r28063, with a 100 ms delay. The rewrite will be in bug 468182.

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

Matthew Middleton (:zzxc)

Assignee

Updated

•

16 years ago

Depends on: 468182

Matthew Middleton (:zzxc)

Assignee

Comment 15

•

16 years ago

The patch that was checked in seems to have reduced the frequency of this bug, but it has still been happening about twice per week. The approximate timestamps when this has been reported are listed in bug 500588. As best as I can tell from the server logs, the hangs are still being caused by resource issues with the RoundRobinDispatcher class. This issue only occurs when chats are waiting in the queue, and the vast majority of events in the debug log are from this class. I am not able to reproduce this issue on a development server, however. The patch that I will be posting to bug 468182 should fix this completely by making the dispatching operation much more efficient.
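The patch for bug 468182 is not shown in this thread. Purely as an illustration of the two properties asked for in comment 12 (no thread per chat request, and older chats offered first), a dispatcher could keep waiting requests in one FIFO queue and drain it from a single periodic task. Everything in this sketch, including the `FifoDispatcher` name and the choice to stop a pass at the first request that cannot be placed, is an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Predicate;

final class FifoDispatcher<R> {
    // Oldest request sits at the head of the queue.
    private final Queue<R> waiting = new ArrayDeque<>();

    /** Adds a request to the back of the queue. */
    synchronized void enqueue(R request) {
        waiting.add(request);
    }

    /**
     * Runs one dispatch pass: offers requests oldest first and stops at the
     * first one that tryOffer rejects. Returns the requests that were accepted.
     */
    synchronized List<R> dispatchOnce(Predicate<R> tryOffer) {
        List<R> accepted = new ArrayList<>();
        while (!waiting.isEmpty() && tryOffer.test(waiting.peek())) {
            accepted.add(waiting.poll());
        }
        return accepted;
    }

    /** Number of requests still waiting. */
    synchronized int pending() {
        return waiting.size();
    }
}
```

A single thread could call dispatchOnce at a fixed delay, for example via ScheduledExecutorService.scheduleWithFixedDelay, so an empty or blocked queue costs one cheap pass per interval rather than one busy thread per request.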

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Target Milestone: 1.2 → 1.3

Tanner Filip [:tanner]

Comment 16

•

16 years ago

Openfire hung at ~1:40 CDT, probably from this same issue. Nagios did not say anything about it in #bmo.

Tom Ellins [:TMZ]

Comment 17

•

16 years ago

Nagios did report a critical message in the web interface: socket timeout.

Matthew Middleton (:zzxc)

Assignee

Updated

•

16 years ago

Target Milestone: 1.3 → 1.5

Matthew Middleton (:zzxc)

Assignee

Updated

•

16 years ago

Target Milestone: 1.5 → 1.4.2

Matthew Middleton (:zzxc)

Assignee

Comment 18

•

15 years ago

I disabled all Openfire plugins we aren't using, disabled server-to-server connections, and cleaned the conference database tables on production. Since then, the timeout issue hasn't happened again. Because it hasn't recurred for several weeks and I'm not positive what fixed it, ->WFM

Status: REOPENED → RESOLVED

Closed: 16 years ago → 15 years ago

Resolution: --- → WORKSFORME

Nobody; OK to take it and work on it

Updated

•

12 years ago

Product: support.mozilla.org → support.mozilla.org Graveyard