Closed Bug 495308 Opened 15 years ago Closed 14 years ago

All connections to dm-chat01 time out

Categories

(support.mozilla.org Graveyard :: Chat, defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: zzxc, Assigned: zzxc)

References

Details

Attachments

(1 file)

For the past week or so, all connections to dm-chat01 have been timing out at random times during the day for about a minute.  All packets sent to the server during this minute appear to be dropped or delayed, interrupting connections and/or causing long delays before messages show up.  We need to figure out whether this is a problem in Openfire, a problem with the VM/server itself, or a firewall issue.

These timeouts tend to occur 2-3 times per day while live chat is open.  (There have been no recent code changes on production; the next push is scheduled for Tuesday, June 2.)
This happened today, between 08:50-08:57.  No messages were delivered during this time.
This happened again starting at 09:32, lasting until 09:35.
Severity: major → critical
The two times you mentioned it, there were corresponding nagios alerts saying the application was down.

09:35 <@nagios> [15] dm-chat01:SMTPS cert - port 9091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
Assignee: server-ops → aravind
Could this be caused by having too many users in the app?  Have you noticed any trends or any patterns when the app goes down?
This seems to happen when there is activity, such as when a new chat is accepted.  The Openfire logs in /opt/openfire might contain errors relating to this, if it's caused by the application rather than the server or firewall.
We have a bunch of applications on VMs and we haven't noticed behavior like this on any of them.  This leads me to suspect problems in the application itself.  During this time other checks on the box, like disk, load and ping, seem to be fine.  It's only the app-related checks like jabber/smtps that fail.

I looked through the app logs, but I wasn't able to find anything conclusive.  Are there any other log settings you could enable on the application?
All four logging options are currently enabled in Openfire - errors, information, debug, and a packet log.  It may help to see the last message received before the hang, since it could be caused by a plugin taking too long to process a message.  Would it be possible to get the Openfire logs automatically copied to dm-sumotools01 along with the database dumps?

No code changes have occurred since this problem started, so it's possible that this will be corrected by restarting the server.  I already filed another bug for a restart coinciding with this week's plugin upgrade, bug 495743.
This just happened and is still occurring: started at 15:53 and ongoing.
It has now ended and service is resumed. This needs attention.
Sent zzxc the logs from around 8:00 AM.

@:TMZ: I am not sure what else we could be doing about this.  What I am seeing on the server points to application problems.  I have sent the app logs to the developer.  If you have any further suggestions, please let me know.
From the debug log, this is being caused by an infinite loop in org.jivesoftware.xmpp.workgroup.dispatcher.  The dispatching thread is going into an infinite loop for up to several minutes, which is stopping the rest of the server.

Re-assigning to myself.
Assignee: aravind → bugs
Component: Server Operations → Chat
Product: mozilla.org → support.mozilla.com
QA Contact: mrz → chat
Target Milestone: --- → 1.2
Version: other → unspecified
Attached patch: Temporary patch
This is a temporary patch to fix the infinite loop by calling Thread.sleep() every time an offer fails to send.  This will prevent the thread from hanging the process with an infinite loop.
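The idea behind the temporary patch can be sketched roughly as follows. This is not the attached patch or the Openfire code: the class name, the `sendOffer` callback, and the `maxAttempts` cap are hypothetical and exist only to keep the example self-contained and terminating. It only illustrates pausing with Thread.sleep() after each failed offer so a retry loop cannot spin at full speed.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; not the actual workgroup dispatcher code. */
class BackoffDispatchSketch {

    /**
     * Calls sendOffer until it succeeds or maxAttempts is reached.
     * After every failed attempt the thread sleeps for delayMillis,
     * so the loop yields the CPU instead of busy-looping.
     * (maxAttempts is an assumption added for this example.)
     */
    static boolean offerWithBackoff(BooleanSupplier sendOffer,
                                    long delayMillis,
                                    int maxAttempts) throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (sendOffer.getAsBoolean()) {
                return true; // offer was sent, stop retrying
            }
            Thread.sleep(delayMillis); // pause before the next attempt
        }
        return false; // gave up after maxAttempts failures
    }

    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // Simulated offer that fails twice and succeeds on the third call.
        boolean sent = offerWithBackoff(() -> ++calls[0] >= 3, 10, 10);
        System.out.println("sent=" + sent + " after " + calls[0] + " calls");
    }
}
```

A fixed delay like this only limits how hard the loop spins; it does not remove the underlying retry logic, which is why it is described as temporary.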

This dispatch class needs to be entirely rewritten - it shouldn't be using a thread for each chat request, and it should be offering older chats first.  This temporary fix should solve the stability problem in the meantime.  (Remaining issues will be fixed in bug 468182)
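As a rough illustration of the "older chats first" goal, a single shared queue ordered by arrival time could look like the sketch below. This is only one possible shape for such a rewrite, not the design from bug 468182; `OldestFirstQueueSketch`, `ChatRequest`, and its fields are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;

/** Hypothetical sketch of an oldest-first waiting queue. */
class OldestFirstQueueSketch {

    /** Stand-in for a queued chat request (assumed shape). */
    static final class ChatRequest {
        final String id;
        final long enqueuedAtMillis;

        ChatRequest(String id, long enqueuedAtMillis) {
            this.id = id;
            this.enqueuedAtMillis = enqueuedAtMillis;
        }
    }

    // One shared queue, ordered so the smallest enqueue time comes out first.
    private final PriorityBlockingQueue<ChatRequest> waiting =
            new PriorityBlockingQueue<>(
                    11, Comparator.comparingLong((ChatRequest r) -> r.enqueuedAtMillis));

    void add(ChatRequest request) {
        waiting.put(request);
    }

    /** Blocks until a request is waiting, then returns the oldest one. */
    ChatRequest takeOldest() throws InterruptedException {
        return waiting.take();
    }

    /** Returns the ids of the given requests in the order they would be offered. */
    static List<String> drainOrder(List<ChatRequest> requests) {
        OldestFirstQueueSketch queue = new OldestFirstQueueSketch();
        for (ChatRequest request : requests) {
            queue.add(request);
        }
        List<String> order = new ArrayList<>();
        ChatRequest next;
        while ((next = queue.waiting.poll()) != null) {
            order.add(next.id);
        }
        return order;
    }
}
```

A single dispatcher thread blocking in `takeOldest()` would then replace the thread-per-request approach, since a blocking take waits without consuming CPU while the queue is empty.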
Attachment #381190 - Flags: review?(laura)
Summary: All connections to dm-chat01 time out → All connections to dm-chat01 time out [upstream]
Target Milestone: 1.2 → 1.3
Target Milestone: 1.3 → 1.2
Summary: All connections to dm-chat01 time out [upstream] → All connections to dm-chat01 time out
Comment on attachment 381190 [details] [diff] [review]
Temporary patch

duct_tape++

When do you want to do the rewrite?
Attachment #381190 - Flags: review?(laura) → review+
Checked in r28063, with a 100ms delay

The rewrite will be in bug 468182
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Depends on: 468182
The patch that was checked in seems to have reduced the frequency of this bug, but it has still been happening about twice per week.  The approximate timestamps when this has been reported are listed in bug 500588.

As best as I can tell from the server logs, the hangs are still being caused by resource issues with the RoundRobinDispatcher class.  This issue only occurs when chats are waiting in the queue, and the vast majority of events in the debug log are from this class.  I am not able to reproduce this issue on a development server, however.

The patch that I will be posting to bug 468182 should fix this completely by making the dispatching operation much more efficient.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 1.2 → 1.3
Openfire hung at ~1:40 CDT, probably this same issue. Nagios did not say anything about it in #bmo.
Nagios did give a critical message via the web interface, socket timeout.
Target Milestone: 1.3 → 1.5
Target Milestone: 1.5 → 1.4.2
I disabled all Openfire plugins we aren't using, disabled server-to-server connections, and cleaned the conference database tables on production.  Since I did this, the timeout issue hasn't happened again.

Since this hasn't happened for several weeks and I'm not positive what fixed it, ->WFM
Status: REOPENED → RESOLVED
Closed: 15 years ago → 14 years ago
Resolution: --- → WORKSFORME
Product: support.mozilla.org → support.mozilla.org Graveyard