Closed
Bug 495308
Opened 15 years ago
Closed 14 years ago
All connections to dm-chat01 time out
Categories
(support.mozilla.org Graveyard :: Chat, defect)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
1.4.2
People
(Reporter: zzxc, Assigned: zzxc)
Attachments
(1 file)
844 bytes, patch | laura: review+
For the past week or so, all connections to dm-chat01 have been timing out at random times during the day, for about a minute each time. All packets sent to the server during that minute appear to be dropped or delayed, interrupting connections and/or causing long delays before messages show up. We need to figure out whether this is a problem in Openfire, a problem with the VM/server itself, or a firewall issue. These timeouts tend to occur 2-3 times per day while live chat is open. (There have been no recent code changes on production; the next push is scheduled for Tuesday, June 2.)
Assignee
Comment 1•15 years ago
This happened today between 08:50 and 08:57. No messages were delivered during this time.
Assignee
Comment 2•15 years ago
This happened again starting at 09:32, lasting until 09:35.
Severity: major → critical
Comment 3•15 years ago
Both times you mentioned, there were corresponding nagios alerts saying the application was down:
09:35 <@nagios> [15] dm-chat01:SMTPS cert - port 9091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
Updated•15 years ago
Assignee: server-ops → aravind
Comment 4•15 years ago
Could this be caused by having too many users in the app? Have you noticed any trends or any patterns when the app goes down?
Assignee
Comment 5•15 years ago
This seems to happen when there is activity, such as when a new chat is accepted. The Openfire logs in /opt/openfire might contain errors relating to this, if it's caused by the application rather than the server or firewall.
Comment 6•15 years ago
We have a bunch of applications on VMs and we haven't noticed behavior like this on any of them, which leads me to suspect a problem in the application itself. During these outages, other checks on the box (disk, load, and ping) seem to be fine; it's only the app-related checks like jabber/smtps that fail. I looked through the app logs but wasn't able to find anything conclusive. Are there any other log settings you could enable on the application?
Assignee
Comment 7•15 years ago
All four logging options are currently enabled in Openfire - errors, information, debug, and a packet log. It may help to see the last message received before the hang, since it could be caused by a plugin taking too long to process a message. Would it be possible to get the Openfire logs automatically copied to dm-sumotools01 along with the database dumps? No code changes have occurred since this problem started, so it's possible that this will be corrected by restarting the server. I already filed another bug for a restart coinciding with this week's plugin upgrade, bug 495743.
Comment 8•15 years ago
This just happened and is still occurring. Started at 15:53 and ongoing.
Comment 9•15 years ago
It has now ended and service is resumed. This needs attention.
Comment 10•15 years ago
Sent zzxc the logs from around 8:00 AM. @:TMZ: I am not sure what else we could be doing about this. What I am seeing on the server points to application problems. I have sent the app logs to the developer. If you have any further suggestions, please let me know.
Assignee
Comment 11•15 years ago
According to the debug log, this is caused by an infinite loop in org.jivesoftware.xmpp.workgroup.dispatcher. The dispatching thread spins in the loop for up to several minutes, which stalls the rest of the server. Re-assigning to myself.
Assignee: aravind → bugs
Component: Server Operations → Chat
Product: mozilla.org → support.mozilla.com
QA Contact: mrz → chat
Target Milestone: --- → 1.2
Version: other → unspecified
Assignee
Comment 12•15 years ago
This is a temporary patch that works around the infinite loop by calling Thread.sleep() every time an offer fails to send, so the retrying thread can no longer hang the process by spinning. This dispatch class needs to be entirely rewritten: it shouldn't be using a thread for each chat request, and it should be offering older chats first. This temporary fix should solve the stability problem in the meantime. (Remaining issues will be fixed in bug 468182.)
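For illustration only, the idea of pausing after each failed offer can be sketched in Java as below. This is not the attached patch: `OfferRetryLoop`, `dispatch`, `maxAttempts`, and the `sendOffer` callback are hypothetical names, and the bounded attempt count is an assumption added so the sketch terminates. The real dispatcher code is not shown in this bug.

```java
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for a dispatch retry loop; not the Openfire code.
final class OfferRetryLoop {
    /**
     * Calls sendOffer until it reports success or maxAttempts is reached.
     * After every failed attempt the thread sleeps for delayMillis, so a
     * persistently failing offer cannot spin the CPU in a tight loop.
     * Returns true if an offer was sent, false otherwise.
     */
    static boolean dispatch(BooleanSupplier sendOffer, int maxAttempts, long delayMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (sendOffer.getAsBoolean()) {
                return true;
            }
            try {
                // Yield the CPU before retrying instead of looping immediately.
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and stop retrying.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

With a delay of 100 ms, a failing offer would be retried at most about ten times per second rather than continuously.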
Attachment #381190 - Flags: review?(laura)
Updated•15 years ago
Summary: All connections to dm-chat01 time out → All connections to dm-chat01 time out [upstream]
Target Milestone: 1.2 → 1.3
Updated•15 years ago
Target Milestone: 1.3 → 1.2
Updated•15 years ago
Summary: All connections to dm-chat01 time out [upstream] → All connections to dm-chat01 time out
Comment 13•15 years ago
Comment on attachment 381190 (Temporary patch)
duct_tape++ When do you want to do the rewrite?
Attachment #381190 - Flags: review?(laura) → review+
Assignee
Comment 14•15 years ago
Checked in r28063, with a 100ms delay. The rewrite will be in bug 468182.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Assignee
Comment 15•15 years ago
The patch that was checked in seems to have reduced the frequency of this bug, but it has still been happening about twice per week. The approximate timestamps of the reported occurrences are listed in bug 500588. As best I can tell from the server logs, the hangs are still being caused by resource issues with the RoundRobinDispatcher class. This issue only occurs when chats are waiting in the queue, and the vast majority of events in the debug log are from this class. I am not able to reproduce this issue on a development server, however. The patch that I will be posting to bug 468182 should fix this completely by making the dispatching operation much more efficient.
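As a rough sketch of the direction described earlier in this bug (no thread per chat request, older chats offered first), a single dispatcher working from one FIFO queue could look like the Java below. This is an assumption-laden illustration, not the RoundRobinDispatcher class or the patch for bug 468182: `OldestFirstDispatcher`, `enqueue`, `dispatchPass`, and the `offer` callback are all hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Predicate;

// Hypothetical sketch; not the Openfire dispatcher or the bug 468182 patch.
final class OldestFirstDispatcher {
    // Chat request IDs in arrival order; the head is the oldest waiting chat.
    private final Queue<String> waiting = new ArrayDeque<>();

    synchronized void enqueue(String chatId) {
        waiting.add(chatId);
    }

    synchronized int waitingCount() {
        return waiting.size();
    }

    /**
     * One dispatch pass, meant to be called periodically from a single thread.
     * Offers the oldest waiting chat first and stops at the first chat that is
     * not accepted, so newer chats never jump ahead of older ones.
     */
    synchronized List<String> dispatchPass(Predicate<String> offer) {
        List<String> accepted = new ArrayList<>();
        while (!waiting.isEmpty() && offer.test(waiting.peek())) {
            accepted.add(waiting.poll());
        }
        return accepted;
    }
}
```

Because each pass does a bounded amount of work and then returns, a caller can schedule passes at a fixed interval instead of keeping one retrying thread alive per waiting chat.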
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 1.2 → 1.3
Comment 16•15 years ago
Openfire hung at ~1:40 CDT, probably this same issue. Nagios did not say anything about it in #bmo.
Comment 17•15 years ago
Nagios did report a critical socket timeout via the web interface.
Assignee
Updated•15 years ago
Target Milestone: 1.3 → 1.5
Assignee
Updated•15 years ago
Target Milestone: 1.5 → 1.4.2
Assignee
Comment 18•14 years ago
I disabled all Openfire plugins we aren't using, disabled server-to-server connections, and cleaned the conference database tables on production. Since I did this, the timeout issue hasn't recurred. Since this hasn't happened for several weeks and I'm not positive what fixed it, ->WFM
Status: REOPENED → RESOLVED
Closed: 15 years ago → 14 years ago
Resolution: --- → WORKSFORME
Updated•11 years ago
Product: support.mozilla.org → support.mozilla.org Graveyard