Closed
Bug 1094272
Opened 10 years ago
Closed 10 years ago
[pulsetranslator] With SSL enabled, publishing of normalized messages gets stuck
Categories
(Webtools :: Pulse, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mcote, Unassigned)
References
Details
Attachments
(2 files)
We are seeing errors like this in the pulsetranslator:
build.mozilla-aurora.win32.opt.l10n.nightly.mozilla-aurora-win32-l10n-nightly
Traceback (most recent call last):
File "/home/webtools/apps/pulsetranslator/src/pulsetranslator/pulsetranslator/translatorqueues.py", line 26, in publish_message
publisher.publish(msg)
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/mozillapulse/publishers.py", line 62, in publish
routing_key=message.routing_key)
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/messaging.py", line 85, in __init__
self.revive(self._channel)
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/messaging.py", line 218, in revive
self.declare()
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/messaging.py", line 105, in declare
self.exchange.declare()
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/entity.py", line 163, in declare
return self.channel.exchange_declare(
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/abstract.py", line 115, in channel
channel = self._channel = channel()
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/utils/__init__.py", line 422, in __call__
value = self.__value__ = self.__contract__()
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/messaging.py", line 205, in <lambda>
channel = ChannelPromise(lambda: connection.default_channel)
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/connection.py", line 758, in default_channel
self._default_channel = self.channel()
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/connection.py", line 242, in channel
chan = self.transport.create_channel(self.connection)
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/transport/pyamqp.py", line 88, in create_channel
return connection.channel()
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/amqp/connection.py", line 281, in channel
return self.Channel(self, channel_id)
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/amqp/channel.py", line 99, in __init__
self._x_open()
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/amqp/channel.py", line 427, in _x_open
(20, 11), # Channel.open_ok
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/amqp/abstract_channel.py", line 67, in wait
self.channel_id, allowed_methods)
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/amqp/connection.py", line 240, in _wait_method
self.method_reader.read_method()
File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/amqp/method_framing.py", line 189, in read_method
raise m
IOError: Socket closed
This may be related to the network issues reported in bug 1083166. Regardless, we should never drop messages.
We could put in a simple retry loop, but this won't help if there's a prolonged outage. pulsetranslator should be queuing its messages somewhere and only removing them from the queue when the publish() call succeeds.
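For illustration, a minimal Python sketch of that idea: spool each normalized message to disk before publishing and delete it only after publish() returns without raising. The directory name and helper functions are made up for this sketch; the publisher stands in for whatever object exposes publish().

import json
import os
import time
import uuid

QUEUE_DIR = "/tmp/pulsetranslator-outbox"  # hypothetical spool directory


def enqueue(msg):
    # Persist the message so it survives a crash or restart.
    if not os.path.isdir(QUEUE_DIR):
        os.makedirs(QUEUE_DIR)
    path = os.path.join(QUEUE_DIR, "%s.json" % uuid.uuid4().hex)
    with open(path, "w") as f:
        json.dump(msg, f)
    return path


def flush(publisher, retry_delay=30):
    # Try to publish every spooled message; forget a file only once it is sent.
    for name in sorted(os.listdir(QUEUE_DIR)):
        path = os.path.join(QUEUE_DIR, name)
        with open(path) as f:
            msg = json.load(f)
        while True:
            try:
                publisher.publish(msg)
                os.remove(path)
                break
            except IOError:
                # Prolonged outage: keep the file on disk and retry later.
                time.sleep(retry_delay)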
Comment 1•10 years ago (Assignee)
Actually, that's probably overkill. If we presume that a prolonged outage means we won't be able to *receive* messages either, then we don't need a local queue. We should just retry the current message until it succeeds before consuming the next message. We should also cache the current message in case pulsetranslator is restarted when in a publish loop.
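A minimal sketch of that retry-until-success approach, assuming the same publisher.publish() call seen in the traceback above; the delay and logger name are illustrative only.

import logging
import time

logger = logging.getLogger("pulsetranslator")


def publish_message(publisher, msg, retry_delay=10):
    # Block until the message has been published; never drop it. The consumer
    # does not move on to the next message until this one has gone out.
    while True:
        try:
            publisher.publish(msg)
            return
        except Exception:
            logger.exception("failed to publish, retrying in %ds", retry_delay)
            time.sleep(retry_delay)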
Comment 2•10 years ago (Assignee)
Also, we should *really* have timestamps in the error logs.
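For reference, timestamps only need a format string in the standard logging setup; the file name here is a placeholder, not the actual pulsetranslator configuration.

import logging

logging.basicConfig(
    filename="translator.log",  # placeholder log file name
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

logging.getLogger("pulsetranslator").error("Socket closed while publishing")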
Comment 3•10 years ago
Something else I see as important is to add Nagios checks for that host and pulsetranslator-specific stats. Or, if that is overkill, let's send an email to someone so we are aware when lots of messages haven't been sent yet. This can totally be a follow-up; for now the important thing for us is to get correct publishing in place. Thanks!
Whiteboard: [qa-automation-blocked]
Comment 4•10 years ago (Assignee)
Well, now I'm totally confused. We *do* have a retry loop, an infinite one in fact: https://github.com/mozilla/pulsetranslator/blob/master/pulsetranslator/translatorqueues.py#L23
I was confused because we still log the exception, but we keep trying. The fact that the exception I noted above appears only once in the log file implies that it actually succeeded on the second try.
So I'm back to being baffled by this bug.
Comment 5•10 years ago (Assignee)
I restarted the pulsetranslator, and I can see messages coming out again. This is still deeply mysterious, though--how pulsetranslator's queue could be consumed without any messages being published is very strange.
Comment 6•10 years ago (Assignee)
Just to be clear: it appears *no* messages were flowing out of pulsetranslator, which is why whimboo didn't see any today. I'm not sure how long this has been going on for, but we really need some sort of monitoring that alerts someone if the flow of normalized buildbot messages stops, since that should normally never happen.
Comment 7•10 years ago
Mark, some questions:
* While the queue was piling up for the former instance of pulsetranslator, did a restart clear it completely by sending out all the queued-up normalized messages?
* I wonder how long it takes for pulsetranslator to reach such a bad state. Is it months, weeks, days, or even a single day? The number of messages processed might also play into this, or the network issues (bug 1083166), which could cause a hiccup in pulsetranslator. Can we at least add some logging of the number of entries in the queue, so we can see when it goes above a given threshold?
* Do we explicitly log what prevented pulsetranslator from publishing the message? Could you attach the error log for inspection?
* Can we get this host manually spot-checked each day until a proper fix has been made available and deployed? This is critical for us, given that our automation relies completely on those messages; if they are not received, no test results will be available for that day.
Assignee: nobody → mcote
Summary: pulsetranslator failing to publish some messages → pulsetranslator completely failing to publish normalized messages
Comment 8•10 years ago
All candidate builds for Firefox 33.0.3 have been published to the FTP server (ftp://ftp.mozilla.org/pub/firefox/candidates/33.0.3-candidates/build2/), but again not a single message from pulsetranslator!
Raising severity given that this is a blocker for us at the moment.
Severity: normal → blocker
Comment 9•10 years ago
Interestingly, the messages finally came through. I wonder what problems we had in publishing them.
Comment 10•10 years ago (Assignee)
I believe the child process, which publishes the messages, must have crashed, leaving the parent process continuing to consume messages and write them to its internal queue, effectively losing them.
There's no good reason for using multiple processes here; we can rely on the fact that, if it takes a while to publish a message, RabbitMQ will queue the others, and I don't think there's any time limit on the consumer callback. Thus I combined them into one process, so at least if it crashes, it'll be a lot more obvious (the Pulse queue will start growing), and we won't lose any messages.
I didn't implement any sort of caching here, but the easiest way to do this would be again to let RabbitMQ do the work--the message should only be acknowledged once it is successfully published (or we give up). Thus if pulsetranslator is shut down while trying to publish a message, the original message will be unacked and therefore consumed again when it restarts.
I also added timestamps to the logs, improved the README a little, and did a few miscellaneous cleanups.
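A rough kombu-style sketch of that single-process design: publish the normalized message first and ack the raw one only afterwards, so an unacked message is redelivered after a restart. The exchange/queue names and the normalize() helper are placeholders, not the actual pulsetranslator code (see the attached PR for that).

from kombu import Connection, Exchange, Queue

raw_exchange = Exchange("exchange/build", type="topic")  # placeholder names
raw_queue = Queue("queue/pulsetranslator/build", exchange=raw_exchange,
                  routing_key="#")
normalized_exchange = Exchange("exchange/build/normalized", type="topic")


def normalize(body):
    # Stand-in for the real normalization logic.
    return body


def run(amqp_url):
    with Connection(amqp_url) as conn:
        producer = conn.Producer(serializer="json")

        def on_message(body, message):
            normalized = normalize(body)
            # If publish() raises we never reach ack(), so the raw message
            # stays unacked and is delivered again on reconnect.
            producer.publish(normalized,
                             exchange=normalized_exchange,
                             routing_key=message.delivery_info["routing_key"],
                             declare=[normalized_exchange])
            message.ack()

        with conn.Consumer(raw_queue, callbacks=[on_message]):
            while True:
                conn.drain_events()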
Attachment #8518608 - Flags: review?(jgriffin)
Comment 11•10 years ago
We had this issue again today, with the Firefox 34 beta 7 builds. There was a timeout and we didn't receive the notifications.
Comment 12•10 years ago (Assignee)
Yeah, looks like it half-crashed again. We should be able to get this fix out today.
Comment 13•10 years ago
What I really wonder is why the hell all this suddenly started happening now!? This code has been in place for so long and never caused such massive problems. Is it caused by some system updates on the pulsetranslator machine, or really by the suspicious network issues?
But good to hear that we can get it fixed today! We really appreciate that. Thanks!
Comment 14•10 years ago (Assignee)
The crash is due to an unexpected value of the 'platform' field (None). My guess is that something was changed in buildbot that started sending these out.
Comment 15•10 years ago
(In reply to Mark Côté [:mcote] from comment #14)
> The crash is due to an unexpected value of the 'platform' field (None). My
> guess is that something was changed in buildbot that started sending these
> out.
Do you have such a message available, and can you attach it to the bug please? I would like to know more about it. Thanks.
Flags: needinfo?(mcote)
Comment 16•10 years ago
Comment on attachment 8518608 [details] [review]
GitHub PR
r+, but if this causes us to trigger Nagios alerts, we may have to back out and re-implement with a threading queue.
Attachment #8518608 - Flags: review?(jgriffin) → review+
Comment 17•10 years ago (Assignee)
Okay, it's not what I thought. The child process wasn't crashing; it was just getting stuck for really long periods of time--so that explains the very delayed messages whimboo was seeing. The delays I saw were during SSL setup or teardown. I disabled SSL for the pulsetranslator publisher, and I haven't seen it get stuck since.
So yeah, further evidence that weird things are happening on that network.
Flags: needinfo?(mcote)
Comment 18•10 years ago
Interesting news, Mark! I will forward this information to the networking bug once I have an answer to my next question... Which port is the SSL connection using on the Pulse side? I need that info to give the networking people the right filter to search with.
By the way, would disabling SSL for pulsetranslator also require changing pulsebuildmonitor to not use SSL for now? I ask because it will be a different port, and in our network we currently only have:
dest: pulse.mozilla.org ; port 5672
Flags: needinfo?(mcote)
Updated•10 years ago
Summary: pulsetranslator completely failing to publish normalized messages → [pulsetranslator] With SSL enabled, publishing of normalized messages gets stuck
Comment 19•10 years ago (Assignee)
Regular (non-SSL) AMQP is on port 5672; SSL is on port 5671.
Every connection to Pulse is separate, so the fact that pulsetranslator is currently publishing on the non-SSL port has no effect on any SSL connections from consumers or other publishers.
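For illustration, the port difference expressed as raw kombu connection URLs; the credentials are placeholders, and the actual pulsetranslator code configures this through mozillapulse rather than raw URLs.

from kombu import Connection

# Plain AMQP on 5672 (what the pulsetranslator publisher uses for now).
plain = Connection("amqp://user:password@pulse.mozilla.org:5672//")

# AMQP over SSL on 5671 (what other publishers and consumers keep using).
secure = Connection("amqp://user:password@pulse.mozilla.org:5671//", ssl=True)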
Flags: needinfo?(mcote)
Comment 20•10 years ago
(In reply to Mark Côté [:mcote] from comment #19)
> Every connection to Pulse is separate, so the fact that pulsetranslator is
> currently publishing on the non-SSL port has no effect on any SSL
> connections from consumers or other publishers.
Sure, but other consumers can have the same issue with SSL connections and would not receive messages. That's why I'm asking.
So it looks like we have to get our ACLs for Mozmill and TPS updated to include 5672 too.
Comment 21•10 years ago
I filed bug 1102454 to get our ACLs enabled for SSL connections via port 5671. I will shortly update the network bug regarding the timeouts.
Updated•10 years ago
Whiteboard: [qa-automation-blocked]
Comment 22•10 years ago
We got a weird notification today.
It was received on the 24th for what is presumably a build from the 22nd (with some fields correct; see "previous_buildid", which is indeed yesterday's build).
Comment 23•10 years ago
Mark, can you please check whether this was a fault in pulsetranslator, or whether the received message is correct and the fault lies with the build itself?
Flags: needinfo?(mcote)
Comment 24•10 years ago
Mark, what Andrei forgot to add is that we already got such a notification yesterday, so we are a bit surprised about this additional one coming in nearly a day later. We have had this in the past already, so I wonder if we sometimes produce duplicates in pulsetranslator. If not, might we be receiving those dupes from Pulse directly?
Comment 25•10 years ago (Assignee)
I suppose it *might* be possible for a duplicate to be sent, if an exception occurred after the message was sent while the connection was being closed, or thereabouts. It coming a day later, however, is very weird, unless pulsetranslator was unable to send any messages for a whole day.
Perhaps we need an independent system listening to both raw and normalized build messages and correlating them, issuing alerts when a message should be received but isn't or is received more than once. Given all the weirdness lately, it's probably a good idea. Shouldn't be too hard... I'll find time, or perhaps find a contributor.
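A rough sketch of that correlation idea: remember each raw build message and expect exactly one normalized counterpart within a deadline. The key format, deadline, and alert() are placeholders; a real version would feed these functions from Pulse consumers on both exchanges.

import time
from collections import defaultdict

DEADLINE = 30 * 60  # seconds to wait for the normalized message (assumed)

pending = {}                          # key -> time the raw message was seen
normalized_counts = defaultdict(int)  # key -> normalized messages received


def alert(reason, key):
    print("ALERT: %s for %s" % (reason, key))


def saw_raw(key):
    pending[key] = time.time()


def saw_normalized(key):
    normalized_counts[key] += 1
    if normalized_counts[key] > 1:
        alert("duplicate normalized message", key)
    pending.pop(key, None)


def check_deadlines():
    # Call periodically; anything still pending past the deadline is missing.
    now = time.time()
    for key, seen in list(pending.items()):
        if now - seen > DEADLINE:
            alert("no normalized message received", key)
            del pending[key]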
Flags: needinfo?(mcote)
Comment 27•10 years ago (Assignee)
Forgot to mention that I landed the patch to remove process separation a while ago, which helped me track down the SSL issue: https://github.com/mozilla/pulsetranslator/commit/2c34b9ac863d55de1f963e845ce9a42f4d8bcd7a
Comment 28•10 years ago
(In reply to Mark Côté [:mcote] from comment #25)
> Perhaps we need an independent system listening to both raw and normalized
> build messages and correlating them, issuing alerts when a message should be
> received but isn't or is received more than once. Given all the weirdness
> lately, it's probably a good idea. Shouldn't be too hard... I'll find time,
> or perhaps find a contributor.
That would be great. Sadly I won't have the time for that in the next few weeks. :/
Comment 29•10 years ago
Just a quick update here: since we disabled SSL, we have not missed a single build message! So SSL is definitely to blame.
Comment 30•10 years ago (Assignee)
We've now moved Pulse to CloudAMQP and pulsetranslator to Heroku. It is back to using SSL, as whatever was causing problems in the datacentre shouldn't be an issue any longer. Please reopen if you see missing messages.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 32•10 years ago
(In reply to Mark Côté [:mcote] from comment #30)
> We've now moved Pulse to CloudAMQP and pulsetranslator to Heroku. It is
This move actually broke our whole mozmill-ci setup, and we are no longer able to access Pulse due to network route restrictions in SCL3. This move should have been announced to consumers of Pulse. Also, I don't see a single bug in the Bugzilla Pulse component about this move. :(
I think that with this move we have at least one new host for which an ACL entry has to be added.
Flags: needinfo?(mcote)
Flags: needinfo?(jgriffin)
Comment 33•10 years ago
By the way, I'm handling our currently broken state in bug 1215464.
Comment 34•10 years ago
And bug 1205867 is actually the regressor, so I will move the needinfo requests and everything else over to our bug 1215464 and bug 1205867.
Flags: needinfo?(mcote)
Flags: needinfo?(jgriffin)