Closed Bug 1094272 Opened 10 years ago Closed 10 years ago

[pulsetranslator] With SSL enabled, publishing of normalized messages gets stuck

Categories

(Webtools :: Pulse, defect)

defect
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mcote, Unassigned)

References

Details

Attachments

(2 files)

We are seeing errors like this in the pulsetranslator:

build.mozilla-aurora.win32.opt.l10n.nightly.mozilla-aurora-win32-l10n-nightly
Traceback (most recent call last):
  File "/home/webtools/apps/pulsetranslator/src/pulsetranslator/pulsetranslator/translatorqueues.py", line 26, in publish_message
    publisher.publish(msg)
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/mozillapulse/publishers.py", line 62, in publish
    routing_key=message.routing_key)
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/messaging.py", line 85, in __init__
    self.revive(self._channel)
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/messaging.py", line 218, in revive
    self.declare()
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/messaging.py", line 105, in declare
    self.exchange.declare()
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/entity.py", line 163, in declare
    return self.channel.exchange_declare(
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/abstract.py", line 115, in channel
    channel = self._channel = channel()
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/utils/__init__.py", line 422, in __call__
    value = self.__value__ = self.__contract__()
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/messaging.py", line 205, in <lambda>
    channel = ChannelPromise(lambda: connection.default_channel)
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/connection.py", line 758, in default_channel
    self._default_channel = self.channel()
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/connection.py", line 242, in channel
    chan = self.transport.create_channel(self.connection)
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/kombu/transport/pyamqp.py", line 88, in create_channel
    return connection.channel()
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/amqp/connection.py", line 281, in channel
    return self.Channel(self, channel_id)
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/amqp/channel.py", line 99, in __init__
    self._x_open()
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/amqp/channel.py", line 427, in _x_open
    (20, 11), # Channel.open_ok
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/amqp/abstract_channel.py", line 67, in wait
    self.channel_id, allowed_methods)
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/amqp/connection.py", line 240, in _wait_method
    self.method_reader.read_method()
  File "/home/webtools/apps/pulsetranslator/lib/python2.6/site-packages/amqp/method_framing.py", line 189, in read_method
    raise m
IOError: Socket closed

This may be related to the network issues reported in bug 1083166. Regardless, we should never drop messages. We could put in a simple retry loop, but this won't help if there's a prolonged outage. pulsetranslator should be queuing its messages somewhere and only removing them from the queue when the publish() call succeeds.
Actually, that's probably overkill. If we presume that a prolonged outage means we won't be able to *receive* messages either, then we don't need a local queue. We should just retry the current message until it succeeds before consuming the next one. We should also cache the current message in case pulsetranslator is restarted while stuck in a publish-retry loop.
Also, we should *really* have timestamps in the error logs.
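A minimal sketch of that retry-until-success idea, together with timestamped logging; this assumes a mozillapulse-style publisher whose publish() raises on failure, and the delay value is purely illustrative:

    import logging
    import time

    # Timestamped log format, as requested above.
    logging.basicConfig(format='%(asctime)s %(levelname)s %(message)s')

    RETRY_DELAY = 5  # seconds between attempts; value is illustrative


    def publish_with_retry(publisher, msg):
        """Retry the current message until publish() succeeds, blocking
        consumption of the next message until then."""
        while True:
            try:
                publisher.publish(msg)
                return
            except Exception:
                logging.exception('publish failed; retrying in %ss', RETRY_DELAY)
                time.sleep(RETRY_DELAY)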
Something else I see as important is to add Nagios checks for that host and pulsetranslator-specific stats. Or, if that is overkill, let's send an email to someone so we are aware when lots of messages haven't been sent yet. This can totally be a follow-up; for now it's very important for us to get correct publishing in place. Thanks!
Whiteboard: [qa-automation-blocked]
Well, now I'm totally confused. We *do* have a retry loop, an infinite one in fact: https://github.com/mozilla/pulsetranslator/blob/master/pulsetranslator/translatorqueues.py#L23 I was confused because we still log the exception, but we keep trying. The fact that the exception I noted above appears only once in the log file implies that it actually succeeded on the second try. So I'm back to being baffled by this bug.
I restarted the pulsetranslator, and I can see messages coming out again. This is still deeply mysterious, though--how pulsetranslator's queue was being consumed while no messages were being published is very strange.
Just to be clear: it appears *no* messages were flowing out of pulsetranslator, which is why whimboo didn't see any today. I'm not sure how long this has been going on for, but we really need some sort of monitoring that alerts someone if the flow of normalized buildbot messages stops, since that should normally never happen.
Mark, so some questions:

* While the queue was piling up for the former instance of pulsetranslator, did a restart clear it completely by sending out all the queued-up normalized messages?
* I wonder how long it takes for pulsetranslator to reach such a bad state. Is it months, weeks, days, or even a single day? Maybe the number of messages processed plays into it, or the network issues (bug 1083166), which could cause a hiccup in pulsetranslator. Can we at least add some logging for the number of entries in the queue, so we can see when it climbs above a given threshold?
* Do we explicitly log what prevented pulsetranslator from publishing the message? Could you attach the error log for inspection?
* Can this host be manually spot-checked each day until a proper fix has been made available and deployed? This is critical for us, given that our automation relies completely on those messages; if they aren't received, no test results will be available for that day.
Assignee: nobody → mcote
Summary: pulsetranslator failing to publish some messages → pulsetranslator completely failing to publish normalized messages
All candidate builds for Firefox 33.0.3 have been published to the FTP server (ftp://ftp.mozilla.org/pub/firefox/candidates/33.0.3-candidates/build2/), but again not a single message from pulsetranslator! Raising severity given that this is a blocker for us at the moment.
Severity: normal → blocker
Interestingly, the messages finally came through. I wonder what problems we had publishing them.
Attached file GitHub PR
I believe the child process, which publishes the messages, must have crashed, leaving the parent process continuing to consume messages and write them to its internal queue, effectively losing them.

There's no good reason for using multiple processes here; we should be using the fact that, if it takes a while to publish a message, RabbitMQ will queue the others, and I don't think there's any time limit to the consumer callback. Thus I combined them into one process, so at least if it crashes, it'll be a lot more obvious (Pulse queue will start growing), and we won't lose any messages.

I didn't implement any sort of caching here, but the easiest way to do this would be again to let RabbitMQ do the work--the message should only be acknowledged once it is successfully published (or we give up). Thus if pulsetranslator is shut down while trying to publish a message, the original message will be unacked and therefore consumed again when it restarts.

I also added timestamps to the logs, improved the README a little, and did a few miscellaneous cleanups.
Attachment #8518608 - Flags: review?(jgriffin)
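For reference, a rough sketch of the consume-publish-ack pattern described above, written against plain kombu rather than the actual pulsetranslator code; the exchange and queue names are placeholders and normalize() is hypothetical:

    from kombu import Connection, Exchange, Queue

    # Placeholder names; the real pulsetranslator exchanges differ.
    RAW_EXCHANGE = Exchange('exchange/build', type='topic')
    RAW_QUEUE = Queue('queue/pulsetranslator/build', exchange=RAW_EXCHANGE,
                      routing_key='#')
    NORM_EXCHANGE = Exchange('exchange/build/normalized', type='topic')


    def run(amqp_url):
        with Connection(amqp_url) as conn:
            producer = conn.Producer(serializer='json')

            def on_raw_message(body, message):
                # normalize() stands in for the existing translation step.
                normalized, routing_key = normalize(body)
                # Publish first; only ack the raw message once publishing
                # succeeded, so an unacked message is redelivered if the
                # process dies mid-publish.
                producer.publish(normalized, exchange=NORM_EXCHANGE,
                                 routing_key=routing_key)
                message.ack()

            with conn.Consumer(RAW_QUEUE, callbacks=[on_raw_message],
                               accept=['json']):
                while True:
                    conn.drain_events()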
We had this issue again today with the Firefox 34 beta 7 builds. There was a timeout and we didn't receive the notifications.
Yeah looks like it half-crashed again. We should be able to get this fix out today.
What I really wonder is why the hell all of this started right now, all of a sudden!? This code has been in place for so long and never caused such massive problems. Is it caused by some system updates on the pulsetranslator machine, or really by the suspicious network issues? But good to hear that we can get it fixed today! We really appreciate that. Thanks!
The crash is due to an unexpected value of the 'platform' field (None). My guess is that something was changed in buildbot that started sending these out.
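If that guess holds, a small defensive guard along these lines would keep a bad message from taking the publisher down; the field access and fallback value are assumptions for illustration, not the actual pulsetranslator code:

    import logging


    def safe_platform(data):
        """Guard against messages where 'platform' is missing or None,
        which would otherwise crash the normalization/publish step."""
        platform = data.get('platform')
        if not platform:
            logging.warning('message %s has no platform; using a fallback',
                            data.get('buildid'))
            platform = 'unknown'  # fallback chosen for illustration only
        return platform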
(In reply to Mark Côté [:mcote] from comment #14)
> The crash is due to an unexpected value of the 'platform' field (None). My
> guess is that something was changed in buildbot that started sending these
> out.

Do you have such a message available, and can you attach it to the bug please? I would like to know more about it. Thanks.
Flags: needinfo?(mcote)
Comment on attachment 8518608 [details] [review] GitHub PR r+, but if this causes us to trigger nagios alerts, we may have to back out and re-implement with a threading queue.
Attachment #8518608 - Flags: review?(jgriffin) → review+
Okay, it's not what I thought. The child process wasn't crashing; it was just getting stuck for really long periods of time--so that explains the very delayed messages whimboo was seeing. The delays that I saw were during SSL setup or teardown. I disabled SSL for the pulsetranslator publisher, and I haven't seen it get stuck since. So yeah, further evidence that weird things are happening on that network.
Flags: needinfo?(mcote)
Interesting news, Mark! I will forward this information to the networking bug once I have an answer to my next question... Which port is the SSL connection using on the Pulse side? I need that info to give the networking people the right filter to search with. By the way, would disabling SSL for pulsetranslator also require changing pulsebuildmonitor to not use SSL for now? I ask because it will be a different port, and in our network we currently only have: dest: pulse.mozilla.org ; port 5672
Flags: needinfo?(mcote)
Summary: pulsetranslator completely failing to publish normalized messages → [pulsetranslator] With SSL enabled, publishing of normalized messages gets stuck
Regular (non-SSL) connections are on port 5672; SSL is on port 5671. Every connection to Pulse is separate, so the fact that pulsetranslator is currently publishing on the non-SSL port has no effect on any SSL connections from consumers or other publishers.
Flags: needinfo?(mcote)
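To illustrate the port split, a quick kombu sketch; the credentials and vhost are placeholders:

    from kombu import Connection

    # Plain AMQP on 5672, SSL (AMQPS) on 5671.
    plain = Connection('amqp://user:password@pulse.mozilla.org:5672//')
    secure = Connection('amqp://user:password@pulse.mozilla.org:5671//', ssl=True)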
(In reply to Mark Côté [:mcote] from comment #19)
> Every connection to Pulse is separate, so the fact that pulsetranslator is
> currently publishing on the non-SSL port has no effect on any SSL
> connections from consumers or other publishers.

Sure, but other consumers can have the same issue with SSL connections and would not receive messages. That's why I'm asking. So it looks like we have to get our ACLs for Mozmill and TPS updated to include 5672 too.
I filed bug 1102454 to get our ACLs enabled for SSL connections via port 5671. I will shortly update the network bug regarding the timeouts.
Whiteboard: [qa-automation-blocked]
We got some weird notification today. It was received on the 24th for what is presumably a build from the 22nd (with some fields correct, see "previous_buildid", which is indeed yesterday's build).
Mark, can you please check whether this was a fault in pulsetranslator, or whether the received message is correct and the fault lies with the build itself?
Flags: needinfo?(mcote)
Mark, what Andrei forgot to add is that we already got such a notification yesterday, so we are a bit surprised about this additional one coming in nearly a day later. We have seen this in the past already, so I wonder if pulsetranslator sometimes produces duplicates. If not, are we perhaps receiving those dupes from Pulse directly?
I suppose it *might* be possible for a duplicate to be sent, if an exception occurred after the message was sent while the connection was being closed or thereabouts. Its coming a day later, however, is very weird, unless it was unable to send any messages for a whole day. Perhaps we need an independent system listening to both raw and normalized build messages and correlating them, issuing alerts when a message should be received but isn't or is received more than once. Given all the weirdness lately, it's probably a good idea. Shouldn't be too hard... I'll find time, or perhaps find a contributor.
Flags: needinfo?(mcote)
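A rough outline of such a correlator, assuming both raw and normalized messages can be keyed by build id and that alerting is just logging for now; the timeout value is illustrative:

    import logging
    import time
    from collections import defaultdict

    ALERT_AFTER = 15 * 60  # seconds to wait for the normalized counterpart


    def alert(text):
        # Placeholder; could become email or a Nagios check later.
        logging.warning(text)


    class MessageCorrelator(object):
        def __init__(self):
            self.pending = {}                        # build id -> time raw seen
            self.normalized_count = defaultdict(int)

        def on_raw(self, build_id):
            self.pending[build_id] = time.time()

        def on_normalized(self, build_id):
            self.normalized_count[build_id] += 1
            self.pending.pop(build_id, None)
            if self.normalized_count[build_id] > 1:
                alert('duplicate normalized message for %s' % build_id)

        def check(self):
            # Call periodically; flags raw messages with no normalized copy.
            now = time.time()
            for build_id, seen_at in list(self.pending.items()):
                if now - seen_at > ALERT_AFTER:
                    alert('no normalized message for %s after %ds'
                          % (build_id, ALERT_AFTER))
                    del self.pending[build_id]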
Filed bug 1104911 about getting some tests.
Depends on: 1104911
Forgot to mention that I landed the patch to remove process separation a while ago, which helped me track down the SSL issue: https://github.com/mozilla/pulsetranslator/commit/2c34b9ac863d55de1f963e845ce9a42f4d8bcd7a
(In reply to Mark Côté [:mcote] from comment #25)
> Perhaps we need an independent system listening to both raw and normalized
> build messages and correlating them, issuing alerts when a message should be
> received but isn't or is received more than once. Given all the weirdness
> lately, it's probably a good idea. Shouldn't be too hard... I'll find time,
> or perhaps find a contributor.

That would be great. Sadly I won't have the time for that in the next few weeks. :/
Just a quick update here: since we disabled SSL, we haven't missed a single build message! So SSL is definitely to blame.
We've now moved Pulse to CloudAMQP and pulsetranslator to Heroku. It is back to using SSL, as whatever was causing problems in the datacentre shouldn't be an issue any longer. Please reopen if you see missing messages.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
(In reply to Mark Côté [:mcote] from comment #30)
> We've now moved Pulse to CloudAMQP and pulsetranslator to Heroku. It is

This move actually broke our whole mozmill-ci setup, and we are no longer able to access Pulse due to restrictions on networking routes in SCL3. This move should have been announced to consumers of Pulse. Also, I do not see a single bug in the Bugzilla Pulse component about this move. :( I think with this move we got at least one new host for which we need an ACL entry added.
Flags: needinfo?(mcote)
Flags: needinfo?(jgriffin)
Btw, I'm handling our currently broken state in bug 1215464.
And bug 1205867 is actually the regressor, so I will move the ni? requests and everything else over to bug 1215464 and bug 1205867.
Flags: needinfo?(mcote)
Flags: needinfo?(jgriffin)