Closed
Bug 1196915
Opened 9 years ago
Closed 8 years ago
Publish hg.mozilla.org pushes to Pulse
Categories
(Developer Services :: Mercurial: hg.mozilla.org, defect)
Developer Services
Mercurial: hg.mozilla.org
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: gps, Assigned: gps)
References
Details
Attachments
(1 file, 2 obsolete files)
842 bytes,
patch
|
gps
:
review+
|
Details | Diff | Splinter Review |
A long, long time ago hg.mozilla.org published data to Pulse during pushes. This was removed for reasons that happened before my time. A number of consumers are polling pushlog APIs quite frequently to determine when new pushes are available. In their zeal to have fast turnaround times, the poll frequency can be rather high (every 10s or so). Push-based notifications will almost certainly have much lower latency than polling, without the high server load. A number of people have asked for this recently, so let's have hg.mozilla.org publish to Pulse. Now, network services are network services and things can and will go wrong. I will not guarantee that all pushes to hg.mozilla.org will successfully publish to Pulse. I cannot guarantee that Pulse will not lose data. So, push-based delivery should be considered best effort (with hopefully a 99+% success rate). As such, downstream consumers *must continue to poll hg.mozilla.org pushlogs* to ensure things don't fall through the cracks. However, they should be able to dramatically decrease the poll rate - to say every 60s or even every few minutes - because Pulse should deliver most data milliseconds after it is produced.
Comment 1•9 years ago
|
||
Sounds like a dupe of bug 1022701. Also relevant: bug 665118
Assignee | ||
Comment 2•9 years ago
|
||
In true fashion, I implemented this in a few hours yesterday but am currently toiling over the tests. I'm trying to write tests that simulate various network conditions so I have confidence that connectivity issues to pulse don't block pushes. I'm rather insistent that pushes to hg.mozilla.org don't have any hard 3rd party dependencies.
Assignee | ||
Comment 3•9 years ago
|
||
So, I'm writing tests simulating slow networks to test that we handle timeouts properly. Yet all my tests are taking 30s to time out despite me setting timeouts to 5s. Digging around the source code for amqp and I find this code from the connection establishing function: try: self.sock.settimeout(None) self.sock.setsockopt(SOL_TCP, socket.TCP_NODELAY, 1) self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1) self._setup_transport() self._write(AMQP_PROTOCOL_HEADER) except (OSError, IOError, socket.error) as exc: if get_errno(exc) not in _UNAVAIL: self.connected = False raise sock.settimeout(None) followed by a write(). That's how you create excessively long timeouts. So Python's most popular AMQP library doesn't properly implement timeouts. I'll poke around PyPI before spinning up another thread or forking. Oy.
Comment 4•9 years ago
|
||
Why not spawn a subprocess that won't stop until it succeeds and not wait for pulse during the pushes themselves?
Assignee | ||
Comment 5•9 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #4) > Why not spawn a subprocess that won't stop until it succeeds and not wait > for pulse during the pushes themselves? Because I don't like unparented processes lingering on servers. I'm going to rewrite this hook once the new replication is in place anyway. Once the new replication is in place, I'll be able to guarantee that Pulse events will eventually fire.
Assignee | ||
Comment 6•9 years ago
|
||
Also, I've been doing cross-platform programming for too long that I forgot I can just set SIGALRM to trap timeouts in I/O. So I got timeouts working properly. Just polishing off tests now.
Assignee | ||
Comment 7•9 years ago
|
||
And yes, we could put files in a directory and use a daemon process to retry publishing to Pulse until success. This is conceptually similar to what I'm going to use for replication. Furthermore, the new replication setup will be asynchronous with push so push times are shorter. Pulse publishing needs to happen after replication or there are race conditions. So Pulse publishing in the future needs to be tied into the new replication system, which will not be as simple as files in a directory and I don't want to duplicate effort by implementing something more sophisticated today. We already have consumers that poll, so I don't think it is a big deal if Pulse publishing isn't guaranteed at this juncture. Please correct me if I'm wrong.
Assignee | ||
Comment 8•9 years ago
|
||
testing: start Pulse container as part of hg cluster (bug 1196915); r?smacleod We're about to add Pulse integration to hg.mozilla.org. Before we can do this, we need the Docker testing cluster to run a Pulse server so we can test the new features. Fortunately, we've already implemented a Pulse Docker container as part of the MozReview testing environment. So, we re-use that image. "All" we need to do is hook up the Pulse container to the hg cluster. This is a lot of code changes, but it's pretty boilerplate.
Assignee | ||
Comment 9•9 years ago
|
||
hgserver: publish pushes to Pulse (bug 1196915) THIS COMMIT SHOULD BE SPLIT UP BEFORE REVIEW.
Assignee | ||
Comment 10•8 years ago
|
||
https://hg.mozilla.org/hgcustom/version-control-tools/rev/fadab2daf32766eac0a83204cede50d94c0d67bc testing: start Pulse container as part of hg cluster (bug 1196915); r?smacleod https://hg.mozilla.org/hgcustom/version-control-tools/rev/63e64894b04bd835abd51d70ec4bc43410f5caca vcsreplicator: publish Pulse notifications for Mercurial pushes (bug 1196915)
Assignee | ||
Comment 11•8 years ago
|
||
https://tools.taskcluster.net/pulse-inspector/#!((exchange:exchange/hgpushes/v1,routingKeyPattern:%23)) The server is working its way through the backlog of notifications. It should be done in a few minutes. After that, Pulse messages are guaranteed at least once delivery. I'll write up more docs soon. This is experimental until this bug is closed: we don't have monitoring on this yet, so it could stop working at any time.
Assignee | ||
Comment 12•8 years ago
|
||
https://hg.mozilla.org/hgcustom/version-control-tools/rev/ba4a16960322bc091c4d43c771b25a99e814c955 docs: document hg.mozilla.org Pulse notifications (bug 1196915)
Assignee | ||
Comment 13•8 years ago
|
||
Docs up at https://mozilla-version-control-tools.readthedocs.org/en/latest/hgmo/notifications.html
Assignee | ||
Updated•8 years ago
|
Attachment #8651324 -
Attachment is obsolete: true
Assignee | ||
Updated•8 years ago
|
Attachment #8651325 -
Attachment is obsolete: true
Assignee | ||
Comment 14•8 years ago
|
||
OK, the backlog finally cleared. It appears it takes 2-3s after push for replication to complete and the Pulse message to send. Hopefully that's instantaneous enough for most people :)
Comment 15•8 years ago
|
||
That's better than what we have now, based on buildbot polling the various repos. When do you expect this to be production ready, so that I can switch pulsebot to it? :)
Comment 16•8 years ago
|
||
https://hg.mozilla.org/hgcustom/version-control-tools/rev/ba4a16960322bc091c4d43c771b25a99e814c955#l2.19 misses the "))" Looking at the messages, I notice that push.time and _meta.sent use different datetime formats. I noticed because I thought I could easily figure out the lag ;-) . I think it'd be good to make that consistent. I like that the push data is minimal, and one can hop over to hg.m.o to get details. Though, given that it's a list of pushes you get the data for, the url could be in the payload, and not on the push? No idea if that's helpful or counter-productive. Just rambling. Just noticed this one flying by: root:{} 2 items payload:{} 3 items pushlog_pushes:[] 0 items heads:[] 1 item repo_url:https://hg.mozilla.org/experimental/firefox-unified-stage _meta:{} 4 items sent:2016-04-21T08:03:07.672240 routing_key:hg.push.1 serializer:json exchange:exchange/hgpushes/v1 (shortly after an fx-team push, https://hg.mozilla.org/integration/fx-team/json-pushes?version=2&full=1&startID=18840&endID=18841) https://hg.mozilla.org/experimental/firefox-unified had the same issue. The docs say that that's OK, but could you add in which cases that's OK?
Comment 17•8 years ago
|
||
Thank you very much for making this work. The few seconds lag is more than acceptable. Thanks! Could the routing key be something different than "hg.push.1"? If it was the repo path, I could simply listen to a specific repo (e.g. build.buildbot-configs).
Assignee | ||
Comment 18•8 years ago
|
||
https://hg.mozilla.org/hgcustom/version-control-tools/rev/549a343ecdd30e3d1d0f29497970d6626496111d vcsreplicator: use repository relative path as routing key (bug 1196915)
Assignee | ||
Comment 19•8 years ago
|
||
(In reply to Axel Hecht [:Pike] from comment #16) > https://hg.mozilla.org/hgcustom/version-control-tools/rev/ > ba4a16960322bc091c4d43c771b25a99e814c955#l2.19 misses the "))" > > Looking at the messages, I notice that push.time and _meta.sent use > different datetime formats. I noticed because I thought I could easily > figure out the lag ;-) . I think it'd be good to make that consistent. _meta is a convention used from the Pulse producer library. I would ignore it completely. > I like that the push data is minimal, and one can hop over to hg.m.o to get > details. Though, given that it's a list of pushes you get the data for, the > url could be in the payload, and not on the push? No idea if that's helpful > or counter-productive. Just rambling. I'm not sure what you mean here. > > Just noticed this one flying by: > > root:{} 2 items > payload:{} 3 items > pushlog_pushes:[] 0 items > heads:[] 1 item > repo_url:https://hg.mozilla.org/experimental/firefox-unified-stage > _meta:{} 4 items > sent:2016-04-21T08:03:07.672240 > routing_key:hg.push.1 > serializer:json > exchange:exchange/hgpushes/v1 > > (shortly after an fx-team push, > https://hg.mozilla.org/integration/fx-team/json- > pushes?version=2&full=1&startID=18840&endID=18841) > https://hg.mozilla.org/experimental/firefox-unified had the same issue. > > The docs say that that's OK, but could you add in which cases that's OK? I'll update the docs. tl;dr it is because of the "experimental" part in the URL. Those repos are special :) I don't expect to see this on regular repos.
Assignee | ||
Comment 20•8 years ago
|
||
(In reply to Armen Zambrano [:armenzg] - Engineering productivity from comment #17) > Could the routing key be something different than "hg.push.1"? > If it was the repo path, I could simply listen to a specific repo (e.g. > build.buildbot-configs). Makes sense. The routing key is now the relative path of the repository on hg.mozilla.org. This includes "/". Not sure if that character has special meaning in routing keys. If so, we can normalize somehow.
Assignee | ||
Comment 21•8 years ago
|
||
https://hg.mozilla.org/hgcustom/version-control-tools/rev/171289dbbc4e227c838ec0e938b8b08485d5f089 ansible/hg-ssh: configure Nagios check for pulse notifier lag (bug 1196915)
Assignee | ||
Comment 22•8 years ago
|
||
fubar: can you please enable the check_pulsenotifier_lag Nagios check on hgssh3? It should be configured just like the existing kafka/zookeeper checks on that machine in terms of interval and notification settings.
Flags: needinfo?(klibby)
Comment 23•8 years ago
|
||
done. will need mana page and docs on what to do when it stops working.
Flags: needinfo?(klibby)
Assignee | ||
Comment 24•8 years ago
|
||
https://mana.mozilla.org/wiki/display/NAGIOS/hg+pulse+notifier+lag and https://mozilla-version-control-tools.readthedocs.org/en/latest/hgmo/ops.html#check-pulsenotifier-lag exist. The former could use a bit more love.
Assignee | ||
Comment 25•8 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #15) > That's better than what we have now, based on buildbot polling the various > repos. When do you expect this to be production ready, so that I can switch > pulsebot to it? :) We now have a monitor verifying we're writing to Pulse. I think it is safe to switch pulsebot to this. For more important consumers (like build automation), I'd still wait until this bug is closed.
Updated•8 years ago
|
Assignee | ||
Updated•8 years ago
|
Attachment #8744406 -
Flags: review?(gps) → review+
Comment 27•8 years ago
|
||
(In reply to Gregory Szorc [:gps] from comment #25) > (In reply to Mike Hommey [:glandium] from comment #15) > > That's better than what we have now, based on buildbot polling the various > > repos. When do you expect this to be production ready, so that I can switch > > pulsebot to it? :) > > We now have a monitor verifying we're writing to Pulse. I think it is safe > to switch pulsebot to this. > > For more important consumers (like build automation), I'd still wait until > this bug is closed. It looks like some events might go missing. I didn't receive an event for https://hg.mozilla.org/try/rev/64d70556b384.
Assignee | ||
Comment 28•8 years ago
|
||
I found an exception in the logs that may indicate a problem. I'll look into this on my Monday.
Assignee | ||
Comment 29•8 years ago
|
||
I think I fixed the bug causing messages to get lost. Also, that message queue was a bit behind because we had a few servers reboot suddenly over the weekend due to a buggy kernel module. The Pulse notifications were subsequently behind. We missed the alerts because we were dealing with the more important sudden server restart syndrome :/ Everything should be unwedged now. Please let me know if you see anything wonky. I'll keep my eye on server logs.
Assignee | ||
Comment 30•8 years ago
|
||
Actually, it's the beginning of the week and we have notifications that Pulse delivery is working as expected. I'm going to mark this resolved and softly recommend consumers use it. If I were someone important like Treeherder, Buildbot, or TaskCluster, I'd: a) wait for someone like pulsebot to confirm it is working without issue b) deploy Pulse-based notifications as a supplement - not a replacement - to existing pushlog querying. Once Pulse delivery has been confirmed as robust, you can consider killing regular pushlog based querying.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 31•8 years ago
|
||
There was a several hour Pulse message delivery lag that just ended. Details in bug 1267619. Underlying issue was a bug in the new notifications code on hg.mozilla.org. The double good news is the bug is fixed and we eventually delivered all messages to Pulse, so no data was lost. (This is why pulsebot spewed a bunch of delayed messages in #developers a few minutes ago.) This is more or less how queue systems are supposed to work (modulo the delivery delay of course). Yay for no data loss.
You need to log in
before you can comment on or make changes to this bug.
Description
•