Closed Bug 1196915 Opened 9 years ago Closed 8 years ago

Publish hg.mozilla.org pushes to Pulse

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: gps)

References

Details

Attachments

(1 file, 2 obsolete files)

A long, long time ago hg.mozilla.org published data to Pulse during pushes. That was removed for reasons that predate my involvement.

A number of consumers poll the pushlog APIs to determine when new pushes are available. In their zeal for fast turnaround times, they can poll rather frequently (every 10s or so).

Push-based notifications will almost certainly have much lower latency than polling, without the high server load. A number of people have asked for this recently, so let's have hg.mozilla.org publish to Pulse.

Now, network services are network services and things can and will go wrong. I will not guarantee that all pushes to hg.mozilla.org will successfully publish to Pulse. I cannot guarantee that Pulse will not lose data. So, push-based delivery should be considered best effort (with hopefully a 99+% success rate). As such, downstream consumers *must continue to poll hg.mozilla.org pushlogs* to ensure things don't fall through the cracks. However, they should be able to dramatically decrease the poll rate - to say every 60s or even every few minutes - because Pulse should deliver most data milliseconds after it is produced.
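
For reference, a reduced-rate polling fallback could look something like this minimal sketch (not the actual consumer code; the repo, the 60-second interval, and the use of the json-pushes version 2 startID parameter referenced later in this bug are illustrative):

    import time
    import requests

    # Hypothetical slow poller: remember the highest push ID we've seen and
    # only ask the pushlog for pushes newer than that.
    REPO = 'https://hg.mozilla.org/integration/fx-team'
    last_push_id = 0

    while True:
        resp = requests.get('%s/json-pushes' % REPO,
                            params={'version': 2, 'startID': last_push_id})
        pushes = resp.json().get('pushes', {})
        for push_id, push in sorted(pushes.items(), key=lambda kv: int(kv[0])):
            print('push %s by %s' % (push_id, push.get('user')))
            last_push_id = max(last_push_id, int(push_id))
        time.sleep(60)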
Sounds like a dupe of bug 1022701. Also relevant: bug 665118
In true fashion, I implemented this in a few hours yesterday but am currently toiling over the tests. I'm trying to write tests that simulate various network conditions so I have confidence that connectivity issues to pulse don't block pushes. I'm rather insistent that pushes to hg.mozilla.org don't have any hard 3rd party dependencies.
So, I'm writing tests simulating slow networks to verify that we handle timeouts properly. Yet all my tests take 30s to time out despite my setting the timeouts to 5s. Digging around the amqp source code, I found this in the connection-establishment function:

        try:
            self.sock.settimeout(None)
            self.sock.setsockopt(SOL_TCP, socket.TCP_NODELAY, 1)
            self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

            self._setup_transport()

            self._write(AMQP_PROTOCOL_HEADER)
        except (OSError, IOError, socket.error) as exc:
            if get_errno(exc) not in _UNAVAIL:
                self.connected = False
            raise

sock.settimeout(None) followed by a write(). That's how you create excessively long timeouts.

So Python's most popular AMQP library doesn't properly implement timeouts. I'll poke around PyPI before spinning up another thread or forking. Oy.
Why not spawn a subprocess that won't stop until it succeeds and not wait for pulse during the pushes themselves?
(In reply to Mike Hommey [:glandium] from comment #4)
> Why not spawn a subprocess that won't stop until it succeeds and not wait
> for pulse during the pushes themselves?

Because I don't like unparented processes lingering on servers.

I'm going to rewrite this hook once the new replication is in place anyway. Once the new replication is in place, I'll be able to guarantee that Pulse events will eventually fire.
Also, I've been doing cross-platform programming for so long that I forgot I can just use SIGALRM to trap timeouts in I/O. So I got timeouts working properly. Just polishing off the tests now.
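
For the curious, the SIGALRM approach is roughly this (a sketch, not the actual hook code; it is Unix-only and main-thread-only, and connect_fn is a stand-in for whatever blocking amqp call needs bounding):

    import signal

    class IOTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise IOTimeout('blocking I/O took too long')

    def run_with_timeout(connect_fn, seconds=5):
        # Arrange for SIGALRM to interrupt the blocking socket write even
        # though the library cleared the socket timeout itself.
        old_handler = signal.signal(signal.SIGALRM, _on_alarm)
        signal.alarm(seconds)
        try:
            return connect_fn()
        finally:
            signal.alarm(0)
            signal.signal(signal.SIGALRM, old_handler)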
And yes, we could put files in a directory and use a daemon process to retry publishing to Pulse until success. This is conceptually similar to what I'm going to use for replication. Furthermore, the new replication setup will be asynchronous with push so push times are shorter. Pulse publishing needs to happen after replication or there are race conditions. So Pulse publishing in the future needs to be tied into the new replication system, which will not be as simple as files in a directory and I don't want to duplicate effort by implementing something more sophisticated today. We already have consumers that poll, so I don't think it is a big deal if Pulse publishing isn't guaranteed at this juncture. Please correct me if I'm wrong.
testing: start Pulse container as part of hg cluster (bug 1196915); r?smacleod

We're about to add Pulse integration to hg.mozilla.org. Before we can do
this, we need the Docker testing cluster to run a Pulse server so we can
test the new features.

Fortunately, we've already implemented a Pulse Docker container as part
of the MozReview testing environment. So, we re-use that image. "All" we
need to do is hook up the Pulse container to the hg cluster. This is a
lot of code changes, but it's pretty boilerplate.
hgserver: publish pushes to Pulse (bug 1196915)

THIS COMMIT SHOULD BE SPLIT UP BEFORE REVIEW.
Depends on: 1237811
https://tools.taskcluster.net/pulse-inspector/#!((exchange:exchange/hgpushes/v1,routingKeyPattern:%23))

The server is working its way through the backlog of notifications. It should be done in a few minutes. After that, Pulse messages are guaranteed at-least-once delivery. I'll write up more docs soon.

This is experimental until this bug is closed: we don't have monitoring on this yet, so it could stop working at any time.
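
If you want to experiment with the exchange now, a minimal consumer might look like this (a sketch, not official client code; the Pulse host and SSL port are the usual pulse.mozilla.org defaults, and the credentials and queue name are placeholders you'd replace with your own Pulse account):

    from kombu import Connection, Exchange, Queue

    exchange = Exchange('exchange/hgpushes/v1', type='topic', passive=True)
    queue = Queue('queue/<user>/hgpushes', exchange=exchange,
                  routing_key='#')

    def on_message(body, message):
        # body['payload'] carries the push data; ack so the queue drains.
        print(body['payload'])
        message.ack()

    with Connection('amqp://<user>:<password>@pulse.mozilla.org:5671',
                    ssl=True) as conn:
        with conn.Consumer(queue, callbacks=[on_message]):
            while True:
                conn.drain_events()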
Attachment #8651324 - Attachment is obsolete: true
Attachment #8651325 - Attachment is obsolete: true
OK, the backlog finally cleared. It appears it takes 2-3s after push for replication to complete and the Pulse message to send. Hopefully that's instantaneous enough for most people :)
That's better than what we have now, based on buildbot polling the various repos. When do you expect this to be production ready, so that I can switch pulsebot to it? :)
https://hg.mozilla.org/hgcustom/version-control-tools/rev/ba4a16960322bc091c4d43c771b25a99e814c955#l2.19 misses the "))"

Looking at the messages, I notice that push.time and _meta.sent use different datetime formats. I noticed because I thought I could easily figure out the lag ;-). I think it'd be good to make that consistent.

I like that the push data is minimal, and one can hop over to hg.m.o to get details. Though, given that it's a list of pushes you get the data for, the url could be in the payload, and not on the push? No idea if that's helpful or counter-productive. Just rambling.

Just noticed this one flying by:

root:{} 2 items
  payload:{} 3 items
    pushlog_pushes:[] 0 items
    heads:[] 1 item
    repo_url:https://hg.mozilla.org/experimental/firefox-unified-stage
  _meta:{} 4 items
    sent:2016-04-21T08:03:07.672240
    routing_key:hg.push.1
    serializer:json
    exchange:exchange/hgpushes/v1

(shortly after an fx-team push, https://hg.mozilla.org/integration/fx-team/json-pushes?version=2&full=1&startID=18840&endID=18841) https://hg.mozilla.org/experimental/firefox-unified had the same issue.

The docs say that that's OK, but could you add in which cases that's OK?
Thank you very much for making this work. The few seconds lag is more than acceptable. Thanks!

Could the routing key be something different than "hg.push.1"?
If it was the repo path, I could simply listen to a specific repo (e.g. build.buildbot-configs).
(In reply to Axel Hecht [:Pike] from comment #16)
> https://hg.mozilla.org/hgcustom/version-control-tools/rev/
> ba4a16960322bc091c4d43c771b25a99e814c955#l2.19 misses the "))"
> 
> Looking at the messages, I notice that push.time and _meta.sent use
> different datetime formats. I noticed because I thought I could easily
> figure out the lag ;-) . I think it'd be good to make that consistent.

_meta is a convention used by the Pulse producer library. I would ignore it completely.

> I like that the push data is minimal, and one can hop over to hg.m.o to get
> details. Though, given that it's a list of pushes you get the data for, the
> url could be in the payload, and not on the push? No idea if that's helpful
> or counter-productive. Just rambling.

I'm not sure what you mean here.

> 
> Just noticed this one flying by:
> 
> root:{} 2 items
> payload:{} 3 items
> pushlog_pushes:[] 0 items
> heads:[] 1 item
> repo_url:https://hg.mozilla.org/experimental/firefox-unified-stage
> _meta:{} 4 items
> sent:2016-04-21T08:03:07.672240
> routing_key:hg.push.1
> serializer:json
> exchange:exchange/hgpushes/v1
> 
> (shortly after an fx-team push,
> https://hg.mozilla.org/integration/fx-team/json-
> pushes?version=2&full=1&startID=18840&endID=18841)
> https://hg.mozilla.org/experimental/firefox-unified had the same issue.
> 
> The docs say that that's OK, but could you add in which cases that's OK?

I'll update the docs. tl;dr it is because of the "experimental" part in the URL. Those repos are special :) I don't expect to see this on regular repos.
(In reply to Armen Zambrano [:armenzg] - Engineering productivity from comment #17)
> Could the routing key be something different than "hg.push.1"?
> If it was the repo path, I could simply listen to a specific repo (e.g.
> build.buildbot-configs).

Makes sense. The routing key is now the relative path of the repository on hg.mozilla.org. This includes "/". Not sure if that character has special meaning in routing keys. If so, we can normalize somehow.
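
(For what it's worth, AMQP topic exchanges treat "." as the word separator for the "*" and "#" wildcards, so "/" is just an ordinary character in the key.) With the repo path as the routing key, a binding for a single repository could look like this sketch (the queue name is a placeholder, and build/buildbot-configs stands in for whichever repo you care about):

    from kombu import Exchange, Queue

    exchange = Exchange('exchange/hgpushes/v1', type='topic', passive=True)

    # Bind only to pushes for one repository; the routing key is the repo's
    # relative path on hg.mozilla.org.
    queue = Queue('queue/<user>/buildbot-configs', exchange=exchange,
                  routing_key='build/buildbot-configs')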
fubar: can you please enable the check_pulsenotifier_lag Nagios check on hgssh3? It should be configured just like the existing kafka/zookeeper checks on that machine in terms of interval and notification settings.
Flags: needinfo?(klibby)
done. will need mana page and docs on what to do when it stops working.
Flags: needinfo?(klibby)
(In reply to Mike Hommey [:glandium] from comment #15)
> That's better than what we have now, based on buildbot polling the various
> repos. When do you expect this to be production ready, so that I can switch
> pulsebot to it? :)

We now have a monitor verifying we're writing to Pulse. I think it is safe to switch pulsebot to this.

For more important consumers (like build automation), I'd still wait until this bug is closed.
Blocks: 1266559
Blocks: 1065567
No longer blocks: 1266559
Docs patch to fix link.
Attachment #8744406 - Flags: review?(gps)
Attachment #8744406 - Flags: review?(gps) → review+
(In reply to Gregory Szorc [:gps] from comment #25)
> (In reply to Mike Hommey [:glandium] from comment #15)
> > That's better than what we have now, based on buildbot polling the various
> > repos. When do you expect this to be production ready, so that I can switch
> > pulsebot to it? :)
> 
> We now have a monitor verifying we're writing to Pulse. I think it is safe
> to switch pulsebot to this.
> 
> For more important consumers (like build automation), I'd still wait until
> this bug is closed.

It looks like some events might go missing. I didn't receive an event for https://hg.mozilla.org/try/rev/64d70556b384.
I found an exception in the logs that may indicate a problem. I'll look into this on my Monday.
I think I fixed the bug causing messages to get lost.

Also, that message queue was a bit behind because we had a few servers reboot suddenly over the weekend due to a buggy kernel module. The Pulse notifications were subsequently behind. We missed the alerts because we were dealing with the more important sudden server restart syndrome :/ Everything should be unwedged now. Please let me know if you see anything wonky. I'll keep my eye on server logs.
Actually, it's the beginning of the week and we have notifications that Pulse delivery is working as expected. I'm going to mark this resolved and softly recommend consumers use it.

If I were someone important like Treeherder, Buildbot, or TaskCluster, I'd:

a) wait for someone like pulsebot to confirm it is working without issue
b) deploy Pulse-based notifications as a supplement - not a replacement - to existing pushlog querying. Once Pulse delivery has been confirmed as robust, you can consider killing regular pushlog based querying.
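
One way to structure the supplement approach (a sketch under the assumption that each entry in the payload's pushlog_pushes list carries a pushid; check the actual message schema before relying on it):

    seen_push_ids = set()

    def process_push(push):
        # Placeholder for whatever the consumer actually does with a push.
        print('processing', push)

    def handle_pulse_message(body, message):
        # Fast path: act on pushes as Pulse announces them.
        for push in body['payload']['pushlog_pushes']:
            if push['pushid'] not in seen_push_ids:
                seen_push_ids.add(push['pushid'])
                process_push(push)
        message.ack()

    def reconcile_with_pushlog(polled_pushes):
        # Slow path: every few minutes, compare a json-pushes query against
        # what Pulse delivered and pick up anything that fell through.
        for push_id, push in polled_pushes.items():
            if int(push_id) not in seen_push_ids:
                seen_push_ids.add(int(push_id))
                process_push(push)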
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Blocks: 1267520
There was a several hour Pulse message delivery lag that just ended. Details in bug 1267619. Underlying issue was a bug in the new notifications code on hg.mozilla.org. The double good news is the bug is fixed and we eventually delivered all messages to Pulse, so no data was lost. (This is why pulsebot spewed a bunch of delayed messages in #developers a few minutes ago.) This is more or less how queue systems are supposed to work (modulo the delivery delay of course). Yay for no data loss.
Blocks: 1049530