Closed Bug 1196915 Opened 9 years ago Closed 8 years ago

Publish hg.mozilla.org pushes to Pulse

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: gps)

References

Details

Attachments

(1 file, 2 obsolete files)

A long, long time ago hg.mozilla.org published data to Pulse during pushes. That was removed for reasons that predate my involvement.

A number of consumers poll the pushlog APIs to determine when new pushes are available. In their zeal for fast turnaround times, they can poll rather frequently (every 10s or so).

Push-based notifications will almost certainly have much lower latency than polling, without the high server load. A number of people have asked for this recently, so let's have hg.mozilla.org publish to Pulse.

Now, network services are network services and things can and will go wrong. I will not guarantee that all pushes to hg.mozilla.org will successfully publish to Pulse. I cannot guarantee that Pulse will not lose data. So, push-based delivery should be considered best effort (with hopefully a 99+% success rate). As such, downstream consumers *must continue to poll hg.mozilla.org pushlogs* to ensure things don't fall through the cracks. However, they should be able to dramatically decrease the poll rate - to say every 60s or even every few minutes - because Pulse should deliver most data milliseconds after it is produced.
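
For reference, a reduced-rate polling fallback could look something like this minimal sketch (not the actual consumer code; the repo, the 60-second interval, and the use of the json-pushes version 2 startID parameter referenced later in this bug are illustrative):

    import time
    import requests

    # Hypothetical slow poller: remember the highest push ID we've seen and
    # only ask the pushlog for pushes newer than that.
    REPO = 'https://hg.mozilla.org/integration/fx-team'
    last_push_id = 0

    while True:
        resp = requests.get('%s/json-pushes' % REPO,
                            params={'version': 2, 'startID': last_push_id})
        pushes = resp.json().get('pushes', {})
        for push_id, push in sorted(pushes.items(), key=lambda kv: int(kv[0])):
            print('push %s by %s' % (push_id, push.get('user')))
            last_push_id = max(last_push_id, int(push_id))
        time.sleep(60)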
Sounds like a dupe of bug 1022701. Also relevant: bug 665118
In true fashion, I implemented this in a few hours yesterday but am currently toiling over the tests. I'm trying to write tests that simulate various network conditions so I have confidence that connectivity issues to pulse don't block pushes. I'm rather insistent that pushes to hg.mozilla.org don't have any hard 3rd party dependencies.
So, I'm writing tests simulating slow networks to verify that we handle timeouts properly. Yet all my tests take 30s to time out despite my setting the timeouts to 5s. Digging around the amqp source code, I found this in the connection-establishment function:

        try:
            self.sock.settimeout(None)
            self.sock.setsockopt(SOL_TCP, socket.TCP_NODELAY, 1)
            self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

            self._setup_transport()

            self._write(AMQP_PROTOCOL_HEADER)
        except (OSError, IOError, socket.error) as exc:
            if get_errno(exc) not in _UNAVAIL:
                self.connected = False
            raise

sock.settimeout(None) followed by a write(). That's how you create excessively long timeouts.

So Python's most popular AMQP library doesn't properly implement timeouts. I'll poke around PyPI before spinning up another thread or forking. Oy.
Why not spawn a subprocess that won't stop until it succeeds and not wait for pulse during the pushes themselves?
(In reply to Mike Hommey [:glandium] from comment #4)
> Why not spawn a subprocess that won't stop until it succeeds and not wait
> for pulse during the pushes themselves?

Because I don't like unparented processes lingering on servers.

I'm going to rewrite this hook once the new replication is in place anyway. Once the new replication is in place, I'll be able to guarantee that Pulse events will eventually fire.
Also, I've been doing cross-platform programming for so long that I forgot I can just use SIGALRM to trap timeouts in I/O. So I got timeouts working properly. Just polishing off the tests now.
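
For the curious, the SIGALRM approach is roughly this (a sketch, not the actual hook code; it is Unix-only and main-thread-only, and connect_fn is a stand-in for whatever blocking amqp call needs bounding):

    import signal

    class IOTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise IOTimeout('blocking I/O took too long')

    def run_with_timeout(connect_fn, seconds=5):
        # Arrange for SIGALRM to interrupt the blocking socket write even
        # though the library cleared the socket timeout itself.
        old_handler = signal.signal(signal.SIGALRM, _on_alarm)
        signal.alarm(seconds)
        try:
            return connect_fn()
        finally:
            signal.alarm(0)
            signal.signal(signal.SIGALRM, old_handler)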
And yes, we could put files in a directory and use a daemon process to retry publishing to Pulse until success. This is conceptually similar to what I'm going to use for replication. Furthermore, the new replication setup will be asynchronous with push so push times are shorter. Pulse publishing needs to happen after replication or there are race conditions. So Pulse publishing in the future needs to be tied into the new replication system, which will not be as simple as files in a directory and I don't want to duplicate effort by implementing something more sophisticated today. We already have consumers that poll, so I don't think it is a big deal if Pulse publishing isn't guaranteed at this juncture. Please correct me if I'm wrong.
testing: start Pulse container as part of hg cluster (bug 1196915); r?smacleod

We're about to add Pulse integration to hg.mozilla.org. Before we can do
this, we need the Docker testing cluster to run a Pulse server so we can
test the new features.

Fortunately, we've already implemented a Pulse Docker container as part
of the MozReview testing environment. So, we re-use that image. "All" we
need to do is hook up the Pulse container to the hg cluster. This is a
lot of code changes, but it's pretty boilerplate.
hgserver: publish pushes to Pulse (bug 1196915)

THIS COMMIT SHOULD BE SPLIT UP BEFORE REVIEW.
Depends on: 1237811
https://tools.taskcluster.net/pulse-inspector/#!((exchange:exchange/hgpushes/v1,routingKeyPattern:%23))

The server is working its way through the backlog of notifications. It should be done in a few minutes. After that, Pulse messages are guaranteed at-least-once delivery. I'll write up more docs soon.

This is experimental until this bug is closed: we don't have monitoring on this yet, so it could stop working at any time.
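
If you want to experiment with the exchange now, a minimal consumer might look like this (a sketch, not official client code; the Pulse host and SSL port are the usual pulse.mozilla.org defaults, and the credentials and queue name are placeholders you'd replace with your own Pulse account):

    from kombu import Connection, Exchange, Queue

    exchange = Exchange('exchange/hgpushes/v1', type='topic', passive=True)
    queue = Queue('queue/<user>/hgpushes', exchange=exchange,
                  routing_key='#')

    def on_message(body, message):
        # body['payload'] carries the push data; ack so the queue drains.
        print(body['payload'])
        message.ack()

    with Connection('amqp://<user>:<password>@pulse.mozilla.org:5671',
                    ssl=True) as conn:
        with conn.Consumer(queue, callbacks=[on_message]):
            while True:
                conn.drain_events()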
Attachment #8651324 - Attachment is obsolete: true
Attachment #8651325 - Attachment is obsolete: true
OK, the backlog finally cleared. It appears it takes 2-3s after push for replication to complete and the Pulse message to send. Hopefully that's instantaneous enough for most people :)
That's better than what we have now, based on buildbot polling the various repos. When do you expect this to be production ready, so that I can switch pulsebot to it? :)
https://hg.mozilla.org/hgcustom/version-control-tools/rev/ba4a16960322bc091c4d43c771b25a99e814c955#l2.19 misses the "))"

Looking at the messages, I notice that push.time and _meta.sent use different datetime formats. I noticed because I thought I could easily figure out the lag ;-). I think it'd be good to make that consistent.

I like that the push data is minimal, and one can hop over to hg.m.o to get details. Though, given that it's a list of pushes you get the data for, the url could be in the payload, and not on the push? No idea if that's helpful or counter-productive. Just rambling.

Just noticed this one flying by:

root:{} 2 items
  payload:{} 3 items
    pushlog_pushes:[] 0 items
    heads:[] 1 item
    repo_url:https://hg.mozilla.org/experimental/firefox-unified-stage
  _meta:{} 4 items
    sent:2016-04-21T08:03:07.672240
    routing_key:hg.push.1
    serializer:json
    exchange:exchange/hgpushes/v1

(shortly after an fx-team push, https://hg.mozilla.org/integration/fx-team/json-pushes?version=2&full=1&startID=18840&endID=18841) https://hg.mozilla.org/experimental/firefox-unified had the same issue.

The docs say that that's OK, but could you add in which cases that's OK?
Thank you very much for making this work. The few seconds lag is more than acceptable. Thanks!

Could the routing key be something different than "hg.push.1"?
If it was the repo path, I could simply listen to a specific repo (e.g. build.buildbot-configs).
(In reply to Axel Hecht [:Pike] from comment #16)
> https://hg.mozilla.org/hgcustom/version-control-tools/rev/
> ba4a16960322bc091c4d43c771b25a99e814c955#l2.19 misses the "))"
> 
> Looking at the messages, I notice that push.time and _meta.sent use
> different datetime formats. I noticed because I thought I could easily
> figure out the lag ;-) . I think it'd be good to make that consistent.

_meta is a convention used by the Pulse producer library. I would ignore it completely.

> I like that the push data is minimal, and one can hop over to hg.m.o to get
> details. Though, given that it's a list of pushes you get the data for, the
> url could be in the payload, and not on the push? No idea if that's helpful
> or counter-productive. Just rambling.

I'm not sure what you mean here.

> 
> Just noticed this one flying by:
> 
> root:{} 2 items
> payload:{} 3 items
> pushlog_pushes:[] 0 items
> heads:[] 1 item
> repo_url:https://hg.mozilla.org/experimental/firefox-unified-stage
> _meta:{} 4 items
> sent:2016-04-21T08:03:07.672240
> routing_key:hg.push.1
> serializer:json
> exchange:exchange/hgpushes/v1
> 
> (shortly after an fx-team push,
> https://hg.mozilla.org/integration/fx-team/json-
> pushes?version=2&full=1&startID=18840&endID=18841)
> https://hg.mozilla.org/experimental/firefox-unified had the same issue.
> 
> The docs say that that's OK, but could you add in which cases that's OK?

I'll update the docs. tl;dr it is because of the "experimental" part in the URL. Those repos are special :) I don't expect to see this on regular repos.
(In reply to Armen Zambrano [:armenzg] - Engineering productivity from comment #17)
> Could the routing key be something different than "hg.push.1"?
> If it was the repo path, I could simply listen to a specific repo (e.g.
> build.buildbot-configs).

Makes sense. The routing key is now the relative path of the repository on hg.mozilla.org. This includes "/". Not sure if that character has special meaning in routing keys. If so, we can normalize somehow.
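
(For what it's worth, AMQP topic exchanges treat "." as the word separator for the "*" and "#" wildcards, so "/" is just an ordinary character in the key.) With the repo path as the routing key, a binding for a single repository could look like this sketch (the queue name is a placeholder, and build/buildbot-configs stands in for whichever repo you care about):

    from kombu import Exchange, Queue

    exchange = Exchange('exchange/hgpushes/v1', type='topic', passive=True)

    # Bind only to pushes for one repository; the routing key is the repo's
    # relative path on hg.mozilla.org.
    queue = Queue('queue/<user>/buildbot-configs', exchange=exchange,
                  routing_key='build/buildbot-configs')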
fubar: can you please enable the check_pulsenotifier_lag Nagios check on hgssh3? It should be configured just like the existing kafka/zookeeper checks on that machine in terms of interval and notification settings.
Flags: needinfo?(klibby)
done. will need mana page and docs on what to do when it stops working.
Flags: needinfo?(klibby)
(In reply to Mike Hommey [:glandium] from comment #15)
> That's better than what we have now, based on buildbot polling the various
> repos. When do you expect this to be production ready, so that I can switch
> pulsebot to it? :)

We now have a monitor verifying we're writing to Pulse. I think it is safe to switch pulsebot to this.

For more important consumers (like build automation), I'd still wait until this bug is closed.
Blocks: 1266559
Blocks: 1065567
No longer blocks: 1266559
Docs patch to fix link.
Attachment #8744406 - Flags: review?(gps)
Attachment #8744406 - Flags: review?(gps) → review+
(In reply to Gregory Szorc [:gps] from comment #25)
> (In reply to Mike Hommey [:glandium] from comment #15)
> > That's better than what we have now, based on buildbot polling the various
> > repos. When do you expect this to be production ready, so that I can switch
> > pulsebot to it? :)
> 
> We now have a monitor verifying we're writing to Pulse. I think it is safe
> to switch pulsebot to this.
> 
> For more important consumers (like build automation), I'd still wait until
> this bug is closed.

It looks like some events might go missing. I didn't receive an event for https://hg.mozilla.org/try/rev/64d70556b384.
I found an exception in the logs that may indicate a problem. I'll look into this on my Monday.
I think I fixed the bug causing messages to get lost.

Also, that message queue was a bit behind because we had a few servers reboot suddenly over the weekend due to a buggy kernel module. The Pulse notifications were subsequently behind. We missed the alerts because we were dealing with the more important sudden server restart syndrome :/ Everything should be unwedged now. Please let me know if you see anything wonky. I'll keep my eye on server logs.
Actually, it's the beginning of the week and we have notifications that Pulse delivery is working as expected. I'm going to mark this resolved and softly recommend consumers use it.

If I were someone important like Treeherder, Buildbot, or TaskCluster, I'd:

a) wait for someone like pulsebot to confirm it is working without issue
b) deploy Pulse-based notifications as a supplement - not a replacement - to existing pushlog querying. Once Pulse delivery has been confirmed as robust, you can consider killing regular pushlog based querying.
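
One way to structure the supplement approach (a sketch under the assumption that each entry in the payload's pushlog_pushes list carries a pushid; check the actual message schema before relying on it):

    seen_push_ids = set()

    def process_push(push):
        # Placeholder for whatever the consumer actually does with a push.
        print('processing', push)

    def handle_pulse_message(body, message):
        # Fast path: act on pushes as Pulse announces them.
        for push in body['payload']['pushlog_pushes']:
            if push['pushid'] not in seen_push_ids:
                seen_push_ids.add(push['pushid'])
                process_push(push)
        message.ack()

    def reconcile_with_pushlog(polled_pushes):
        # Slow path: every few minutes, compare a json-pushes query against
        # what Pulse delivered and pick up anything that fell through.
        for push_id, push in polled_pushes.items():
            if int(push_id) not in seen_push_ids:
                seen_push_ids.add(int(push_id))
                process_push(push)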
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Blocks: 1267520
There was a several hour Pulse message delivery lag that just ended. Details in bug 1267619. Underlying issue was a bug in the new notifications code on hg.mozilla.org. The double good news is the bug is fixed and we eventually delivered all messages to Pulse, so no data was lost. (This is why pulsebot spewed a bunch of delayed messages in #developers a few minutes ago.) This is more or less how queue systems are supposed to work (modulo the delivery delay of course). Yay for no data loss.
Blocks: 1049530