pulse and sns notifier lag due to running `hg log`

RESOLVED FIXED

Status

Developer Services
Mercurial: hg.mozilla.org

People

(Reporter: gps, Assigned: gps)

Attachments

(1 attachment)

(Assignee)

Description

a year ago
The pulse and sns notifier lag checks just fired. They were spending a lot of time processing an obsolescence replication message for users/gszorc_mozilla.com/firefox.

I've pushed to this repo several times before without issue. And this push was far from the largest push I've made. The push did have 384 obsolescence markers. But again, far from the most I've pushed at one time.

The pulse and sns consumer daemons were spending dozens of seconds processing ~11 replication messages containing obsolescence data. There were ~38 obsolescence markers in each message.

They appeared to be getting stuck in https://hg.mozilla.org/hgcustom/version-control-tools/file/49e4453aabdf/pylib/vcsreplicator/vcsreplicator/pushnotifications.py#l179.

I dumped one of the messages and basically issued the commands from that function manually. It quickly became apparent that the command taking a long time was `hg --hidden log -r <rev>`. Adding --profile to the command revealed it was spending 98% of its time computing tags data.

Tags data has historically plagued Mozilla. However, for the past few years, Mercurial has done a pretty decent job of caching this data on first resolution and we haven't had any performance problems with tags resolution on the servers.

I looked at the .hg/blackbox.log for this repo and noticed that the tags cache hit rate wasn't high. (It should be 100% on subsequent reads unless new changesets were added.)

Anyway, the underlying problem appears to be file permissions. The pulse and sns notifier daemons don't have write permissions to the repo. So they can't write the tags cache. And since nothing else likely resolves tags, the tags cache isn't getting populated by anything and is drifting out of date. Performance slowly degrades.
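
To illustrate the permissions problem, here is a minimal sketch of a check for whether the current user could write under a repo's .hg/cache directory. The function name `can_write_tags_cache` is hypothetical and not part of version-control-tools; it only inspects filesystem permissions and does not invoke `hg`.

```python
import os


def can_write_tags_cache(repo_path):
    """Return True if the current user could write under the repo's .hg/cache.

    Hypothetical helper for illustration only; it checks filesystem
    permissions and does not run Mercurial.
    """
    hg_dir = os.path.join(repo_path, ".hg")
    cache_dir = os.path.join(hg_dir, "cache")
    # If the cache directory is missing, it would have to be created in .hg,
    # so check .hg itself in that case.
    target = cache_dir if os.path.isdir(cache_dir) else hg_dir
    # Writing a file into a directory needs both write and search permission.
    return os.access(target, os.W_OK | os.X_OK)
```

Running this as the user the notifier daemons run as should return False for an affected repo, which would be consistent with the tags cache never being written.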

Unless we want to give the notification daemons the ability to write to the .hg/cache directory of repos (which I'd highly advise against because I like those processes not having write privileges), potential solutions include:

1) A periodic job that crawls the repos and bulk updates tag caches
2) A global hook that runs after repo pushes and triggers tags resolution. This will ensure the tags cache is populated.
3) Trigger tags resolution off the replication system (using another consumer daemon like we have for pulse/sns)
4) Systemd timer/unit that is activated by a repo push and triggers tags cache population asynchronously from the push

In all cases except #2, we have to tackle permissions, since the process may not be running as a user that has write access to all repos. Since systemd units can run as root, if we go that route we could have the invoked process look at the repo user/group owner and setuid/setgid accordingly before invoking an `hg` command to populate the tags cache.
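
The setuid/setgid idea could be sketched roughly as follows. This is an illustration under assumptions, not the approach that was deployed: the function names are hypothetical, it assumes the process starts as root, that `hg tags` is enough to populate the cache, and it relies on the `user`, `group` and `extra_groups` arguments of `subprocess.run`, which need Python 3.9 or newer.

```python
import os
import subprocess


def repo_owner(repo_path):
    """Return the (uid, gid) that owns the repository's .hg directory."""
    st = os.stat(os.path.join(repo_path, ".hg"))
    return st.st_uid, st.st_gid


def populate_tags_cache(repo_path, run=subprocess.run):
    """Run `hg tags` as the repo's owner so the tags cache can be written.

    Hypothetical helper. The `run` parameter exists only so the command can
    be inspected without executing it.
    """
    uid, gid = repo_owner(repo_path)
    return run(
        ["hg", "-R", repo_path, "tags"],
        user=uid,            # setuid to the repo owner in the child process
        group=gid,           # setgid to the repo group in the child process
        extra_groups=[],     # drop the parent's supplementary groups
        stdout=subprocess.DEVNULL,  # only the cache side effect matters
        check=True,
    )
```

Dropping privileges in the child process means the parent never needs write access to the repo itself, and any cache files are created with the repo owner's uid and gid.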
Comment hidden (mozreview-request)
Assignee: nobody → gps

Comment 2

a year ago
mozreview-review
Comment on attachment 8876332
hghooks: hook to trigger cache population (bug 1358239);

https://reviewboard.mozilla.org/r/147736/#review154284

nice
Attachment #8876332 - Flags: review?(glob) → review+

Comment 3

a year ago
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/4822814f9f1f
hghooks: hook to trigger cache population (bug 1358239); r=glob
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
(Assignee)

Comment 4

a year ago
This is deployed. I basically ran `sudo -u hg hg tags` on all non-user repos to seed the tags cache. Some repos definitely didn't have a tags cache, because the command took >60s to run on them.
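
A rough sketch of how that one-off seeding could be scripted. It is not the command that was actually run beyond the `sudo -u hg hg tags` invocation quoted above; the function names are hypothetical, and skipping any directory named `users` is an assumption about how user repos are laid out on disk.

```python
import os


def find_repos(root, skip=("users",)):
    """Yield directories under root that contain a .hg directory.

    Hypothetical helper. Directories whose name is in `skip` are pruned at
    any depth, on the assumption that user repos live under `users/`.
    """
    for dirpath, dirnames, _filenames in os.walk(root):
        if ".hg" in dirnames:
            yield dirpath
            dirnames[:] = []  # found a repo, so do not descend into it
        else:
            dirnames[:] = [d for d in dirnames if d not in skip]


def seed_command(repo_path, user="hg"):
    """Build the argv that resolves tags in one repo as the given user."""
    return ["sudo", "-u", user, "hg", "-R", repo_path, "tags"]
```

Each command from `seed_command` could then be handed to `subprocess.run`. Given that some repos took more than 60 seconds, running them one at a time and logging the duration would make it easy to spot repos that had no cache at all.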