Closed Bug 1358239 Opened 7 years ago Closed 7 years ago

pulse and sns notifier lag due to running `hg log`

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: gps, Assigned: gps)

Details

Attachments

(1 file)

hghooks: hook to trigger cache population (bug 1358239); 7 years ago Gregory Szorc [:gps] 59 bytes, text/x-review-board-request	glob : review+	Details

Gregory Szorc [:gps]

Assignee

Description

•

7 years ago

The pulse and sns notifier lag checks just fired. They were spending a lot of time processing an obsolescence replication message for users/gszorc_mozilla.com/firefox.

I've pushed to this repo several times before without issue. And this push was far from the largest push I've made. The push did have 384 obsolescence markers. But again, far from the most I've pushed at one time.

The pulse and sns consumer daemons were spending dozens of seconds processing ~11 replication messages containing obsolescence data. There were ~38 obsolescence markers in each message.

They appeared to be getting stuck in https://hg.mozilla.org/hgcustom/version-control-tools/file/49e4453aabdf/pylib/vcsreplicator/vcsreplicator/pushnotifications.py#l179.

I dumped one of the messages and basically issued the commands from that function manually. It quickly became apparent that the command taking a long time was `hg --hidden log -r <rev>`. Adding --profile to the command revealed it was spending 98% of time computing tags data.

Tags data has historically plagued Mozilla. However, for the past few years, Mercurial has done a pretty decent job of caching this data on first resolution and we haven't had any performance problems with tags resolution on the servers.

I looked at the .hg/blackbox.log for this repo and noticed that the tags cache hit rate wasn't high. (It should be 100% on subsequent reads unless new changesets were added.)

Anyway, the underlying problem appears to be file permissions. The pulse and sns notifier daemons don't have write permissions to the repo. So they can't write the tags cache. And since nothing else likely resolves tags, the tags cache isn't getting populated by anything and is drifting out of date. Performance slowly degrades.
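A quick way to confirm the permissions theory is to check, as the daemon's user, whether the repo's cache directory is writable. The following is a minimal sketch, not code from the deployed system: the helper name `can_write_tags_cache` is hypothetical, and it assumes the cache lives under `<repo>/.hg/cache` as described above.

```python
import os


def can_write_tags_cache(repo_path):
    """Return True if the current user could write files under <repo>/.hg/cache.

    Hypothetical helper for illustration. Run it as the same user as the
    notifier daemon, since os.access() checks the calling process's real
    uid/gid.
    """
    cache_dir = os.path.join(repo_path, ".hg", "cache")
    # If the cache directory does not exist yet, it would have to be created
    # inside .hg, so check that directory instead.
    if os.path.isdir(cache_dir):
        target = cache_dir
    else:
        target = os.path.join(repo_path, ".hg")
    # Creating a file in a directory needs both write and search permission.
    return os.access(target, os.W_OK | os.X_OK)
```

If this returns False for the daemon's user, any tags data that process computes cannot be persisted and will be recomputed on the next read.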

Unless we want to give the notification daemons the ability to write to the .hg/cache directory of repos (which I'd highly advise against because I like those processes not having write privileges), potential solutions include:

1) A periodic job that crawls the repos and bulk updates tag caches
2) A global hook that runs after repo pushes and triggers tags resolution. This will ensure the tags cache is populated.
3) Trigger tags resolution off the replication system (using another consumer daemon like we have for pulse/sns)
4) Systemd timer/unit that is activated by a repo push and triggers tags cache population asynchronously from the push

In all cases except #2, we have to tackle permissions, since we may not be running as a user that has write access to all repos. Since systemd units can run as root, if we go that route we could have the invoked process look at the repo user/group owner and setuid/setgid accordingly before invoking an `hg` command to populate the tags cache.
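The setuid/setgid idea for a root-run process could be sketched roughly as follows. This is an illustration under assumptions, not the approach that was eventually deployed: the function names are hypothetical, it relies on the `user`, `group` and `extra_groups` arguments that `subprocess.run` accepts on POSIX in Python 3.9+, and it uses `hg tags` as the command that triggers tags resolution.

```python
import os
import subprocess


def repo_owner(repo_path):
    """Return (uid, gid) of the repo's .hg directory."""
    st = os.stat(os.path.join(repo_path, ".hg"))
    return st.st_uid, st.st_gid


def populate_tags_cache_as_owner(repo_path, run=subprocess.run):
    """Run `hg tags` in repo_path with the repo owner's uid/gid.

    Intended to be called from a process running as root. The `run`
    parameter exists only so the command can be inspected without
    actually invoking hg.
    """
    uid, gid = repo_owner(repo_path)
    return run(
        ["hg", "-R", repo_path, "tags"],
        user=uid,          # setuid to the repo owner in the child process
        group=gid,         # setgid to the repo's group in the child process
        extra_groups=[],   # drop root's supplementary groups
        stdout=subprocess.DEVNULL,
        check=True,
    )
```

Dropping privileges in the child keeps the cache files owned by the repo's own user and group, so later pushes can still update them.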

Comment hidden (mozreview-request)

Mozilla has historically had problems with the tags cache causing
performance problems. Most of these problems were resolved a few years
ago by a rewritten tags cache implementation in Mercurial. However,
for that cache to work it needs to be populated. And, the cache isn't
populated until something attempts to resolve a symbol that could be
a tag.

This is normally not a problem. On hgweb, the tags cache will be
updated on most page views. However, it can be problematic on the
master server.

Nothing in the regular push flow appears to access tags data. This
means that the tags cache may not be written on the master server.
It so happens that we have processes running on the master server
that don't have write privileges to repos. So while they may trigger
tags cache population, the cache I/O fails and the cache is never
written. Assuming the cache is never written, these processes have to
compute more and more tags data over time and the amount of CPU
required balloons. Enough time passes and alerts start to fire because
processes that should be quick are spending dozens of seconds computing
tags data.

This commit solves that problem by implementing a pre transaction close
hook that accesses tags data, thus ensuring the tags cache is up to
date. Other processes should never have to compute tags data again.
Furthermore, `hg pull` will transfer tags cache data. So populating
the cache before replication ensures that hgweb machines also don't
have to populate the cache. Finally, the bundles generated for
clone bundles should have current tags caches, ensuring that people
who clone from them don't need to regenerate the data. So ensuring
the tags cache is populated on the master server is full of wins.

The new hook is globally installed on the master server so all repos
benefit from tags cache generation.

Unless we pre-populate the tags cache on all repos, the first push to
a repo after this is enabled may be slow.

All subsequent pushes may slow down because of this hook. The tags
cache population time is proportional to the number of files in a
repo and the number of heads being pushed. However, the common case
is 1 head per push and Mercurial should cache the manifest for that
head, thus ensuring rapid tags cache generation. The tags cache
generation time is recorded in blackbox logs. So if the logs show
this hook causes too much of a perf drain, we can look into alternate
mechanisms for populating the tags cache.
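
For readers unfamiliar with Mercurial hooks, a hook of the kind described in this commit message might look roughly like the sketch below. It is not the code from the reviewed commit; the file path and hook name in the comment are hypothetical. It assumes the standard in-process Python hook signature and that calling `repo.tags()` is enough to make Mercurial resolve tags and write its cache.

```python
# Hypothetical hgrc wiring (path and hook name are illustrative only):
#
#   [hooks]
#   pretxnclose.populate_tags_cache = python:/path/to/populate_tags_cache.py:hook


def hook(ui, repo, **kwargs):
    """pretxnclose hook that forces tags resolution.

    The push runs as a user with write access to the repo, so the cache
    can be persisted here even though read-only daemons cannot write it.
    """
    # Resolving tags is what triggers tags cache population; the returned
    # mapping itself is not needed.
    repo.tags()
    # Mercurial treats a truthy return value from a Python hook as failure,
    # so return 0 to let the transaction close normally.
    return 0
```

Because the hook returns 0 unconditionally, it never rejects a push by its return value; its only intended effect is the side effect of populating the cache.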

Review commit: https://reviewboard.mozilla.org/r/147736/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/147736/

:glob ✱

Updated

•

7 years ago

Assignee: nobody → gps

:glob ✱

Comment 2

•

7 years ago

mozreview-review

Comment on attachment 8876332 [details]
hghooks: hook to trigger cache population (bug 1358239);

https://reviewboard.mozilla.org/r/147736/#review154284

nice

Attachment #8876332 - Flags: review?(glob) → review+

Pulsebot

Comment 3

•

7 years ago

Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/4822814f9f1f
hghooks: hook to trigger cache population ; r=glob

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Gregory Szorc [:gps]

Assignee

Comment 4

•

7 years ago

This is deployed. I basically ran `sudo -u hg hg tags` on all non-user repos to seed the tags cache. Some repos definitely didn't have a tags cache because the command took >60s to run on some repos.
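
The one-off seeding step could be scripted along these lines. This is a hedged sketch rather than the command that was actually run: the function names and the idea of skipping a top-level `users` directory are assumptions made for illustration, and the script is assumed to be started as a user with write access to the repos (for example via `sudo -u hg`).

```python
import os
import subprocess


def find_repos(root, skip_top_level=("users",)):
    """Return sorted paths of Mercurial repos under root.

    Directories named in skip_top_level are ignored when they sit directly
    under root, which is how this sketch approximates "non-user repos".
    """
    repos = []
    for dirpath, dirnames, _filenames in os.walk(root):
        if dirpath == root:
            dirnames[:] = [d for d in dirnames if d not in skip_top_level]
        if ".hg" in dirnames:
            repos.append(dirpath)
            dirnames[:] = []  # do not descend into a repository
    return sorted(repos)


def seed_tags_caches(root, run=subprocess.run):
    """Run `hg tags` in every repo under root to populate its tags cache."""
    for repo in find_repos(root):
        run(["hg", "-R", repo, "tags"], stdout=subprocess.DEVNULL, check=True)
```

Expect the first run to be slow on repos that never had a tags cache, as noted above.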

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)
