Closed Bug 1424461 Opened 6 years ago Closed 5 years ago

Resident Memory linux64 increased significantly after TaskCluster worker upgrade

Categories

(Testing :: AWSY, defect)

Version 3
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: gps, Unassigned)

References

Details

Attachments

(1 file)

In bug 1291940 I made some changes to the TaskCluster worker environment to replace Docker's use of AUFS with overlayfs. Along the way, the Linux kernel was upgraded from some version of 3.13 to some version of 4.4.

When we rolled out the new machine images (AMIs) to TaskCluster, the "Resident Memory" metric for AWSY increased substantially. We can see this at:

https://treeherder.mozilla.org/perf.html#/graphs?series=autoland,1549348,1,4&zoom=1511935636955.0562,1512780612000,425000000,571910112.3595506

There are two big bumps in the past week. One for the initial deploy of the new AMIs. Another (now) for the subsequent deploy a few weeks later. (The old AMIs were restored after issues with the original deployment were spotted.) The 2 AMIs exhibiting increased memory are almost identical in terms of composition.

When we rolled out new AMIs, the "Resident Memory" metric was bi-modal with groups determined by the AMI the task ran on. Old AMIs with 3.13+AUFS had lower memory use. New AMIs with 4.4+overlayfs had higher memory use.

We know this is a side-effect of the AMIs on the AWSY test workers - not the build workers or a way Firefox is built - because:

a) We rolled out updated AMIs to the build workers ~26 hours ago but we didn't see an AWSY regression until ~6 hours ago, when the updated AMIs were also rolled out to test workers.
b) Retriggering AWSY on an old build/task that was in the middle of a block of low memory results now yields bi-modal memory usage

So, something about the TC worker change caused Firefox's about:memory to report a higher RSS for this build configuration. I'm not sure if Firefox is actually using more memory now, whether Firefox/Linux is reporting memory usage differently, or what.

Supporting task retriggers:

* https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=41529b9b705f575fcffc0ed33213250632889c7d&group_state=expanded&filter-searchStr=sy
* https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=33570a2d4633a6fad41cbd706dd5bf2eb266720d&group_state=expanded&filter-searchStr=sy
* https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=6e602239c62f1c75d8f10b2a334298b6b5c765e2&group_state=expanded&filter-searchStr=sy
Looks like this happens on all build configurations, not just no-stylo.

https://treeherder.mozilla.org/perf.html#/graphs?series=autoland,1549348,1,4&series=autoland,1534508,1,4&series=autoland,1480715,1,4&zoom=1511601146224.719,1512780612000,420599250.9363296,592883895.1310861
Summary: Resident Memory opt stylo-disabled linux64 increased significantly after TaskCluster worker upgrade → Resident Memory linux64 increased significantly after TaskCluster worker upgrade
glandium had a theory for this during the work week. Something to do with changes to memory allocation in Linux or something. Needinfo him to get that theory written down.
Flags: needinfo?(mh+mozilla)
I was thinking this could be transparent huge pages. I'll try to confirm this theory.
Flags: needinfo?(mh+mozilla)
A basic test disabling transparent huge pages suggests that's not the problem.
There is not enough information in the about:memory reports that awsy records. We'd need the data from e.g. /proc/pid/smaps. Greg, how can one force to use the old kernel now to gather that data?
Flags: needinfo?(gps)
dustin hooked me up with access to generate my own AMIs. I have yet to learn how all that works. I'll try to do that this week. I should be able to provide you with access to an EC2 instance that you can SSH into. Keeping needinfo on me until I do that...
I got access to the necessary AWS account and hooked glandium up with a one-off instance running the AMI before we mucked with AUFS and the kernel version in November.
Flags: needinfo?(gps)
(In reply to Mike Hommey [:glandium] from comment #8)
> Created attachment 8941342 [details]
> Bug 1424461 - Revival of /proc/self/smaps reporting
> 
> This code was removed in bug 912165. This forward-port doesn't take care
> of making it nicer in about:memory.

Only attached this because the forward-port was not straightforward, and this might be useful in the future.

At the moment, I'm getting a crash early when running AWSY on the one-off instance that gps setup, but it looks like there's already enough difference in memory use at StartSettled time that there might be something to find there.
Here's the difference in RSS I see at StartSettled:

With the old kernel:
203.66 MB (100.0%) -- rss
├──125.63 MB (61.69%) -- anonymous/anonymous, outside brk()
│  ├──125.04 MB (61.40%) ── [rw-p] [100]
│  └────0.59 MB (00.29%) ── [r-xp] [11]
├───54.09 MB (26.56%) -- shared-libraries
│   ├──37.77 MB (18.54%) -- shared-libraries-mozilla
│   │  ├──35.34 MB (17.35%) -- libxul.so
│   │  │  ├──31.05 MB (15.25%) ── [r-xp] [2]
│   │  │  ├───4.12 MB (02.02%) ── [r--p]
│   │  │  └───0.17 MB (00.08%) ── [rw-p]
│   │  └───2.43 MB (01.19%) ++ (16 tiny)
│   └──16.32 MB (08.01%) -- shared-libraries-other
│      ├──13.56 MB (06.66%) ++ (128 tiny)
│      └───2.76 MB (01.36%) -- libgtk-3.so.0.1800.9
│          ├──2.70 MB (01.33%) ── [r-xp]
│          └──0.06 MB (00.03%) ++ (2 tiny)
├───23.84 MB (11.71%) -- other-files
│   ├──11.11 MB (05.46%) ── omni.ja/[r--p] [2]
│   ├──10.16 MB (04.99%) ── SYSV00000000/[rw-s] [2]
│   └───2.57 MB (01.26%) ++ (19 tiny)
└────0.10 MB (00.05%) ++ (2 tiny)

With the new one:
262.48 MB (100.0%) -- rss
├──139.32 MB (53.08%) -- anonymous/anonymous, outside brk()
│  ├──138.74 MB (52.86%) ── [rw-p] [97]
│  └────0.58 MB (00.22%) ── [r-xp] [13]
├───90.35 MB (34.42%) -- shared-libraries
│   ├──61.56 MB (23.45%) -- shared-libraries-mozilla
│   │  ├──57.71 MB (21.99%) -- libxul.so
│   │  │  ├──53.43 MB (20.35%) ── [r-xp] [2]
│   │  │  ├───4.12 MB (01.57%) ── [r--p]
│   │  │  └───0.17 MB (00.07%) ── [rw-p]
│   │  └───3.84 MB (01.46%) ++ (16 tiny)
│   └──28.79 MB (10.97%) -- shared-libraries-other
│      ├──23.40 MB (08.91%) ++ (128 tiny)
│      └───5.39 MB (02.06%) -- libgtk-3.so.0.1800.9
│          ├──5.33 MB (02.03%) ── [r-xp]
│          └──0.06 MB (00.02%) ++ (2 tiny)
├───32.70 MB (12.46%) -- other-files
│   ├──17.33 MB (06.60%) ── omni.ja/[r--p] [2]
│   ├──10.16 MB (03.87%) ── SYSV00000000/[rw-s] [2]
│   └───5.21 MB (01.99%) ++ (19 tiny)
└────0.10 MB (00.04%) ++ (2 tiny)

(not putting the diff about:memory gives, because somehow, it's wrong)

So it seems the newer kernel is more aggressive about keeping files data in memory. In fact, it seems the older kernel was suspiciously lenient, especially considering we're hinting the kernel to read the libraries ahead. We map ~80MB of libxul.so, and with the old kernel we only ended up with 35.3MB of it in memory, vs. 57.7MB with the new one.

It seems it might be worthwhile tracking the resident size in anonymous pages. Eric, what do you think?

I'd like to be able to do a full awsy run, though, to confirm this is the only cause. At the moment, it crashes (SIGBUS) during the first loop, so before storing any report after the StartSettled one. I wish we just had a workertype available with those old AMIs, so that they could be easily triggered, in the exact way they're run by taskcluster, because at the moment, I don't know what's different between my manual setup and tc's. Greg, do you think you could set that up? (or get someone from the TC team to set that up)
Flags: needinfo?(gps)
Flags: needinfo?(erahm)
I'm actively learning how docker-worker works so I can hack on it. I'll hopefully be in a good position by end of the day to get an old AMI up and running in a way that you can send tasks to it. Will leave needinfo set to track me doing this.
(In reply to Mike Hommey [:glandium] from comment #10)

> It seems it might be worthwhile tracking the resident size in anonymous
> pages. Eric, what do you think?

Huh, this code has been added and removed a few times (see bug 1412949). I'm okay with bringing it back, but we should:
  a) add a note that says "you may think we don't need this but we do"
  b) enabled it by default on Linux so that it has code coverage
  c) add some tests to make sure it keeps working
  d) add some documentation [1] on how to interpret the numbers and why they're useful

Also I wonder if this will run afoul of sandboxing?

[1] https://developer.mozilla.org/en-US/docs/Mozilla/Performance/about:memory
Flags: needinfo?(erahm)
I'm not talking about the full blown stuff, just one entry for anynomous resident only, and graphing it. That only requires to read /proc/self/smaps, which we already read for resident-unique.
(In reply to Mike Hommey [:glandium] from comment #13)
> I'm not talking about the full blown stuff, just one entry for anynomous
> resident only, and graphing it. That only requires to read /proc/self/smaps,
> which we already read for resident-unique.

Ah okay so just rss-anonymous and then add tracking in awsy. That's fine, but why just that number and not shared libs and other files?
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #14)
> (In reply to Mike Hommey [:glandium] from comment #13)
> > I'm not talking about the full blown stuff, just one entry for anynomous
> > resident only, and graphing it. That only requires to read /proc/self/smaps,
> > which we already read for resident-unique.
> 
> Ah okay so just rss-anonymous and then add tracking in awsy. That's fine,
> but why just that number and not shared libs and other files?

We have resident, and we'd have resident-anonymous. The difference between both is obviously file mappings. While it might be interesting to know which specific files are contributing to that difference, I'm not sure we need that enough to have it there always.
Try running tasks on the gecko-1-b-linux-gps worker type. This is the same configuration as gecko-t-linux-xlarge, just running an older AMI with the older kernel and AUFS instead of overlayfs.

No promises it works. If it doesn't ping on IRC in #taskcluster and/or needinfo me.

When it comes to getting docker workers to run, pinging dustin or jonasfj may yield better results.
Flags: needinfo?(gps)
Flags: needinfo?(mh+mozilla)
Here's a try push with two awsy jobs:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7d450764cc66a8ad3c0e37f165821d1b7ebcca42

One of them is with a new worker, and the other is with the old worker.
The try push includes the patch that adds per-file rss info.

Comparing memory-report-TabsOpen-2.json.gz from both awsy runs seems to confirm comment 10: the new kernel keeps more of the files in memory, and that seems to be the only difference. I'd appreciate another set of eyes looking at those results, though. Eric, could you take a look?
Flags: needinfo?(mh+mozilla) → needinfo?(erahm)

(In reply to Mike Hommey [:glandium] from comment #17)

Here's a try push with two awsy jobs:
https://treeherder.mozilla.org/#/
jobs?repo=try&revision=7d450764cc66a8ad3c0e37f165821d1b7ebcca42

One of them is with a new worker, and the other is with the old worker.
The try push includes the patch that adds per-file rss info.

Comparing memory-report-TabsOpen-2.json.gz from both awsy runs seems to
confirm comment 10: the new kernel keeps more of the files in memory, and
that seems to be the only difference. I'd appreciate another set of eyes
looking at those results, though. Eric, could you take a look?

I'm reasonably sure I looked at this at some point, but don't recall the conclusion. Unfortunately the treeherder data has expired.

Flags: needinfo?(erahm)

Let's assume I was right, then, which means there's not much we can do about it. We may still want to act on comment 13, though. WDYT?

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(erahm)
Resolution: --- → WONTFIX

(In reply to Mike Hommey [:glandium] from comment #19)

Let's assume I was right, then, which means there's not much we can do about it. We may still want to act on comment 13, though. WDYT?

Yeah, I think it makes sense to add the measurement and track it.

Flags: needinfo?(erahm)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: