Closed Bug 1250169 Opened 8 years ago Closed 7 years ago

RSS collection in tp5 should collect data from all processes, especially content process in e10s mode

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(e10s+)

RESOLVED WONTFIX
Tracking Status
e10s + ---

People

(Reporter: jmaher, Unassigned)

References

(Blocks 1 open bug)

Details

Currently we measure the main process RSS, but in e10s land we have main/content process.

we need to sort out the other counters as well.
Is tp5rss still relevant with AWSY and friends now reporting to perfherder? Just asking, not implying.
this is a good time to evaluate what is different.  I would be happy to remove memory counters that are duplicated in AWSY as long as we have e10s/non-e10s data on all platforms.
:erahm, do you have thoughts here?
Flags: needinfo?(erahm)
(In reply to Joel Maher (:jmaher) from comment #3)
> :erahm, do you have thoughts here?

It depends on what tp5rss does. If we're talking about the tp5 test described on the wiki [1], it sounds like it's measuring something different than what AWSY does. The description for RSS and Working Set indicates they're sampled every 20s, so that could be useful for detecting transient memory spikes (well, maybe better than AWSY at least).

AWSY opens 100 tp5 pages in 30 tabs, waiting 10 seconds per tab, then closes the tabs. It does this 5 times and makes several final memory measurements afterwards (immediately, after 30 seconds, and after forcing garbage collection). We then close the tabs and remeasure. We also make an effort to simulate user action to trigger other heuristic-based garbage collection (such as compacting GC).

It should also be noted we're not sending e10s data to treeherder yet. We do have support, but I'm not sure how we want to report that data.

Do we want to separate the data? This gets tricky with multiple content processes (their PID isn't deterministic), and do we want the RSS or USS of the content processes?

Should it be one combined metric as I've been doing for e10s memory analysis? For that case I do |total_memory = RSS(parent) + SUM(USS(children))|.
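
To make the combined metric above concrete, here is a minimal sketch of the arithmetic. The `ProcessMemory` container, the pids and the byte counts are all made up for illustration; this is not AWSY or Talos code, and it assumes the RSS and USS figures have already been read from the OS by some other means.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ProcessMemory:
    """Memory figures for one process, in bytes (hypothetical container)."""
    pid: int
    rss: int  # resident set size: includes pages shared with other processes
    uss: int  # unique set size: pages private to this process only


def total_memory(parent: ProcessMemory, children: Iterable[ProcessMemory]) -> int:
    """Compute RSS(parent) + SUM(USS(children))."""
    return parent.rss + sum(child.uss for child in children)


# Made-up numbers: one parent and two content processes.
parent = ProcessMemory(pid=100, rss=300_000_000, uss=250_000_000)
children = [
    ProcessMemory(pid=101, rss=200_000_000, uss=120_000_000),
    ProcessMemory(pid=102, rss=180_000_000, uss=90_000_000),
]

combined = total_memory(parent, children)
# Summing RSS everywhere counts shared pages once per process that maps them.
naive = parent.rss + sum(child.rss for child in children)
```

With these sample values `combined` is 510,000,000 bytes while `naive` is 680,000,000 bytes. The gap is the memory that a plain sum of RSS would count more than once.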

[1] https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5
Flags: needinfo?(erahm)
Oh right, also AWSY is only on Linux currently. It works on other platforms but we only have one test machine.
for Talos we only collect counters for tp5 runs, so we are effectively measuring the same thing.  We do e10s, non e10s, opt, pgo, linux|win|osx.  Maybe we should compare data for linux where we think there is duplication and either terminate collecting memory for linux or accept duplication for the time being.

Another thing is that we could modify talos to collect similar numbers to AWSY until we get more os/config coverage on AWSY.

I think we need to answer:
* what test makes sense to record memory (tp5 is different than AWSY, but similar)
* what specifically do we want to collect (RSS from the process or system polling)

answering that should help us determine which system and what to record.

Regarding the mention of collecting data on a timer from the OS vs collecting RSS from the process after each pageload, we only report the average data to perfherder and don't store the entire collection.  We had looked at this about 6 months ago and determined that there was not a simple way to determine what was useful other than the average value.
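
As a rough illustration of what reporting only the average implies, the sketch below compares two made-up sample series. The values and the 20 second interval are assumptions for the example, not real Talos data.

```python
from statistics import mean

# Made-up RSS samples in MB, one per polling interval (assumed 20 s apart).
steady = [400, 402, 401, 399, 400, 398]
spiky = [380, 380, 380, 500, 380, 380]

steady_mean = mean(steady)
spiky_mean = mean(spiky)

# The peak is only visible if the individual samples are kept.
spiky_peak = max(spiky)
```

Both series average 400 MB, so they would look identical if only the mean is reported, even though the second one has a transient 500 MB spike. Whether such a spike is worth tracking is exactly the open question; this only shows what the average cannot tell apart.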
Blocks: e10s-tests
tracking-e10s: --- → +
after further examination, it looks as if we take child processes into account for linux as well as 'plugin-container' for windows.  Possibly we can close this as worksforme?
Interesting behaviour: Private Bytes is being reported twice as often for tp5o-e10s than normal tp5. A modified cmanager_linux.py printing out the pidlist shows that the single, low result at the beginning of the test is from before the second process starts, but all the others come from managers managing two processes: http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/656163f081fd2bc74cfd3d8ad31ef75fee529bd3f9503ae3661dd93032860dccbadb84aa78fad5f5aefe62aa64789eb3801d659b75b193b4eb866976ceda988a
I cannot decipher the log file/comment- is this what we expected via irc?
Sorry about that.

The log is showing that, as expected, the first sample is from before the second process spins up. 

The manager was modified to output the sample and the pidlist at the same time. Unfortunately, from just that I can only confirm what we already suspected about the one single-process sample. I still don't know the reason for the extra samples.
Can I get some clarification on what "private bytes" means in this context? Or can you just point me to the code doing the measurements? Are we summing the USS of each process or is it RSS or is it a combo?
Flags: needinfo?(chutten)
private bytes is outlined here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1253984#c2

For windows we add a counter for each process and then sum those counters:
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/cmanager_win32.py#224

I am not sure if those pdh counters relate to USS or RSS.
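
The "one counter per process, then sum" approach can be sketched as follows. This is not the cmanager_win32.py code: `sample_counter` and the `read_counter` callback are hypothetical names, and the counter values are made up. The callback stands in for whatever OS query returns the value for a given pid.

```python
from typing import Callable, Dict, Iterable, Optional


def sample_counter(
    pids: Iterable[int],
    read_counter: Callable[[int], Optional[int]],
) -> int:
    """Sum one counter over every process that is alive at sample time.

    read_counter is a hypothetical callback returning the counter value
    for a pid, or None if that process has not started or has exited.
    """
    total = 0
    for pid in pids:
        value = read_counter(pid)
        if value is not None:
            total += value
    return total


# Fake counter tables standing in for the OS. At the first sample the
# content process (pid 2) has not started yet.
first: Dict[int, int] = {1: 150}
later: Dict[int, int] = {1: 160, 2: 90}

first_sample = sample_counter([1, 2], first.get)
later_sample = sample_counter([1, 2], later.get)
```

Here `first_sample` is 150 and `later_sample` is 250, which mirrors the single low result seen before the second process starts. It does not explain why more samples are reported in the e10s run.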
odd though, we started this bug talking about RSS, then turned it into private bytes.

right now RSS is collected from inside the browser:
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/pageloader/chrome/memory.js#66
I blame the :jmaher of 8 days ago for directing me to post here the partial results of my Private Bytes investigation :)

Maybe the Private Bytes counter needs its own bug?
Flags: needinfo?(chutten)
(In reply to Joel Maher (:jmaher) from comment #13)
> odd though, we started this bug talking about RSS, then turned it into
> private bytes.
> 
> right now RSS is collected from inside the browser:
> https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/pageloader/chrome/memory.js#66

Okay, so memory.js is broken. I'm having a hard time understanding the intent of it, but no matter what the intent is, it's not accomplishing its goals. It kind of stumbles into doing the right thing for the single process case (if the right thing is to print the RSS of the main process).

Let's take a step back and discuss this measurement. Even if it did the "right thing" and measured the RSS of every process, I'm not sure that's a terribly useful measurement. Summing the RSS of the parent and the USS of the children is probably more useful.

I believe we do that in another measurement (someone with more Talos knowledge can confirm that). If that's the case, maybe we should just drop this measurement or keep it intentionally main-process only.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX