Closed Bug 1436914 Opened 6 years ago Closed 6 years ago

0 vsize values on Windows after CFG Patch

Categories: Firefox Build System :: General, defect, P3
Tracking: firefox60 affected
Status: RESOLVED FIXED
People: Reporter: tjr; Assignee: Unassigned
Whiteboard: [MemShrink:P2]
Attachments: 1 file

From https://bugzilla.mozilla.org/show_bug.cgi?id=1235982#c115 :


This bug may have contributed to a sudden change in the Telemetry probe MEMORY_VSIZE[1] which seems to have occurred in Nightly 20180201[2][3].

There was a sudden appearance of dominating samples with the value 0, only for Windows[4]. I really have no idea how Windows would report a virtual memory size of 0 all of a sudden, but this seems like a Windows-only memory-adjacent change.

Is this intentional? Is this expected?

[1]: http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/824/alerts/?from=2018-02-01&to=2018-02-01
[2]: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3804441e575c9f46fcb03894de3c780eeae7197f&tochange=ee7f64d0764961e9663c681efc8d2b3759af79d6
[3]: https://mzl.la/2BLgEyE
[4]: https://mzl.la/2BNv29M
(In reply to Chris H-C :chutten from comment #115)
> This bug may have contributed to a sudden change in the Telemetry probe
> MEMORY_VSIZE[1] which seems to have occurred in Nightly 20180201[2][3].

I agree, this is suspicious, and CFG seems the most likely culprit.

As far as I can tell, this is how we get that data on Windows: https://searchfox.org/mozilla-central/rev/a539f8f7fe288d0b4d8d6a32510be20f52bb7a0f/xpcom/base/nsMemoryReporterManager.cpp#578

I copied that code into a new project and ran it with and without CFG, and it gave me data both ways.
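For reference, a minimal standalone sketch of what that measurement boils down to (assuming, per the searchfox link above and dmajor's note later in this bug, that vsize comes from the GlobalMemoryStatus family):

  #include <windows.h>
  #include <cstdio>

  // vsize on Windows: total virtual address space for the process minus
  // what is still available, via GlobalMemoryStatusEx.
  int main() {
    MEMORYSTATUSEX status;
    status.dwLength = sizeof(status);
    if (!GlobalMemoryStatusEx(&status)) {
      return 1;
    }
    unsigned long long vsize = status.ullTotalVirtual - status.ullAvailVirtual;
    printf("vsize: %llu bytes\n", vsize);
    return 0;
  }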

Then I used a local build with CFG, used the Browser Toolbox to run Cc["@mozilla.org/memory-reporter-manager;1"].getService(Ci.nsIMemoryReporterManager)["vsize"], and got a real value. I did it on Nightly as well. (If anyone else has Windows, your attempt would be appreciated too.)

I don't know why vsize would be zero. While we did modify some stuff in that file, we didn't modify vsize. It's entirely possible CFG is somehow fiddling with it, but I think we need to find some more correlations to nail it down.

As far as I can tell, it's only happening on Windows 8, 8.1, and 10. And it's not _only_ giving zero values: legitimate values are being sent too; it's just that some of the values are zeros.

Is there any way to investigate the zero values to try and find another correlation?
Flags: needinfo?(chutten)
My full-intuition (i.e., with zero proof) guess is that the 0s are being recorded early in startup or late in shutdown.

We poll the vsize amount irregularly via TelemetrySession's gatherMemory. Could you instrument the CFG build's gatherMemory to yell to the console (set toolkit.telemetry.log.level to 0 and use this._log) if it ever sees a zero? Maybe that'll nail it down if it's a "happens occasionally on some/most sessions" kind of thing.
Flags: needinfo?(chutten)
As for investigation... well, I can start up a cluster and poke around pings that show 0 values to see if they share traits. Maybe it's just one client bombarding us with millions of 0s. Maybe it's coming from every ping on affected systems. 

It'll take a while to start and for me to run the analysis. *time passes*

Okay.

From a sample of main pings sent from nightly builds with build ids > 20180202 this month, we have nearly two-thirds with 0 values in vsize. They are _not_ dominated by one or two specific clients, but appear to be spread amongst many clients in the usual distribution.

Very few clients that send histograms with zeros also send other pings that have no zeros in them.

No pings with 0 samples have non-0 samples in the same histogram. 

-----

In combination this seems to suggest that, if you're susceptible to recording 0s, you're definitely recording only 0s throughout the subsession, and are likely to record only 0s from now on.

-----

Looking at the Environment, we already knew that those affected were Windows-only. So I looked at os version, architecture, and sandbox level. I saw pings from every version of Windows 6.1+; sandboxing levels 0, 1, 2, 5, and 6; ...and...

Only from 64-bit users.

I thought this might be due to a too-small sample size, so I looked at _all_ Nightly main pings coming from builds Feb2 and onwards instead of just a sample.

Only 64-bit users send us zeros.

Maybe this is a clue.

I have posted my scratch analysis as a gist over here if you want to see what I did (and maybe check my logic): https://gist.github.com/chutten/5337c45ce106469794be3d62f7ae752b
I just loaded about:memory on today's 64-bit nightly and at the bottom I got:

  2,098,682.57 MB ── vsize

In words that's... over two terabytes? 

On a somewhat old nightly sitting around on my disk, I got something more reasonable:

      1,522.81 MB ── vsize

Perhaps CFG causes a bunch of stuff to happen in the address space, and this 2TB value overflows whatever width of integer telemetry is using to store it?
Histograms.json says:

  "MEMORY_VSIZE": {
...
    "description": "Virtual memory size (KB)"

We do have over INT_MAX KB.
vsize is reported[1] as an int64_t, which Telemetry coerces to uint32[2]. It would have to be 0 to be recorded as 0. The semantics[3] of the uint32_t coercion suggest that we'd get the overflown value, not a large negative one (i.e., for a value of 2^32 + 42, we'd be returned 42).

[1]: https://searchfox.org/mozilla-central/rev/b9f1a4ecba48b2d8c686669e32d109c40e927b48/xpcom/base/nsIMemoryReporter.idl#383
[2]: https://searchfox.org/mozilla-central/rev/b9f1a4ecba48b2d8c686669e32d109c40e927b48/toolkit/components/telemetry/TelemetryHistogram.cpp#1183
[3]: https://searchfox.org/mozilla-central/rev/b9f1a4ecba48b2d8c686669e32d109c40e927b48/js/public/Conversions.h#307
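As a standalone illustration of that coercion's arithmetic (an illustrative demo, not Telemetry's code; the ~2.2 TB byte count is the one ted measures below):

  #include <cstdint>
  #include <cstdio>

  int main() {
    // ToUint32 semantics: reduce the value mod 2^32.
    uint64_t wrapped = (1ull << 32) + 42;
    printf("%u\n", static_cast<uint32_t>(wrapped));  // prints 42

    // The vsize from this bug, in MEMORY_VSIZE's unit (KB):
    int64_t vsizeKB = 2201166503936ll / 1024;        // 2149576664 KB
    // That's below 2^32, so the coercion leaves it unchanged -- but it
    // is above 2^31, which turns out to matter (see bug 1438335 below).
    printf("%u\n", static_cast<uint32_t>(vsizeKB));  // prints 2149576664
    return 0;
  }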
Sounds reasonable... but my money's still on overflow. :)
It is the safer bet :)

:tjr, do :dmajor's observations align with what you see locally?
Flags: needinfo?(tom)
Looking at about:telemetry on my local Win64 Nightly (specifically about:telemetry#histograms-tab_search=MEMORY ), I see that I have all-zero samples for MEMORY_VSIZE:

MEMORY_VSIZE
87 samples, average = 0, sum = 0

    0 |#########################  87  100%
32768 |  0  0%


Happy to help with any info I can provide.
What does your about:memory report for vsize? Are you above uint32 max?
Flags: needinfo?(ted)
2,099,209.98 MB ── vsize

Which seems to match the value I got from calling nsIMemoryReporterManager.vsize in the browser console:
Cc["@mozilla.org/memory-reporter-manager;1"].getService(Ci.nsIMemoryReporterManager)["vsize"] / (1024  * 1024)
2099213.4609375
Flags: needinfo?(ted)
And yeah, the raw value in bytes is above 2^32:
2201166503936
I broke in VsizeDistinguishedAmount to sanity-check, and the values reported there match what the JS API is returning. I googled around to figure out how to measure this externally, and per this Stack Overflow answer: https://stackoverflow.com/a/34771743/69326 I got:

C:\Users\ted>wmic process where processid=6236 get VirtualSize
VirtualSize
2200894529536

This seems to be an expected result of enabling CFG:
http://www.alex-ionescu.com/?p=246

So I think there are two things of note here:
1) MEMORY_VSIZE is no longer a useful metric on Win64 with CFG
2) Something in the telemetry code that coerces this value to a 32-bit value clearly breaks in the face of values that overflow. It might be good to sort that out, but even if we do, reporting the actual value here isn't going to give us any useful information.
comment #6 has the details of how it is coerced to a uint32_t. (It's through JS::ToUint32)

So if MEMORY_VSIZE is not a useful metric on win64 with cfg... should we remove the metric? Should we replace it with a MB metric to try to buy us a few months?

Also, should we fix ToUint32? According to the comment, it's supposed to report the overflown value, not 0.
(In reply to Chris H-C :chutten from comment #14)
> comment #6 has the details of how it is coerced to a uint32_t. (It's through
> JS::ToUint32)
> 
> So if MEMORY_VSIZE is not a useful metric on win64 with cfg... should we
> remove the metric? Should we replace it with a MB metric to try to buy us a
> few months?

I don't think any permutation of the existing value is going to be usable. It's intended to be a measurement of how much of the address space we've used, but with CFG we can apparently no longer get a useful number for that in this way. It's possible we could write some more complicated code to get back to a more useful number here, I'm not sure. In general, address space for 64-bit builds is not a problem. 32-bit Windows binaries can access at most 4GB of address space (when running on a 64-bit version of Windows). 64-bit Windows binaries can access 8TB of address space, so it's likely that things would fail long before we exhausted that.

> Also, should we fix ToUint32? According to the comment, it's supposed to
> report the overflown value, not 0.

This sounds reasonable. If the function isn't doing what it says it does, then we should either fix it or fix its documentation!
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #15)
> It's possible we could write some more complicated code to get
> back to a more useful number here, I'm not sure.

Here's the culprit block on a newborn process:

RegionSize:        000001ffd9002000
State:             00002000  MEM_RESERVE
Type:              00040000  MEM_MAPPED

0:000> ? 000001ffd9002000
Evaluate expression: 2198368952320 = 000001ff`d9002000

Nothing really stands out in its flags to make it unique. I guess we could go by size: look for a crazy-huge reserved block and filter it out of the vsize metrics... but it's probably not worth the trouble given that address space isn't much of a concern on 64-bit builds.
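A rough sketch of that filtering idea (hypothetical and untested; the 1 GiB threshold is an arbitrary stand-in for "crazy-huge"):

  #include <windows.h>
  #include <cstdint>

  // Sum all non-free regions, skipping any single region larger than a
  // threshold (intended to exclude CFG's multi-terabyte MEM_RESERVE block).
  static int64_t VsizeExcludingHugeRegions(SIZE_T aThreshold) {
    int64_t total = 0;
    unsigned char* addr = nullptr;
    MEMORY_BASIC_INFORMATION info;
    while (VirtualQuery(addr, &info, sizeof(info)) == sizeof(info)) {
      if (info.State != MEM_FREE && info.RegionSize < aThreshold) {
        total += info.RegionSize;
      }
      addr = static_cast<unsigned char*>(info.BaseAddress) + info.RegionSize;
    }
    return total;
  }

  // e.g. VsizeExcludingHugeRegions(SIZE_T(1) << 30) to skip blocks >= 1 GiB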
See Also: → 1438335
I'll take the "reporting values between 2^31 and 2^32 go to the 0 bucket for some reason" bug over to Toolkit::Telemetry for investigation.

I'll leave dealing with the vsize distinguished amount, MEMORY_VSIZE(_MAX_CONTIGUOUS), and the rest to this bug.
(In reply to Chris H-C :chutten from comment #17)
> I'll take the "reporting values between 2^31 and 2^32 go to the 0 bucket for
> some reason" bug over to Toolkit::Telemetry for investigation.

Excellent!

> I'll leave dealing with the vsize distinguished amouunt,
> MEMORY_VSIZE(_MAX_CONTIGUOUS), and the rest to this bug.

Do we know if there is anything else to do? I.e. can we currently tell if any of these 0 values are actually above 2^32?
(In reply to Nicholas Nethercote [:njn] from comment #18)
> Do we know if there is anything else to do? I.e. can we currently tell if
> any of these 0 values are actually above 2^32?

From my investigation, it sounds like *by design*, CFG-enabled binaries will reserve >2TB of address space on startup. I don't think there's any meaningful data to be had from that measurement on Win64 builds regardless of whether we fix the overflow issue. Per dmajor's comment 16, we could probably calculate a different metric for Win64 by trying to filter out the one enormous region. Even then, I think my point in comment 15 stands--measuring virtual memory usage just isn't that useful on 64-bit builds, because it's not a limiting factor. (This is one of the primary reasons we wanted to move Win32 users to Win64, right?)
(In reply to Nicholas Nethercote [:njn] from comment #18)
> Do we know if there is anything else to do? I.e. can we currently tell if
> any of these 0 values are actually above 2^32?

To answer the second part of the question:

From my investigation in bug 1438335, the 0 values are any values that are actually 0 or that, when taken mod 2^32, fall between 2^31 and 2^32. This is something I plan on fixing (so that large values are always accumulated to the "high" bucket) and then documenting (because right now we say nothing about acceptable sample values beyond providing a JS API that takes Numbers and a C++ API that takes uint32_t).
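A tiny demonstration of that failure mode as described (a hypothetical model of the behaviour, not Telemetry's actual accumulation code):

  #include <cstdint>
  #include <cstdio>

  // Samples that land in [2^31, 2^32) after the mod-2^32 wrap turn
  // negative when pushed through a signed 32-bit integer; a
  // clamp-negatives-to-zero step then records them as 0.
  static uint32_t RecordedSample(uint64_t sample) {
    int32_t asSigned = static_cast<int32_t>(sample);  // wraps mod 2^32
    return asSigned < 0 ? 0 : static_cast<uint32_t>(asSigned);
  }

  int main() {
    printf("%u\n", RecordedSample(2149576664ull));  // CFG-sized vsize in KB -> 0
    printf("%u\n", RecordedSample(1559357ull));     // ~1.5 GB vsize -> unchanged
    return 0;
  }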
(In reply to Chris H-C :chutten from comment #3)
> Only 64-bit users send us zeros.

Out of curiosity, do _all_ 64-bit users send us zeros? If there are any nonzeros, do they correlate with OS? IIUC, only Win10 and patched Win8.1 are "CFG-aware". (And I'm also curious if any of those still don't enable CFG for some reason.)

Perhaps vsize may still be very mildly useful...
Flags: needinfo?(chutten)
(In reply to David Major [:dmajor] from comment #21)
> (In reply to Chris H-C :chutten from comment #3)
> > Only 64-bit users send us zeros.
> 
> Out of curiosity, do _all_ 64-bit users send us zeros? If there are any
> nonzeros, do they correlate with OS? IIUC, only Win10 and patched Win8.1 are
> "CFG-aware". (And I'm also curious if any of those still don't enable CFG
> for some reason.)
> 
> Perhaps vsize may still be very mildly useful...

Looking at the analysis[1], I didn't specifically check whether there were 64-bit clients in the sample that didn't send 0s. However, there were no users reporting 0s with a Windows version below 6.3 (well, there were some handfuls claiming 6.0-6.2, but that's probably mostly attributable to noise). We do know that Windows 7 is still a bit of a big deal in the broad Firefox population, x64 or x86, so having essentially none of them reporting 0s suggests it's CFG.

Can't say whether there are x64 8.1+ Windows users reporting non-0s. Would have to rerun the analysis, which I wouldn't be able to get to in the near term. You could do it, if you're interested:

1) Launch a cluster at analysis.telemetry.mozilla.org by following the instructions.
2) Load my analysis from the Gist URL.
3) Create a new environment_counts using not_zeros instead of has_zeros.
4) Run the notebook.

Of course, a dedicated scalar that detects CFG might also have value in and of itself, rather than using this to probe the population indirectly.

[1]: https://gist.github.com/chutten/5337c45ce106469794be3d62f7ae752b
Flags: needinfo?(chutten)
It seems as though CFG may have made vsize and vsize_max_contiguous take longer than they used to as well (see profile from bug 1439222).

Maybe we should disable vsize and vsize_max_contiguous reporting on CFG-aware platforms while we consider its broader usefulness?
(In reply to Chris H-C :chutten from comment #23)
> It seems as though CFG may have made vsize and vsize_max_contiguous take
> longer than they used to as well (see profile from bug 1439222).
> 
> Maybe we should disable vsize and vsize_max_contiguous reporting on
> CFG-aware platforms while we consider its broader usefulness?

vsize (GlobalMemoryStatus) ought to be constant-time regardless of address space usage, but vsize_max_contiguous (a VirtualQuery walk) is indeed linear in terms of memory mappings.

OTOH I'm not sure CFG makes it much worse than it was before. The extra mappings should be few. I wouldn't be surprised if vsize_max_contiguous was just slow to begin with.

Either way, I don't think anyone would be opposed to turning it off on 64-bit.
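For context, vsize_max_contiguous amounts to a walk like this (a sketch of the technique; the real reporter is in nsMemoryReporterManager.cpp), which is why its cost scales with the number of mappings:

  #include <windows.h>
  #include <cstdint>

  // Visit every region in the address space, one VirtualQuery call per
  // region, and track the largest MEM_FREE one.
  static int64_t LargestContiguousFreeBlock() {
    int64_t largest = 0;
    unsigned char* addr = nullptr;
    MEMORY_BASIC_INFORMATION info;
    while (VirtualQuery(addr, &info, sizeof(info)) == sizeof(info)) {
      if (info.State == MEM_FREE &&
          static_cast<int64_t>(info.RegionSize) > largest) {
        largest = info.RegionSize;
      }
      addr = static_cast<unsigned char*>(info.BaseAddress) + info.RegionSize;
    }
    return largest;
  }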
(Clearing ni, since this got figured out before I could investigate.)
Flags: needinfo?(tom)
Blocks: 1439222
How 'bout I just go ahead and remove vsize reporting and get this ball rolling...
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Comment on attachment 8953602 [details]
bug 1436914 - Stop reporting virtual memory size from CFG-enabled processes

https://reviewboard.mozilla.org/r/222804/#review228782

I don't think we should remove this reporting altogether. It's still quite valuable on 32-bit builds where address space exhaustion is a (the?) top source of crashes. Can we disable it for 64-bit only?
Comment on attachment 8953602 [details]
bug 1436914 - Stop reporting virtual memory size from CFG-enabled processes

https://reviewboard.mozilla.org/r/222804/#review228782

Wait, if this is just telemetry code and doesn't affect the display of vsizeMaxContiguous on crash reports, maybe that's a different story. Let's see what njn thinks.
Removing these just from telemetry seems ok.
This is indeed just removing two histograms from Telemetry. Crash reports and pings will contain the annotation, unchanged.

(I couldn't figure out a way in JavaScript to detect 64-bit builds that didn't involve string matching on things like "x86-64" which made me uncomfortable with the accuracy of the patch. So I opted for the rather more controversial approach you see here :)
(In reply to Chris H-C :chutten from comment #31)
> (I couldn't figure out a way in JavaScript to detect 64-bit builds that
> didn't involve string matching on things like "x86-64" which made me
> uncomfortable with the accuracy of the patch. So I opted for the rather more
> controversial approach you see here :)

For 64-bitness, adding something to AppConstants.jsm seems pretty reasonable.
Comment on attachment 8953602 [details]
bug 1436914 - Stop reporting virtual memory size from CFG-enabled processes

https://reviewboard.mozilla.org/r/222804/#review229234

Looks good to me, cheers.

Do we know if this is used in any datasets?
Maybe we can flag mreid to double-check?
Attachment #8953602 - Flags: review?(gfritzsche) → review+
Comment on attachment 8953602 [details]
bug 1436914 - Stop reporting virtual memory size from CFG-enabled processes

https://reviewboard.mozilla.org/r/222804/#review229294

Oh sorry, I had intended to transfer this to njn but left it unpublished. One day I will understand mozreview. Probably the day it gets shut down.
Attachment #8953602 - Flags: review?(dmajor) → review+
VSIZE can in fact be useful; it's helped track down some pretty crazy leaks (shmem, gfx, etc). See bug 1432086 for example.

Is it possible to just subtract the multi-terabyte page on CFG builds?
Whiteboard: [MemShrink]
:mreid, we are considering removing or drastically changing the character of MEMORY_VSIZE and MEMORY_VSIZE_MAX_CONTIGUOUS. Do you know of any datasets or reports that use these measures and would be impacted?
Flags: needinfo?(mreid)
We are just about to add these histograms to the main_summary dataset over in [1]. Other than that, I am not aware of any existing reports or analyses offhand.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1424564#c6
Flags: needinfo?(mreid)
Okay, as I see it, we have three options:

1) Stop reporting it at all. This is what the patch does. (then we close bug 1424564)
2) Stop reporting it on "some" systems. The definition of "some" might be: 64-bit, CFG-aware, or their intersection. (then we have to update the probe descriptions)
3) Fix the measure to discount CFG blocks. According to comment #16, the block isn't easy to identify reliably. (then we update the probe descriptions)

Option 1 seems to be unpopular (see comment #35 and comment #37)
Option 2 would fail to catch some leaks (see comment #35, bug 1432086)
Option 3 sounds a little loose, but may just be the best option.

So I have a proposal: rather than discounting CFG blocks specifically, let's discount any block over a given size (say, 2^30). We acknowledge in the probe descriptions that the VSIZE measures are imperfect, add a separate measure to count the sizes of the blocks we had to discount, and move on.

Sound like a plan?
One thing to consider is that finding the CFG block would need a(nother) linear walk of VirtualQuery, effectively doubling the time for a combined vsize+vsize_max_contiguous measurement, unless we somehow combine the two. I'm not sure how awkward that is with our current setup of "distinguished reporters".
As long as we're acknowledging that these measurements are imperfect... how bad would it be if we just subtract 2TB from the number on 64-bit CFG-aware OSes?
How hard is it to detect 64-bit CFG-aware? Maybe we simply report the number mod 2TB?
(In reply to Chris H-C :chutten from comment #41)
> How hard is it to detect 64-bit CFG-aware? Maybe we simply report the number
> mod 2TB?

One can call GetProcessMitigationPolicy https://msdn.microsoft.com/en-us/library/windows/desktop/hh769085(v=vs.85).aspx to detect CFG.
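A sketch of that check (assumes building against a Windows 8.1+ SDK where the CFG policy types are declared; a shipping build would look the function up at runtime, as the patch here does, since it doesn't exist on Windows 7):

  #include <windows.h>

  // Ask the OS whether CFG is enabled for the current process.
  static bool IsCFGEnabled() {
    PROCESS_MITIGATION_CONTROL_FLOW_GUARD_POLICY policy = {};
    if (!GetProcessMitigationPolicy(GetCurrentProcess(),
                                    ProcessControlFlowGuardPolicy,
                                    &policy, sizeof(policy))) {
      return false;  // call failed; assume no CFG
    }
    return policy.EnableControlFlowGuard != 0;
  }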

One thing I will mention is that sometime this year we will probably switch to a clang-cl build and CFG will go away. And then some time after that, we will get CFI (hopefully!) and something like this issue may reoccur.

So I guess it'd be good to avoid anything too permanent, especially if you like this metric and would like to use it in the future if/when it becomes useful again.
Product: Core → Firefox Build System
Priority: -- → P1
Attachment #8953602 - Flags: review+ → review?(dmajor)
I'm confused. The patch disables vsize reporting on CFG systems and I didn't really see any discussion about that. Is there a reason not to take the "subtract 2TB path"? Also, should we keep reporting enabled on 32-bit CFG systems? (Is there even such a thing?)
(In reply to David Major [:dmajor] from comment #44)
> Also, should we keep reporting enabled on 32-bit CFG
> systems? (Is there even such a thing?)

Yes, CFG is supported on x86 and x64 on Windows 8.1 with KB3000850 and above, and on Windows 10+.
(In reply to David Major [:dmajor] from comment #44)
> I'm confused. The patch disables vsize reporting on CFG systems and I didn't
> really see any discussion about that. Is there a reason not to take the
> "subtract 2TB path"? 

Because the size of the CFG page isn't 2TB, but is <some large figure, liable to change as our binary size changes>. Also, vsize max contiguous is slow (bug 1439222) on at least one CFG-enabled platform so I figured two birds, one stone.

> Also, should we keep reporting enabled on 32-bit CFG
> systems? (Is there even such a thing?)

Does CFG contribute to address exhaustion? Because if it does not, we really shouldn't include the number in our measurements. And if it does, then we should turn it off on 32-bit.
(In reply to Chris H-C :chutten from comment #46)
> Does CFG contribute to address exhaustion? Because if it does not, we really
> shouldn't include the number in our measurements. And if it does, then we
> should turn it off on 32-bit.

I don't know what the address space impact of 32-bit CFG is. They must have some less-memory-intensive (and possibly less-secure and/or slower) variation for 32-bit processes. It would be good to measure.

If we do wind up with CFG on 32-bit builds, I think we should at least report _something_ for vsize, whether it includes CFG areas or not. It would be a pity to turn off vsize on the arch where we need it the most.
Comment on attachment 8953602 [details]
bug 1436914 - Stop reporting virtual memory size from CFG-enabled processes

https://reviewboard.mozilla.org/r/222804/#review232780

I'd like to figure out the plan for win32 before this lands.

::: xpcom/base/nsMemoryReporterManager.cpp:1639
(Diff revision 2)
>  #endif
> +  , mIsCFGEnabled(false)
>  {
> +#ifdef XP_WIN
> +  HMODULE kernel32 = GetModuleHandleW(L"kernel32.dll");
> +  FARPROC ptr = GetProcAddress(kernel32, "GetProcessMitigationPolicy");

A couple nits on the API call.

If your compiler is able to see the declaration of GetProcessMitigationPolicy in the headers, then let's use decltype to avoid the duplicated typedef.

Although the return type on GetProcAddress is officially FARPROC, in practice most people don't write it out that way (`FAR` comes from ancient times). You could avoid spelling it out by combining it with the cast.

And for consistency with the GetModuleHandleW, I'd suggest GetProcAddressW. So roughly:

auto func = reinterpret_cast<decltype(&GetProcessMitigationPolicy)>(GetProcAddressW(..., L"..."));
if (func) ...

::: xpcom/base/nsMemoryReporterManager.cpp:2306
(Diff revision 2)
>  NS_IMETHODIMP
>  nsMemoryReporterManager::GetVsize(int64_t* aVsize)
>  {
>  #ifdef HAVE_VSIZE_AND_RESIDENT_REPORTERS
> +  if (mIsCFGEnabled) {
> +    aVsize = 0;

This should be `*aVsize`, and the same in the function below, though you could also change the ifdef to avoid the duplication:

#ifdef HAVE...
if (!mIsCFGEnabled) return ...
#endif // <-- no longer an #else
*aVsize = 0;
Attachment #8953602 - Flags: review?(dmajor)
My other duties are keeping me from working on this except infrequently. I'm going to unassign myself in the hope that someone with more time might find the motivation to drive it to completion.
Assignee: chutten → nobody
Status: ASSIGNED → NEW
No longer blocks: 1439222
See Also: → 1439222
Whiteboard: [MemShrink] → [MemShrink:P2]
Moving to P3 because of no activity for at least 24 weeks.
Priority: P1 → P3
I expect this is no longer valid after the switch to clang, since we don't have CFG enabled anymore.
Flags: needinfo?(chutten)
Did we switch in July? If so, this would seem to bear that out. https://mzl.la/2AE4Zjl
Flags: needinfo?(chutten)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED