Closed Bug 1418296 Opened 4 years ago Closed 4 years ago

3.73% Resident Memory (linux32-stylo-disabled and win32) regression on push e74e4ec2cb39 (Wed Nov 15 2017)

Categories

(Core :: JavaScript: Internationalization API, defect, P2)

x86
Linux
defect

Tracking


RESOLVED INCOMPLETE
Tracking Status
firefox-esr52 --- unaffected
firefox58 --- unaffected
firefox59 --- wontfix
firefox60 --- wontfix

People

(Reporter: igoldan, Assigned: m_kato, NeedInfo)

References

(Blocks 1 open bug)

Details

(Keywords: perf, regression, Whiteboard: [MemShrink:P1][stylo])

Attachments

(1 file)

We have detected an awsy regression from push:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=815b8640da1e3029dd680e4187b68852d36e2a4e&tochange=e74e4ec2cb39d737da0e16d0645e9dbdc564acd1

Because you authored one of the patches included in that push, we need your help to address this regression.

Regressions:

  4%  Resident Memory linux32-stylo-disabled opt stylo-disabled     391,463,779.41 -> 406,046,931.70


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=10587

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/AWSY/Tests
Component: Untriaged → JavaScript: Internationalization API
Product: Firefox → Core
:anba, please look into this regression. Can we fix it, or should we back it out / accept it?
Flags: needinfo?(andrebargull)
ICU is now so widely used in Firefox that a backout is ruled out from my point of view, because we always want to stay on the latest version for security fixes and up-to-date localization data. Fixing the regression is difficult, if not impossible, without any information about which data exactly caused the increased resident memory usage. It could be:
- New or changed localization data in ICU
- New or changed code in ICU
- Data structures in Gecko which cache or use data from ICU and now need to store more data because ICU provides more data
- Data structures in SpiderMonkey which cache or use data from ICU ...
- bug 1405993 also included a Unicode update, maybe additional Unicode data increased the memory usage
- and so forth.


:Waldo, do you think there's anything actionable for this bug?
Flags: needinfo?(andrebargull) → needinfo?(jwalden+bmo)
Priority: -- → P2
(In reply to André Bargull [:anba] from comment #2)
> Fixing the regression is difficult, if not impossible, without any information
> about which data exactly caused the increased resident memory usage.
All our subtests saw at least 3% regressions. These ones are bigger and more consistent:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=907fb4683dfef81fbb746035939ac8154b66dfe5&newProject=mozilla-inbound&newRevision=94111a55e8c3dfc3cb526650c7a18ea9e7f5d85e&originalSignature=31fd0678d8e90fb268261b5df34a7e08897caaf4&newSignature=31fd0678d8e90fb268261b5df34a7e08897caaf4&showOnlyImportant=1&framework=4
Whiteboard: [memshrink]
(In reply to André Bargull [:anba] from comment #2)
> ICU is now so widely used in Firefox that a backout is ruled out from my
> point of view, because we always want to stay on the latest version for
> security fixes and up-to-date localization data. Fixing the regression is
> difficult, if not impossible, without any information about which data
> exactly caused the increased resident memory usage. It could be:
> - New or changed localization data in ICU
> - New or changed code in ICU
> - Data structures in Gecko which cache or use data from ICU and now need to
> store more data because ICU provides more data
> - Data structures in SpiderMonkey which cache or use data from ICU ...
> - bug 1405993 also included a Unicode update, maybe additional Unicode data
> increased the memory usage
> - and so forth.
> 
> 
> :Waldo, do you think there's anything actionable for this bug?

I'd like to separate the need for security updates from large table updates. 

Kato-san: can you suggest a way to strip this down? Thx!
Flags: needinfo?(m_kato)
Separating security updates from ICU bumps overall would, I think, require us to audit every upstream commit to ICU and perhaps also to CLDR (since we could imagine changes in CLDR data tickling bugs in the ICU code that parses that data to build up ICU-specific serialized representations).  This is not something we have historically done.  I'm not sure it's something we reasonably *can* do.  We'd have to become experts in all aspects of ICU to understand which changes might or might not have security implications.  And the mere attempt to do so has its own risks.

Do we have any sorts of tools we could use to divine what precise thing allocated more memory here?  Could we add some sort of logging code to pre/post-update builds to dump ICU memory usage to figure this out maybe?
Flags: needinfo?(jwalden+bmo)
[Tracking Requested - why for this release]: large memory regression that is not understood
We can't just ship large memory regressions because nobody understands the code we're shipping.
> We can't just ship large memory regressions because nobody understands the code we're shipping.

Agree. I believe the solution is to invest in understanding ICU/CLDR better including:
 - build system integration
 - how we select what we build
 - what tables/languages get added/removed between versions of CLDR/ICU
 - what is the package size / memory / perf impact of an update

As far as I know, :m_kato is the only person occasionally investigating those issues, while :waldo, :anba and :jfkthame put in work to maintain our integration and prevent technical debt from accruing by keeping us updated to the latest ICU/CLDR/Unicode.

Steven Loomis of IBM has been working on tooling to improve selection/slicing of data tables when building ICU, and I filed bug 1402379 to get us to a slightly better place on how we handle ICU/CLDR/Intl API integration into SpiderMonkey and Gecko.
Flags: needinfo?(srl)
The ICU data is 600KB+ larger as of 60.1.  Hmm, I will create an en-US-only package to compare memory usage.

BTW, why is this issue linux32-stylo-disabled only?  (Why don't we run it on linux32?)
Flags: needinfo?(m_kato)
Did this regression show up on any other platforms besides Linux32?
Flags: needinfo?(igoldan)
Hmm, does it depend on the OS page size?  "stylo-disabled" isn't a real user configuration.  I wonder why we don't run AWSY on linux32 with stylo enabled.
stylo-disabled is a real config for android users and users who disable it for specific extensions.  I am curious why this isn't run for regular linux32 and it looks like we intentionally shut it off in bug 1365708.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #13)
> stylo-disabled is a real config for android users and users who disable it
> for specific extensions.  I am curious why this isn't run for regular
> linux32 and it looks like we intentionally shut it off in bug 1365708.

Given that 32-bit Linux is such a small percentage of users, I think we decided it wasn't worth dedicating resources to perf testing. It does appear we regressed the win32 tabs open measures as well [1]. I'll spin up a few more runs to get better numbers.

FWIW we enabled stylo on android in bug 1366049. Eventually we're removing the old style system in bug 1395112, so disabling it won't be an option. The main advantage of testing with stylo disabled is that we can compare memory usage of the new style system to the old one; with 57 shipped, that's probably less important.

[1] https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=907fb4683dfef81fbb746035939ac8154b66dfe5&newProject=mozilla-inbound&newRevision=94111a55e8c3dfc3cb526650c7a18ea9e7f5d85e&originalSignature=c10a99a5c6a58850b7f9039fc7a4eb7b9ca2fc46&newSignature=c10a99a5c6a58850b7f9039fc7a4eb7b9ca2fc46&framework=4
I can track this for 59, and follow up next week to see if there is progress.
Assignee: nobody → m_kato
Can you reproduce the memory increase using only ICU api calls? We frequently run Valgrind against ICU to make sure all memory is cleaned up. So there should not be wasted memory.
Flags: needinfo?(srl)
(In reply to Steven R. Loomis from comment #16)
> Can you reproduce the memory increase using only ICU api calls? We
> frequently run Valgrind against ICU to make sure all memory is cleaned up.
> So there should not be wasted memory.

This isn't a leak. This is an increase in memory usage while Firefox is running.
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #14)
> (In reply to Joel Maher ( :jmaher) (UTC-5) from comment #13)
> > stylo-disabled is a real config for android users and users who disable it
> > for specific extensions.  I am curious why this isn't run for regular
> > linux32 and it looks like we intentionally shut it off in bug 1365708.
> 
> Given 32-bit Linux is such a small percentage of users I think we decided it
> wasn't worth  dedicating resources to perf testing.

I'm inclined to firefox-59:fix-optional this, then, but given this:

> FWIW we enabled stylo on android in bug 1366049. Eventually we're removing
> the old style system in bug 1395112, so disabling it won't be an option. The
> main advantage of testing with stylo disabled is that we can compare memory
> usage of the new style system to the old one; with 57 shipped, that's
> probably less important.

do we need to keep this open to track regressions with stylo on android?
This doesn't have anything to do with Stylo being enabled or not. We just happen to only run this particular configuration with Stylo disabled. Per comment 13, I think this affects all 32 bit systems.
Summary: 3.73% Resident Memory (linux32-stylo-disabled) regression on push e74e4ec2cb39 (Wed Nov 15 2017) → 3.73% Resident Memory (linux32-stylo-disabled and win32) regression on push e74e4ec2cb39 (Wed Nov 15 2017)
(In reply to Andrew McCreight [:mccr8] from comment #17)
> (In reply to Steven R. Loomis from comment #16)
> > Can you reproduce the memory increase using only ICU api calls? We
> > frequently run Valgrind against ICU to make sure all memory is cleaned up.
> > So there should not be wasted memory.
> 
> This isn't a leak. This is an increase in memory usage while Firefox is
> running.

OK, thanks.  Is this comparing ICU 59.1 to 60.1?  ICU does have configurable caching mechanisms; perhaps there's something that can be tweaked. A standalone test case (just calling ICU code) that simulates your usage and reproduces this difference would be helpful.
FWIW I think we're looking at more like a 45MiB regression in the 'Tabs Open' measure of RSS on Win32. That measure is pretty noisy so it's hard to tell for sure, but this is rather concerning.
I was also pointed to "Smaller files in breakiterator 9954" http://site.icu-project.org/download/60#TOC-Common-Changes — BreakIterator does use more memory in 60 in exchange for smaller data files (1.2 MB to 0.8 MB), because it caches break positions.
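To make the trade-off described above concrete, here is a minimal sketch (plain Python, not ICU code; the class and its behavior are invented for illustration) of an iterator that caches computed break positions: repeated scans get cheaper, but the cache grows with every position queried, which is how smaller data files can be traded for a larger resident footprint.

```python
# Illustrative only: a toy "break iterator" that finds word breaks at
# spaces and memoizes each result.  ICU 60's real BreakIterator caching
# is far more sophisticated; this just shows the space-for-time shape.
class CachingBreaker:
    def __init__(self, text):
        self.text = text
        self._cache = {}  # start index -> next break index

    def next_break(self, pos):
        if pos not in self._cache:            # compute once...
            nxt = self.text.find(" ", pos)
            self._cache[pos] = len(self.text) if nxt == -1 else nxt
        return self._cache[pos]               # ...then serve from cache

b = CachingBreaker("one two three")
assert b.next_break(0) == 3
assert b.next_break(4) == 7
assert b.next_break(8) == 13      # no more spaces: end of text
assert len(b._cache) == 3         # memory grows with positions queried
```

Every lookup after the first is a dictionary hit, but the cache is never evicted, so long-lived iterators keep the memory.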
This was independently filed upstream as https://ssl.icu-project.org/trac/ticket/13532
(In reply to Steven R. Loomis from comment #22)
> I was also pointed to "Smaller files in breakiterator 9954"
> http://site.icu-project.org/download/60#TOC-Common-Changes — BreakIterator
> does use more memory in 60 in exchange for smaller data files (1.2mb to 0.8
> mb), because it caches break positions.

BreakIterator might be used only by Intl.RelativeTimeFormat (bug 1270140), and we don't include the break rule data files in our icudt*.dat.
Since this bug is not caused by Stylo, I am adding a whiteboard tag to remove this bug from the Stylo bug queries.
Whiteboard: [memshrink] → [memshrink][stylo]
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6b8e2af73aa5bd730dd1179eb013f0f44e5bf6f3

Also, when I shrank the data file (en and root only) for testing, it didn't reduce resident memory.
Whiteboard: [memshrink][stylo] → [MemShrink:P1][stylo]
Are we able to do anything about this regression or do we just have to eat it since we want to take the new version of ICU?
Flags: needinfo?(m_kato)
(In reply to Andrew Overholt [:overholt] from comment #27)
> Are we able to do anything about this regression or do we just have to eat
> it since we want to take the new version of ICU?

Has anyone evaluated https://ssl.icu-project.org/trac/ticket/13532 compared to the Mozilla code? For example, are you "creating large numbers of break iterators" where they might be able to be reused? It's possible the caches have some control on them.
(In reply to Steven R. Loomis from comment #28)
> (In reply to Andrew Overholt [:overholt] from comment #27)
> > Are we able to do anything about this regression or do we just have to eat
> > it since we want to take the new version of ICU?
> 
> Has anyone evaluated https://ssl.icu-project.org/trac/ticket/13532 compared
> to the Mozilla code? For example, are you "creating large numbers of break
> iterators" where they might be able to be reused? It's possible the caches
> have some control on them.

There's a linked ticket with an initial patch… it is aimed at speeding up initial time, but may reduce heap also.
According to the try server, this might be a regression from the following ICU landing.

svn log -r r40076
------------------------------------------------------------------------
r40076 | roubert | 2017-04-27 05:12:01 +0900 (Thu, 27 Apr 2017) | 1 line

ticket:13094: Handle empty language subtag in parseTagString().
------------------------------------------------------------------------
(In reply to Steven R. Loomis from comment #28)
> (In reply to Andrew Overholt [:overholt] from comment #27)
> > Are we able to do anything about this regression or do we just have to eat
> > it since we want to take the new version of ICU?
> 
> Has anyone evaluated https://ssl.icu-project.org/trac/ticket/13532 compared
> to the Mozilla code? For example, are you "creating large numbers of break
> iterators" where they might be able to be reused? It's possible the caches
> have some control on them.

We don't use the break rules.  So even if I apply that patch, it won't improve anything.
(In reply to Makoto Kato [:m_kato] from comment #31)
> This might be regression by the following ICU landing according to try
> server.
> 
> svn log -r r40076
> ------------------------------------------------------------------------
> r40076 | roubert | 2017-04-27 05:12:01 +0900 (Thu, 27 Apr 2017) | 1 line
> 
> ticket:13094: Handle empty language subtag in parseTagString().
> ------------------------------------------------------------------------

http://bugs.icu-project.org/trac/changeset/40076 caused a 3.73% increase in memory usage? Anyway, we need information about the content of the used memory, i.e. what you're actually seeing.
Flags: needinfo?(m_kato)
(In reply to Steven R. Loomis from comment #33)
> (In reply to Makoto Kato [:m_kato] from comment #31)
> > This might be regression by the following ICU landing according to try
> > server.
> > 
> > svn log -r r40076
> > ------------------------------------------------------------------------
> > r40076 | roubert | 2017-04-27 05:12:01 +0900 (Thu, 27 Apr 2017) | 1 line
> > 
> > ticket:13094: Handle empty language subtag in parseTagString().
> > ------------------------------------------------------------------------
> 
> http://bugs.icu-project.org/trac/changeset/40076 caused a 3.73% increase in
> memory usage? Anyway, need info about the content of the used memory, what
> you're actually seeing.

As far as our test infra shows, it causes this issue.  When I backed this out from 60.2, the regression was resolved.  Do you know why Resident Memory (see https://wiki.mozilla.org/AWSY/Tests) is increased by r40076?
Flags: needinfo?(srl)
http://people.mozilla.org/%7Ejmaher/taloszips/zips/tp5n.zip is no longer available, so where is the current tp5n.zip?
Flags: needinfo?(jmaher)
ICU 60.2
Resident Memory opt stylo-disabled: 468,086,541.99

ICU 60.2 + backed out r40076
Resident Memory opt stylo-disabled: 457,160,950.85

Although I added an assertion for the condition (*langLength == 0 and _isIDSeparator(*position) is true), the build never stops at that assertion.  This might be a compiled-code issue.
you can find tp5n.zip here:
https://github.com/rwood-moz/talos-pagesets/
Flags: needinfo?(jmaher)
:waldo, according to the try server's AWSY data, this is caused by http://bugs.icu-project.org/trac/changeset/40076 (http://bugs.icu-project.org/trac/ticket/13094).  I added an assertion on (*langLength == 0 and _isIDSeparator(*position) is true), and tp5n never reaches this condition.  So I think this issue depends on the code generated by the compiler.

Of course, when we back out that change, the Resident Memory usage goes back down.  But backing it out is risky, given ICU's ticket.  And AWSY on linux32 is already disabled.  So I recommend this be wontfixed.  What do you think?
Flags: needinfo?(jwalden+bmo)
(In reply to Makoto Kato [:m_kato] from comment #39)
> :waldo, according to the try server's AWSY data, this is caused by
> http://bugs.icu-project.org/trac/changeset/40076
> (http://bugs.icu-project.org/trac/ticket/13094).  I added an assertion on
> (*langLength == 0 and _isIDSeparator(*position) is true), and tp5n never
> reaches this condition.  So I think this issue depends on the code generated
> by the compiler.
> 
> Of course, when we back out that change, the Resident Memory usage goes
> back down.  But backing it out is risky, given ICU's ticket.  And AWSY on
> linux32 is already disabled.  So I recommend this be wontfixed.  What do
> you think?

This is an issue on win32 as well (see comment 21). Given we know the exact patch that is regressing it seems like we can figure out a fix.
We're running out of time in the Fx59 development cycle to fix this. Are there any recent updates?
I grabbed a copy of the try repository plus the two revs tested to diagnose the RSS regression.  This is the full diff.  Nothing in this except the one changed line.

Looking back at the original change in Trac and the discussion of it, the patch logic checks out.  It's definitely the correct and proper change for what the code is trying to do.  Reverting it would actively reintroduce the problem identified in that change.  Cherry-picking a revert of this one change would not be correct.

I don't see a plausible mechanism by which this one-line change leads to a 25-45MB regression.  The statement that was moved from an else-arm to happen all the time should only be hit if a tag is parsed that lacks a language component -- but this shouldn't happen in the browser at all for JS on the web.  Every language tag SpiderMonkey hands off to ICU is first syntax-checked to verify that the tag is structurally valid per BCP47.  BCP47 (in contrast to the structure of the $LANG environment variable, which this function was additionally written to process) requires a language component in language tags.  ICU probably does use this function to parse $LANG on Firefox's behalf -- but this would likely happen only rarely.  Maybe only when getting the default locale in Intl code (and we actually cache that value pretty hard, so even that would not happen regularly).  Besides, I rather doubt treeherder is running with one of these weird language-less $LANG values in the first place.
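For illustration, here is a minimal sketch (Python; the function is hypothetical, and real BCP47 validation is far more involved) of the kind of syntax check described above: a tag with an empty language subtag is rejected before it could ever reach ICU's tag parsing.

```python
import re

# Hypothetical, simplified check: RFC 5646 (BCP47) requires a language
# tag to start with a primary language subtag of 2-8 ASCII letters.
# SpiderMonkey's real validator is much more thorough; this only shows
# why a tag lacking a language component (e.g. "-US", as a $LANG-style
# string can be) never survives to ICU's parseTagString.
def has_language_subtag(tag):
    return re.match(r"[A-Za-z]{2,8}(-|$)", tag) is not None

assert has_language_subtag("en-US")
assert has_language_subtag("ja")
assert not has_language_subtag("-US")   # empty language subtag
assert not has_language_subtag("")
```

Under this shape of check, only $LANG-style inputs that bypass the web-facing validation could exercise the moved statement, matching the argument that the change should be hit rarely in the browser.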

I also repushed the two revs in comment 34 to double-check the RSS stats:

Before:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8a9233cfcfc9186bc336d87bc284a4e6fedd868e&selectedJob=161078449
After:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=35930e7a008de82c0cce926f939772ede5ffc067

The linux32 sy numbers are all over the map.  I don't see a consistent regression in them at all.  Nor do I see one in win32 numbers.

m_kato also tells me he set up tp5 locally to try to find the regression alleged in comment 14 and comment 21 and wasn't seeing it.

Given that https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=mozilla-inbound,1557588,1,4&series=mozilla-inbound,1557638,1,4&series=mozilla-inbound,1308546,1,4&selected=mozilla-inbound,1308546,284784,380134165,4 isn't really revealing a regression here, and given that the mechanism by which a one-line change would result in such vast memory swing is also unclear -- at worst I would expect cost linear in the number of almost-always-small language tags parsed -- I am increasingly convinced we're just shadowboxing here, and there is not and never was an issue.

So: at this point this whole bug feels like an INCOMPLETE or INVALID to me.  Either the two identified revs are not the cause of the regression, or there isn't a regression at all.  And given the long-term graph doesn't show a regression, that seems much likelier to me.
Flags: needinfo?(jwalden+bmo)
Alright, I guess we'll close this and hope that this is indeed a fluke. Thanks for investigating.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE