Closed Bug 1220693 Opened 9 years ago Closed 9 years ago

6-14% linux64/win* regressions in ts_paint/sessionrestore/tpaint/etc. on Mozilla-Inbound (v.45) on Oct 31, 2015 from revision 7b23b9f96a36

Categories

(Core :: JavaScript: Internationalization API, defect)

defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla45
Tracking Status
firefox44 --- unaffected
firefox45 --- fixed

People

(Reporter: jmaher, Assigned: Waldo)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Attachments

(3 files, 1 obsolete file)

Talos has detected a Firefox performance regression from jwalden@mit.edu's commit https://hg.mozilla.org/integration/mozilla-inbound/rev/7b23b9f96a36 in bug 1208808.

*** PERF SHERIFFS WILL BACK OUT THE OFFENDING PATCH IN 48 HOURS! ***

This patch has regressed a reliable test on Windows by more than 5% so it will be backed out in 48 hours unless the author fixes the regression or explains why the regression should be ignored.

DETAILS OF 48HR BACKOUT POLICY:

This is a new policy and you can learn more about it here: https://groups.google.com/forum/#!topic/mozilla.dev.platform/QHdn-ogf8kQ

Still have questions? Get in touch with:
* A-Team perf-sheriffs: jmaher or vaibhav1994
* Perf team: vladan or avih
* A-Team perf tooling: wlach

DETAILS OF THE REGRESSION:

The regression is from:
* Pushlog: https://hg.mozilla.org/integration/mozilla-inbound/rev/7b23b9f96a36* Treeherder link: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=ef228c9912df&tochange=4a2306162371&filter-searchStr=mozilla-inbound%20talos%20other

We have re-triggered all Talos tests 5 times each for revision 7b23b9f96a36 and the previous revision 4744545d275c. Here is a list of all known regressions and improvements related to your bug:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=4744545d275c&newProject=mozilla-inbound&newRevision=7b23b9f96a36

FIXING THE REGRESSION:

- To learn more about the regressing test, see https://wiki.mozilla.org/Buildbot/Talos/Tests#ts_paint
- If you would like to test a potential fix, push to try with the following syntax:
try: -b o -p win64,win32 -u none -t other
Note that try pushes with Talos jobs will automatically link to a perfherder "comparison" view to show how much your fix improved performance.
You can add "mozharness: --spsProfile" to the try line above to capture a Gecko Profiler profile during the test run. 
- Alternately, to run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a ts_paint

Here is a list of alerts generated from graph server:
http://alertmanager.allizom.org:8080/alerts.html?rev=7b23b9f96a36cf4675191422d06ca8daddb736c8&showAll=1&testIndex=0&platIndex=0
:Jeff, can you look at this and let us know your plans to fix this?  The tests are now pretty noisy and we need to resolve this pretty quickly.
Flags: needinfo?(jwalden+bmo)
I am not sure who to cc, the author of the patch is jwalden@mit.edu, sadly that isn't a registered address in here which means we will have to trust that jwalden+bmo@mit.edu is the right person to ask for needinfo.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=342bfed777a3 simply comments out the ICU recreateDefault in JS::ResetTimeZone(), and it seems to recover the perf.

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=4744545d275c&newProject=try&newRevision=342bfed777a3&showOnlyImportant=0

I think we can defer that ICU updating, somewhat, until we actually need to consult ICU for date/time formatting/computation operations.  Maybe.  I'll see if I can get something along those lines done, recovering things on all platforms.
Flags: needinfo?(jwalden+bmo)
Egad, rabbit holes.  All this just to use an enum class for nice enum-qualifications, and only because (it looks like) OS X alone doesn't have reliable <atomic> yet.  Bah.
Attachment #8682357 - Flags: review?(nfroyd)
Assignee: nobody → jwalden+bmo
Status: NEW → ASSIGNED
I've gotten enough try-server/perfherder feedback on the approach being right that I think this is ready to review.  Not to mention that 48-hour hardnosed backout time and all.  Bah**2.

Try-run of both patches together, 'cause obviously changing Atomics.h (not to mention these changes) is a bit tricky, and the last try-run of just this patch was broken a few places because of that lack of Atomics.h changing:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3fff215dd4e7
Attachment #8682359 - Flags: review?(till)
:Waldo, thanks for jumping on this!  I looked over the try push and the results are much better.  Looking forward to resolving this :)
Comment on attachment 8682357 [details] [diff] [review]
Make mozilla::Atomic<enum class> work even on compilers that don't have <atomic>

Review of attachment 8682357 [details] [diff] [review]:
-----------------------------------------------------------------

Just...wow.

::: mfbt/Atomics.h
@@ +320,5 @@
>  {
>  };
>  
> +template<typename T>
> +struct ToStorageTypeArgument

I don't think you need this, given the definition(s) of ToStorageTypeArgument further down?
Attachment #8682357 - Flags: review?(nfroyd) → review+
Comment on attachment 8682359 [details] [diff] [review]
Lazily trigger ICU default-time-zone recreation only at the instant where the new default time zone is going to be used, rather than every time the time zone might have changed, as apparently ICU's default-time-zone computation is a lot of work

Review of attachment 8682359 [details] [diff] [review]:
-----------------------------------------------------------------

till is on PTO and stymied by 2fa, but he looked over the diff and gave r+ over email.  (OR DID HE?, claiming not to have 2fa.  ;-) )
Attachment #8682359 - Flags: review?(till) → review+
/me remembers to use the 2fa excuse :)
thanks, this fixed the data on talos!
https://hg.mozilla.org/mozilla-central/rev/12ee37915e07
https://hg.mozilla.org/mozilla-central/rev/ee397c027df9
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
This change regressed bug 1208808
(In reply to Carsten Book [:Tomcat] from comment #13)
> https://hg.mozilla.org/releases/mozilla-b2g44_v2_5/rev/12ee37915e07
> https://hg.mozilla.org/releases/mozilla-b2g44_v2_5/rev/ee397c027df9

this got reverted since we reverted that wrong merge but seems bugzilla had problems in this bug to reset the flags
So, um, the patch here was stupidly wrong kind of, in that I forgot to release the implicit lock I was trying to implement.  Wut.  Fixed that.  Also, as peoples on IRC noted, and as I realized/concluded when I wrote out the comment there describing the state transitions, the current code can coalesce multiple JS::ResetTimeZone() into a single icu::TimeZone::recreateDefault().  Which on second thought isn't quite the semantics we want here, if two resets happened super-rapidly.  So make ResetTimeZone's compare-exchange be attempted til it succeeds, so that one ResetTimeZone triggers exactly one recreateDefault().

sfink is actually around now and not either on PTO or sick or something, plus it's not late'o'clock where he is, so poking this at him, and putting a f?till so he's aware of us being stupid here and maybe will learn or something.  :-)

I still have no idea how I forgot to release the lock here.  Bah.
Attachment #8688714 - Flags: review?(sphink)
Attachment #8688714 - Flags: feedback?(till)
Comment on attachment 8688714 [details] [diff] [review]
Actually implement the icu::TimeZone::recreateDefault() locking scheme at all properly

Review of attachment 8688714 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good enough to me to land, but I'll confess I'm not really that familiar with concurrency models, and the one thing that makes me nervous here is that I don't understand why the

  js::DefaultTimeZoneStatus = IcuTimeZoneStatus::Valid;

write would necessarily ever get flushed from CPU 1's cache so that CPU 2's

  while (!js::DefaultTimeZoneStatus.compareExchange(Valid, NeedsUpdate)) {}

would stop spinning. At least not until CPU 1 does another compareExchange, at one or the other callsites, which I assume would force the flush. Is there a way to force that write to be visible? Or am I just inventing problems here that aren't an issue in reality because <mumble> <mumble> C++11 <mumble> modern processors <mumble>?
Attachment #8688714 - Flags: review?(sphink) → review+
Ok, to answer my own question, I *think* the answer is that DefaultTimeZoneStatus being atomic means that it has acquire-release semantics. Specifically, the write is considered a release, and the reads in the compareExchange spinloop are acquires. That gives us a synchronizes-with relationship from the write to the read, so the spinloop is guaranteed to see the updated value.

I think. The above guess is based on http://preshing.com/20130823/the-synchronizes-with-relation/
Comment on attachment 8688714 [details] [diff] [review]
Actually implement the icu::TimeZone::recreateDefault() locking scheme at all properly

Review of attachment 8688714 [details] [diff] [review]:
-----------------------------------------------------------------

Stupid me, thinking I'd be qualified to review this :/

I fully agree with what sfink says. But only if he's right.
Attachment #8688714 - Flags: feedback?(till) → feedback+
So the previous patch works, sort of.  But if we're looping until a Valid->NeedsUpdate occurs, that means that |JS::ResetTimeZone(); JS::ResetTimeZone();| will hang until some other thread does an Intl.DateTimeFormat creation.  Which, uh, clearly not.

I fiddled around trying to come up with a lock-free algorithm for this and came up basically dry.  (lth came up with something that was totally crazy and clearly unshippable, we both agreed.  :-) )  So just do the absolute stupidest thing and track ICU default time zone status behind a spinlock.  Doesn't get any simpler and obviously correct than this, and I can get on with my life.
Attachment #8689772 - Flags: review?(sphink)
Attachment #8689772 - Flags: feedback?(till)
Attachment #8688714 - Attachment is obsolete: true
Comment on attachment 8689772 [details] [diff] [review]
Put the needs-update status of the ICU default time zone behind a spinlock, as the simplest/stupidest possible thing that works

Review of attachment 8689772 [details] [diff] [review]:
-----------------------------------------------------------------

Ugh, sorry I missed that.

And it's not enough to change

  while (cmpxchg(state, valid, needsupdate))
    continue;

to

  while (state == needsupdate || cmpxchg(state, valid, needsupdate))
    continue;

? (/me intentionally resists thinking too hard about it.)

Anyway, the new code looks ok afaict. Though I keep having visions of lots of threads spinning while the update is happening (and then one proceeds, and the rest spin, and by the time they all process, the first thread decides it needs to reset it yet again...)

But I guess that's what you get if you keep resetting the darn timezone.
Attachment #8689772 - Flags: review?(sphink) → review+
Reading a non-atomic field across threads without synchronization is a data race that officially triggers undefined behavior, and that TSan won't like.  You have to have some sort of syncing primitive in play for it to be okay to read across threads, even for a while-loop thing like this.

As for the contending thing, yeah, not worth caring.  :-)
Comment on attachment 8689772 [details] [diff] [review]
Put the needs-update status of the ICU default time zone behind a spinlock, as the simplest/stupidest possible thing that works

Approval Request Comment
[Feature/regressing bug #]: bug 1208808, this bug's prior patch

[User impact if declined]: changing time zone on b2g won't be reflected in the behavior of Intl.DateTimeFormat instances created using the local time zone, or local-time-based Date toLocaleString/toDateString/toTimeString -- but *will*, inconsistently, be reflected in Date.prototype.toLocaleFormat; we really want the two cases to be consistent

[Describe test coverage new/current, TreeHerder]: none; I think a test *could* be written, but it's not clear the effort is worthwhile, vs. relying on the manual testing of the sort that resulted in bug 1208808

[Risks and why]: low -- converts a trickier approach to a bog-standard spinlock; also this *fixes* significant brokenness of the current code, so is clearly desirable

[String/UUID change made/needed]: N/A
Attachment #8689772 - Flags: approval-mozilla-aurora?
Zibi, would you be able to verify on the latest Nightly build that the issue you reported in bug 1208808 is fixed as expected? Thanks!
Flags: needinfo?(gandalf)
I can confirm that on latest (20151124) builds for flame and aries I can no longer reproduce the bug and date/time changes and adjusts properly.
Flags: needinfo?(gandalf)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #27)
> I can confirm that on latest (20151124) builds for flame and aries I can no
> longer reproduce the bug and date/time changes and adjusts properly.

Thanks! Changing the status to verified.
Status: RESOLVED → VERIFIED
Comment on attachment 8689772 [details] [diff] [review]
Put the needs-update status of the ICU default time zone behind a spinlock, as the simplest/stupidest possible thing that works

Given that this patch was verified and baked in Nightly for 2 days, I feel comfortable taking the fix in Aurora44.
Attachment #8689772 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
has problems during aurora uplift, 

grafting 316194:4c3fe9bc8857 "Bug 1220693 - Put the needs-update status of the ICU default time zone behind a spinlock, as the simplest/stupidest possible thing that works.  r=sfink"
merging js/src/builtin/Intl.cpp
merging js/src/vm/DateTime.cpp
merging js/src/vm/DateTime.h
warning: conflicts while merging js/src/builtin/Intl.cpp! (edit, then use 'hg resolve --mark')
warning: conflicts while merging js/src/vm/DateTime.cpp! (edit, then use 'hg resolve --mark')
warning: conflicts while merging js/src/vm/DateTime.h! (edit, then use 'hg resolve --mark')
abort: unresolved conflicts, can't continue
(use hg resolve and hg graft --continue)
could you take a look, thanks!
Flags: needinfo?(jwalden+bmo)
Hmm.  I could have sworn I checked the timing when I made the aurora request, but it looks like bug 1208808's patch landed *after* the last uplift (October 29), on October 31 per bug 1208808 comment 24 (and was merged into m-c on November 2, even further after October 29).  So I guess nothing needs to be done here, unless I'm somehow further screwing up this bug-reading.  :-\
Flags: needinfo?(jwalden+bmo)
Attachment #8689772 - Flags: feedback?(till)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: