Closed Bug 1631593 Opened 6 years ago Closed 5 years ago

Make Localization.jsm stateless

Categories

(Core :: Internationalization, task, P3)

task

Tracking

()

RESOLVED FIXED
mozilla79
Tracking Status
firefox79 --- fixed

People

(Reporter: zbraniecki, Assigned: zbraniecki)

References

(Blocks 1 open bug)

Details

Attachments

(10 files, 1 obsolete file)

47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
5.56 KB, text/plain
Details
6.02 KB, patch
Details | Diff | Splinter Review
1.21 KB, patch
Details | Diff | Splinter Review
47 bytes, text/x-phabricator-request
Details | Review

As the next step on the path to bug 1613705 after bug 1627809 I'd like to make JSM class stateless.

This would mean that all state is stored in C++ and JSM is called only to perform async iteration. This would make it easy to load the JSM lazily the first time we need to translate things until we remove it completely.

Blocks: 1613705
Depends on: 1627809
Assignee: nobody → gandalf
Status: NEW → ASSIGNED

With this patchset (on top of bug 1627809), and if I disable localization of 6 strings from Web Components, I can once again lazy-load JSM only when the initial translation is required!

I'll clean it all up a bit and measure perf.

Olli, this is not yet a review request, but I'm running into a strange performance regression between D71814 and D71815.

The whole patchset shows a set of regressions in about_newtab_with_snippets, about_newtab_with_snippets responsiveness and tabswitch

Here's a perfherder between bug 1627809, and this patchset: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8a7e913268d4abf76b3a37234e178f91a4739822&newProject=try&newRevision=7fe2a312f72cab5eb4434e13286a1434ef6991a0&framework=1

I went through the patchset here one by one, and the first one to show a significant regression is D71815. Here's perfherder for it - https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=213caddd78ada33c36794113323274bf83161621&newProject=try&newRevision=4b9bf729a0cbd7d5a804a560efb5ac24a6565e7f&framework=1

Before this patch, I store generateBundles and generateBundlesSync in C++ and pass it to Localization.jsm where in onChange [0] generates bundle iterator (this.bundles) and stores it in JS.
In most runtime scenarios generateBundles and generateBundlesSync is null and Localization.jsm uses default generator functions.

After this patch, I call regenerateBundles in JSM which returns the bundle iterator to C++ where it is stored as mBundles. Then I pass it to each format* function.

The only thing I can say is that while custom generateBundles and generateBundlesSync is rare, the iterator stored in mBundles is there always.
I wanted to move in this direction so that next I can move the mBundles generation to L10nRegistry and eventually move the remaining functions to C++ and remove Localization.jsm all together.

Can you take a look at this patch and tell me if anything potentially performance sensitive looks suspicious in it - maybe rooting is costly? Or maybe there's a better way to store an async iterator over FluentBundle (IDL backed)?

[0] https://searchfox.org/mozilla-central/source/intl/l10n/Localization.jsm#499

Flags: needinfo?(bugs)

I don't see immediately anything suspicious. Perf profiles would be good.

Flags: needinfo?(bugs)
Priority: -- → P3
Attached file minimized testcase

Here's the minimized testcase of the problem.

And here's talos run: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f1e988e5c81fd67012f460c59060f55e159d9781&newProject=try&newRevision=c098de981f2fc644f728cd08c43bc81fe3e49147&framework=1

As per previous talos runs, there's no impact on tspaint, tpaint or sessionrestore, but there is a quite substantial one on about_newtab_with_snippets - both firstpaint and responsiveness.

The minimized testcase is a bit oversimplified - in the real patch I am moving this.bundles to be stored in C++, but that requires an update to a number of functions and I verified that the moment the regression appears is when I assign the result of the call to JSM (which is the iterator returned by the cached method) to mBundles.
It happens both in the full D71815 patch and in this minimized testcase, so I hope this testcase is going to be useful to track down the reason.

I worked today with :emilio and :smaug on #dom looking for what may cause it and we couldn't find anything in CC log, profile etc. but I noticed that it doesn't seem like the regression appears with tpcycles 1 and it only is visible with tpcycles 10 (default in talos is 20).
It may be some leak, or something related to garbage collection, and :smaug suggested that maybe there's some large object that we have to allocate/deallocate and trace from C++ with the patch.

:emilio noticed that the about:newtab uses fluent-react which uses its own Localization instance and calls to L10nRegistry, but I'm not sure if it also uses the WebIDL Localization or not. If not, why does this change affect that page at all?

I'm going to NI :emilio and :smaug in hopes that they may find time to help me further figure that out.

Flags: needinfo?(emilio)
Flags: needinfo?(bugs)

I think there's a way to get a profile from a Talos run... That's the easiest way to figure this out. Alexandru, do you know how?

Flags: needinfo?(emilio) → needinfo?(aionescu)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #11)

I think there's a way to get a profile from a Talos run... That's the easiest way to figure this out. Alexandru, do you know how?

You can pass --gecko-profile to the try push and then they show up when selecting a run in treeherder. In this case you'd probably want something like ./mach try fuzzy -q "'talos-chrome-e10s" --gecko-profile.

I'll try that next.

Removing r?'s because I can't be sure this patchset will end up being a possible way forward.

(In reply to Brian Grinstead [:bgrins] from comment #12)

You can pass --gecko-profile to the try push and then they show up when selecting a run in treeherder. In this case you'd probably want something like ./mach try fuzzy -q "'talos-chrome-e10s" --gecko-profile.

The way I've been launching the talos run is:

./mach try -b o -p all -u none -t chromez --rebuild-talos 20

since adding --gecko-profiler is not working with this syntax, do you know how can I produce that? fuzzy doesn't seem to allow to put --rebuild-talos 20.

Flags: needinfo?(bgrinstead)
Flags: needinfo?(aionescu)

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #14)

(In reply to Brian Grinstead [:bgrins] from comment #12)

You can pass --gecko-profile to the try push and then they show up when selecting a run in treeherder. In this case you'd probably want something like ./mach try fuzzy -q "'talos-chrome-e10s" --gecko-profile.

The way I've been launching the talos run is:

./mach try -b o -p all -u none -t chromez --rebuild-talos 20

since adding --gecko-profiler is not working with this syntax, do you know how can I produce that? fuzzy doesn't seem to allow to put --rebuild-talos 20.

If you are taking a profile then you shouldn't need rebuilds - you don't need the pushes for comparison, just to gather a single profile.

But if you do want to re-run talos with fuzzy then you can use the --rebuild flag instead of --rebuild-talos

Flags: needinfo?(bgrinstead)

I ran two try runs as per comment 12 suggestion:

Unfortunately I don't seem to be getting talos numbers out of them: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1c06a5c411fb3e50d5501044b3269a2dc6196834&newProject=try&newRevision=92b6e538241febd1516aca86107c1beba673c533&framework=1

which removes my ability to verify if the regression is visible in that try run.

But maybe there is a profile that may be useful out of it?

Olli, Brian helped me to fish the profiles out of the runs. They're in each "c"'s Job Details.
For example Windows 10 x64 QuantumRender Shippable opt profiles:

I haven't run a parallel try to verify that the talos regression is still there, but I run the same patches. Do you see anything related to this perf issue in those profiles?
If not, I'll redo it rebased on today's master with talos runs to confirm the regression and new profiles.

Smaug suggested to introduce a Destroy method in Localization class which erases mBundles not just on UNLINK and in the destructor, but also from Document::Destroy which happens earlier.
This erased the regression!

master vs. this patchset: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9e20666939f85da8dacc2990bceb34e9fb524859&newProject=try&newRevision=aefe0311d986dfd6b8b4c888ed2147ec3c8c3c42&framework=1

master vs. this patchset + skip non-proto: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9e20666939f85da8dacc2990bceb34e9fb524859&newProject=try&newRevision=c263bcb82dd0241f12204fe3fe18a358556d9aa8&framework=1

The latter is just adding one more change that emulates perf we'd get if we were able to fix bug 1630743. Here's the patch https://hg.mozilla.org/try/rev/8ac6c6e8bf52c3f06cb8cf72af3dbf3cf5495d7d

In other words - the former is the perf if we land this. The latter is the perf if we land this and solve how to actually make us load the JSM lazy in browser.xhtml.

Flags: needinfo?(bugs)
Pushed by zbraniecki@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/90972e768c33 Remove unused observers from JSM. r=stas https://hg.mozilla.org/integration/autoland/rev/7d5a3d1035dc Move resourceIds to Localization C++. r=stas,jfkthame https://hg.mozilla.org/integration/autoland/rev/ee3255e55ab3 Move aIsSync to Localization C++. r=stas,jfkthame https://hg.mozilla.org/integration/autoland/rev/8ade55be1042 Move generateBundles to be stored on Localization C++. r=smaug https://hg.mozilla.org/integration/autoland/rev/a20b9fb485fc Cache bundles in Localization C++. r=jfkthame,smaug https://hg.mozilla.org/integration/autoland/rev/35b4cd4bb12a Lazify JSM initialization. r=smaug

The assertion is: Assertion failure: result->mParticipant == aParticipant (nsCycleCollectionParticipant shouldn't change!)

I haven't looked at your patch, but off the top of my head, one thing that I think could cause that is that you have a cycle collected class A with a parent class B, and in B's ctor, |this| gets AddRefed. At that point, the vtable of the object will be B, so it'll call B's AddRef. But then later it'll have the vtable of A.

Attached patch crash.diffSplinter Review

I got help from :mccr8 and :smaug today on Matrix, but we ended up not being able to find the cause just yet.

I ended up dissecting my patchset until I was able to identify a small change that reliably triggers the issue.

Attached patchset can be applied on top of D71679 (Move aIsSync to Localization C++.) to reproduce the crash in debug mode with ./mach test dom/l10n/tests/mochitest/dom_localization/ --headless.

The crash does not happen on any single test, but it happens when running against the whole directory at the end:

 0:12.42 INFO TEST-START | Shutdown
 0:12.42 INFO Passed:  55
 0:12.42 INFO Failed:  0
 0:12.42 INFO Todo:    0
 0:12.42 INFO Mode:    non-e10s
 0:12.42 INFO Slowest: 1081ms - chrome://mochitests/content/chrome/dom/l10n/tests/mochitest/dom_localization/test_attr_sanitized.html
 0:12.42 INFO SimpleTest FINISHED
 0:12.42 INFO TEST-INFO | Ran 1 Loops
 0:12.42 INFO SimpleTest FINISHED
 0:12.49 GECKO(175168) [175168, Main Thread] WARNING: '!inner', file /projects/mozilla-unified/dom/ipc/JSWindowActorProtocol.cpp, line 172
 0:12.49 GECKO(175168) [175168, Main Thread] WARNING: '!inner', file /projects/mozilla-unified/dom/ipc/JSWindowActorProtocol.cpp, line 172
 0:12.49 GECKO(175168) [175168, Main Thread] WARNING: '!inner', file /projects/mozilla-unified/dom/ipc/JSWindowActorProtocol.cpp, line 172
 0:12.49 GECKO(175168) [175168, Main Thread] WARNING: '!inner', file /projects/mozilla-unified/dom/ipc/JSWindowActorProtocol.cpp, line 172
 0:12.49 GECKO(175168) [175168, Main Thread] WARNING: '!inner', file /projects/mozilla-unified/dom/ipc/JSWindowActorProtocol.cpp, line 172
 0:12.49 GECKO(175168) [175168, Main Thread] WARNING: '!inner', file /projects/mozilla-unified/dom/ipc/JSWindowActorProtocol.cpp, line 172
 0:12.50 GECKO(175168) [175168, Main Thread] WARNING: '!inner', file /projects/mozilla-unified/dom/ipc/JSWindowActorProtocol.cpp, line 172
 0:12.50 GECKO(175168) [175168, Main Thread] WARNING: '!inner', file /projects/mozilla-unified/dom/ipc/JSWindowActorProtocol.cpp, line 172
 0:12.56 GECKO(175168) [175168, Main Thread] WARNING: '!inner', file /projects/mozilla-unified/dom/ipc/JSWindowActorProtocol.cpp, line 172
 0:12.56 GECKO(175168) [175168, Main Thread] WARNING: '!inner', file /projects/mozilla-unified/dom/ipc/JSWindowActorProtocol.cpp, line 172
 0:12.58 GECKO(175168) [175168, Main Thread] WARNING: '!inner', file /projects/mozilla-unified/dom/ipc/JSWindowActorProtocol.cpp, line 172
 0:12.58 GECKO(175168) [175168, Main Thread] WARNING: '!inner', file /projects/mozilla-unified/dom/ipc/JSWindowActorProtocol.cpp, line 172
 0:12.60 GECKO(175168) [175168, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111 (NS_ERROR_NOT_AVAILABLE): file /projects/mozilla-unified/netwerk/cache/nsApplicationCacheService.cpp, line 161
 0:12.76 GECKO(175168) [2020-05-22T02:16:03Z WARN  xulstore::persist] tried to remove key that isn't in the store
 0:12.76 GECKO(175168) [2020-05-22T02:16:03Z WARN  xulstore::persist] tried to remove key that isn't in the store
 0:13.49 GECKO(175168) 1590113764377	Marionette	TRACE	Received observer notification xpcom-will-shutdown
 0:13.49 GECKO(175168) 1590113764377	Marionette	INFO	Stopped listening on port 2828
 0:13.49 GECKO(175168) 1590113764378	Marionette	DEBUG	Marionette stopped listening
 0:13.59 GECKO(175168) [175168, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005 (NS_ERROR_FAILURE): file /projects/mozilla-unified/js/xpconnect/loader/mozJSComponentLoader.cpp, line 228
 0:13.59 GECKO(175168) [175168, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005 (NS_ERROR_FAILURE): file /projects/mozilla-unified/js/xpconnect/loader/mozJSComponentLoader.cpp, line 255
 0:14.88 GECKO(175168) Assertion failure: result->mParticipant == aParticipant (nsCycleCollectionParticipant shouldn't change!), at /projects/mozilla-unified/xpcom/base/nsCycleCollector.cpp:2014
Initializing stack-fixing for the first stack frame, this may take a while...
 0:34.36 GECKO(175168) #01: CCGraphBuilder::AddPurpleRoot(void*, nsCycleCollectionParticipant*) (/projects/mozilla-unified/xpcom/base/nsCycleCollector.cpp:2028)
 0:34.36 GECKO(175168) #02: SelectPointersVisitor::Visit(nsPurpleBuffer&, nsPurpleBufferEntry*) (/projects/mozilla-unified/xpcom/base/nsCycleCollector.cpp:1053)
 0:34.36 GECKO(175168) #03: void nsPurpleBuffer::VisitEntries<SelectPointersVisitor>(SelectPointersVisitor&) (/projects/mozilla-unified/xpcom/base/nsCycleCollector.cpp:957)
 0:34.36 GECKO(175168) #04: nsCycleCollector::BeginCollection(ccType, nsICycleCollectorListener*) (/projects/mozilla-unified/xpcom/base/nsCycleCollector.cpp:1068)
 0:34.36 GECKO(175168) #05: nsCycleCollector::Collect(ccType, js::SliceBudget&, nsICycleCollectorListener*, bool) (/projects/mozilla-unified/xpcom/base/nsCycleCollector.cpp:3440)
 0:34.36 GECKO(175168) #06: nsCycleCollector::ShutdownCollect() (/projects/mozilla-unified/xpcom/base/nsCycleCollector.cpp:3352)
 0:34.36 GECKO(175168) #07: nsCycleCollector::Shutdown(bool) (/projects/mozilla-unified/xpcom/base/nsCycleCollector.cpp:3644)
 0:34.36 GECKO(175168) #08: nsCycleCollector_shutdown(bool) (/projects/mozilla-unified/obj-x86_64-pc-linux-gnu-dbg/dist/include/mozilla/RefPtr.h:66)
 0:34.36 GECKO(175168) #09: mozilla::ShutdownXPCOM(nsIServiceManager*) (/projects/mozilla-unified/xpcom/build/XPCOMInit.cpp:719)
 0:34.36 GECKO(175168) #10: ScopedXPCOMStartup::~ScopedXPCOMStartup() (/projects/mozilla-unified/toolkit/xre/nsAppRunner.cpp:1278)
 0:34.36 GECKO(175168) #11: XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) (/projects/mozilla-unified/obj-x86_64-pc-linux-gnu-dbg/dist/include/mozilla/cxxalloc.h:51)
 0:34.37 GECKO(175168) #12: XRE_main(int, char**, mozilla::BootstrapConfig const&) (/projects/mozilla-unified/toolkit/xre/nsAppRunner.cpp:4807)
 0:34.48 GECKO(175168) #13: main (/projects/mozilla-unified/browser/app/nsBrowserApp.cpp:0)
 0:34.48 GECKO(175168) #14: __libc_start_main (/usr/lib/libc.so.6 + 0x27002)
 0:34.48 GECKO(175168) #15: _start (/projects/mozilla-unified/obj-x86_64-pc-linux-gnu-dbg/dist/bin/firefox + 0x3902e)
 0:34.48 GECKO(175168) #16: ??? (???:???)
 0:34.48 GECKO(175168) ExceptionHandler::GenerateDump cloned child 175396
 0:34.48 GECKO(175168) ExceptionHandler::SendContinueSignalToChild sent continue signal to child
 0:34.48 GECKO(175168) ExceptionHandler::WaitForContinueSignal waiting for continue signal...
 0:34.49 Main app process: exit 11
 0:34.49 INFO Buffered messages finished
 0:34.49 ERROR TEST-UNEXPECTED-FAIL | Last test finished | application terminated with exit code 11
 0:34.49 INFO runtests.py | Application ran for: 0:00:32.086903
 0:34.49 INFO zombiecheck | Reading PID log: /tmp/tmpkIOg41pidlog
 0:34.49 INFO ==> process 175168 launched child process 175179
 0:34.50 INFO zombiecheck | Checking for orphan process with PID: 175179
 0:34.50 CRASH: pid:175063. Test:Last test finished. Minidump analysed:False. Signature:[None]
Mozilla crash reason: MOZ_ASSERT(result->mParticipant == aParticipant) (nsCycleCollectionParticipant shouldn't change!)
Crash dump filename: /tmp/tmpX4VMLo.mozrunner/minidumps/54ed273b-aa3c-e3ae-58a1-e42f99a53513.dmp
MINIDUMP_STACKWALK not set, can't process dump.

 0:34.50 INFO Stopping web server
 0:34.50 INFO Stopping web socket server
 0:34.52 INFO Stopping ssltunnel
 0:34.54 INFO leakcheck | Processing log file /tmp/tmpX4VMLo.mozrunner/runtests_leaks.log
 0:34.54 INFO TEST-INFO | leakcheck | default process: leak threshold set at 0 bytes
 0:34.54 INFO TEST-INFO | leakcheck | forkserver process: leak threshold set at 0 bytes
 0:34.54 INFO TEST-INFO | leakcheck | gmplugin process: leak threshold set at 20000 bytes
 0:34.54 INFO TEST-INFO | leakcheck | gpu process: leak threshold set at 0 bytes
 0:34.54 INFO TEST-INFO | leakcheck | plugin process: leak threshold set at 0 bytes
 0:34.54 INFO TEST-INFO | leakcheck | rdd process: leak threshold set at 400 bytes
 0:34.54 INFO TEST-INFO | leakcheck | socket process: leak threshold set at 0 bytes
 0:34.54 INFO TEST-INFO | leakcheck | tab process: leak threshold set at 0 bytes
 0:34.54 INFO TEST-INFO | leakcheck | vr process: leak threshold set at 0 bytes
 0:34.55 INFO leakcheck | Processing leak log file /tmp/tmpX4VMLo.mozrunner/runtests_leaks.log
 0:34.55 FAIL leakcheck: default missing output line for total leaks!

 0:34.55 INFO runtests.py | Running tests: end.
 0:34.55 INFO Buffered messages finished
0 INFO TEST-START | Shutdown
1 INFO Passed:  55
2 INFO Failed:  0
3 INFO Todo:    0
4 INFO Mode:    non-e10s
5 INFO SimpleTest FINISHED
 0:34.55 INFO Buffered messages finished
 0:34.55 SUITE_END
 0:34.55 
Overall Summary
===============

mochitest-chrome
~~~~~~~~~~~~~~~~
Ran 89 checks (17 asserts, 55 subtests, 17 tests)
Expected results: 89
Unexpected results: 0

Unexpected Results
------------------
ERROR TEST-UNEXPECTED-FAIL | Last test finished | application terminated with exit code 11

The exact point of when the crash become reproducable are the changes in Localization::Activate. In other words, just adding mGenerateBundles and mGenerateBundlesSync to the class and hooking them into CC doesn't trigger it.
But populating them in Activate triggers the crash.

I don't know what it means exactly and how to address it. I suspect some changes are required for DOMLocalization and/or DocumentL10n to handle those fields in them, but looking at other examples of class inheritance with fields doesn't help and I couldn't find another class with JS::Heap<JS::Value> field that extends another class.

I hope that with that minimized testcase :mccr8 and :smaug will be able to help me.

Flags: needinfo?(gandalf)
Flags: needinfo?(continuation)
Flags: needinfo?(bugs)

(In reply to Andrew McCreight [:mccr8] from comment #22)

I haven't looked at your patch, but off the top of my head, one thing that I think could cause that is that you have a cycle collected class A with a parent class B, and in B's ctor, |this| gets AddRefed. At that point, the vtable of the object will be B, so it'll call B's AddRef. But then later it'll have the vtable of A.

Isn't the problem the call to |mozilla::HoldJSObjects(this)| in the Localization ctor? Seems to me like the QI that HoldJSObjects does to get the CC participant will not return the right participant at that point if the this object is a DOMLocalization (which inherits from Localization). Not really sure why we can't just call HoldJSObjects when we actually get JS objects to hold in Localization::Activate instead.

Good catch. I forgot that that QI is done eagerly. I'd guess that there aren't any other non-wrapper-cached nsISupports JS-holding CCed classes. In general, I think it is nice to unconditionally call hold in the ctor so you don't forget it, but it seems like a reasonable workaround here.

Thank you Peter! That was it!

I moved the HoldJSObjects to Activate and them moved RegisterObservers to Init.

I kept the bool Init in Localization so that I don't have to mess with signatures when overriding it in DocumentL10n. Hope that's ok.

Flags: needinfo?(continuation)
Flags: needinfo?(bugs)

Here's try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3546a98004283bc37b5ef57cdea500d304d478fe

Changes since May 15th:

  • Switched Localization to use NS_IMPL_CYCLE_COLLECTION_MULTI_ZONE_JSHOLDER_CLASS for mBundles
  • Moved mozilla::HoldJSObjects(this); to Localization::Activate
  • Created Localization::Init and Localization::Create and moved RegisterObservers there

If I understand correctly, I need the MULTI_ZONE here because mBundles is an iterator over bundles that are referencing FluentResource that is assigned to L10nRegistry (singleton per process) and can be used in multiple documents.

I see two new regressions:

  • browser/base/content/test/performance/browser_windowopen.js - started looking into it and it's coming up early. In D71677. I suspect something about difference in triggering onChange in result of Jonathan's feedback.
  • browser/components/syncedtabs/test/browser/browser_sidebar_syncedtabslist.js - seems like we changed the promise rejection in some place. Will investigate more tomorrow.
Attached file onchange.diff (obsolete) —

A new regression, that I again need help with. :(

Running the anti-flicker check test in debug mode with this little patch that should be a no-op, leads to flicker regressions.

All the patch is changing is moving the call to onChange from happening in 4 add/removeResourceId(s) calls in JSM, to the same 4 calls in C++ (which call JSM's onChange).
This should be a complete no-op. Nothing else calls those JSM functions as far as I understand, since they're not exposed (nothing uses mozILocalization except of Localization.cpp) and it should work exactly the same. But it isn't, it seems to cause flickers.

To reproduce, apply this patch on top of central and run:

./mach test browser/base/content/test/performance/browser_windowopen.js --headless

The current result:

Unexpected Results
------------------
browser/base/content/test/performance/browser_windowopen.js
  FAIL unexpected changed rect: ({x1:325, x2:475, y1:47, y2:60, w:151, h:14}), window width: 1366 - 
Stack trace:
chrome://mochikit/content/browser-test.js:test_ok:1299
chrome://mochitests/content/browser/browser/base/content/test/performance/head.js:reportUnexpectedFlicker/rects<:611
chrome://mochitests/content/browser/browser/base/content/test/performance/head.js:reportUnexpectedFlicker:602
chrome://mochitests/content/browser/browser/base/content/test/performance/head.js:withPerfObserver:671
chrome://mochitests/content/browser/browser/base/content/test/performance/browser_windowopen.js:null:106
chrome://mochikit/content/browser-test.js:Tester_execTest/<:1064
chrome://mochikit/content/browser-test.js:Tester_execTest:1104
chrome://mochikit/content/browser-test.js:nextTest/<:927
chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:SimpleTest.waitForFocus/waitForFocusInner/focusedOrLoaded/<:918
  FAIL unexpected changed rect: ({x1:481, x2:537, y1:47, y2:57, w:57, h:11}), window width: 1366 - 
Stack trace:
chrome://mochikit/content/browser-test.js:test_ok:1299
chrome://mochitests/content/browser/browser/base/content/test/performance/head.js:reportUnexpectedFlicker/rects<:611
chrome://mochitests/content/browser/browser/base/content/test/performance/head.js:reportUnexpectedFlicker:602
chrome://mochitests/content/browser/browser/base/content/test/performance/head.js:withPerfObserver:671
chrome://mochitests/content/browser/browser/base/content/test/performance/browser_windowopen.js:null:106
chrome://mochikit/content/browser-test.js:Tester_execTest/<:1064
chrome://mochikit/content/browser-test.js:Tester_execTest:1104
chrome://mochikit/content/browser-test.js:nextTest/<:927
chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:SimpleTest.waitForFocus/waitForFocusInner/focusedOrLoaded/<:918
  FAIL should have 0 unknown flickering areas - Got 2, expected 0
Stack trace:
chrome://mochikit/content/browser-test.js:test_is:1327
chrome://mochitests/content/browser/browser/base/content/test/performance/head.js:reportUnexpectedFlicker:628
chrome://mochitests/content/browser/browser/base/content/test/performance/head.js:withPerfObserver:671
chrome://mochitests/content/browser/browser/base/content/test/performance/browser_windowopen.js:null:106
chrome://mochikit/content/browser-test.js:Tester_execTest/<:1064
chrome://mochikit/content/browser-test.js:Tester_execTest:1104
chrome://mochikit/content/browser-test.js:nextTest/<:927
chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:SimpleTest.waitForFocus/waitForFocusInner/focusedOrLoaded/<:918

I suspected that maybe calling JSM twice has something to do with it, but a broader test where I moved resourceIds to CPP and called JSM only via onChange (so, one call per add/removeResourceId(s), kept reproducing this regression.

I spent 2 days now reproducing different variants of the same regression with minimized testcases and I'm a bit out of ideas on what may be causing it and would like to ask for some sanity check on how this can have any effect on racing, timing, and in result flickering.
Hoping that Peter or Olli will notice where am I making a mistake.

Flags: needinfo?(peterv)
Flags: needinfo?(bugs)
Attached patch onchange.diffSplinter Review

Actually, even a smaller, two line patch, reproduces it!

Attachment #9152281 - Attachment is obsolete: true

The second orange also seems to be related to the OnChange switch. It only starts being reproducible after D71815 is applied, but the stack is:

0:06.32 GECKO(199110) [Parent 199110, Main Thread] ###!!! ASSERTION: Want to fire DOMNodeRemoved event, but it's not safe: 'Error', file /projects/mozilla-unified/dom/base/nsContentUtils.cpp, line 4469
Initializing stack-fixing for the first stack frame, this may take a while...
 0:07.83 INFO runtests.py | Waiting for browser...
 0:26.11 GECKO(199110) #01: nsINode::RemoveChild(nsINode&, mozilla::ErrorResult&) (/projects/mozilla-unified/dom/base/nsINode.cpp:777)
 0:26.12 GECKO(199110) #02: mozilla::dom::L10nOverlays::GetNodeForNamedElement(mozilla::dom::Element*, mozilla::dom::Element*, nsTArray<mozilla::dom::L10nOverlaysError>&, mozilla::ErrorResult&) (/projects/mozilla-unified/dom/l10n/L10nOverlays.cpp:279)
 0:26.12 GECKO(199110) #03: mozilla::dom::L10nOverlays::OverlayChildNodes(mozilla::dom::DocumentFragment*, mozilla::dom::Element*, nsTArray<mozilla::dom::L10nOverlaysError>&, mozilla::ErrorResult&) (/projects/mozilla-unified/obj-x86_64-pc-linux-gnu-dbg/dist/include/mozilla/AlreadyAddRefed.h:146)
 0:26.12 GECKO(199110) #04: mozilla::dom::L10nOverlays::TranslateElement(mozilla::dom::Element&, mozilla::dom::L10nMessage const&, nsTArray<mozilla::dom::L10nOverlaysError>&, mozilla::ErrorResult&) (/projects/mozilla-unified/obj-x86_64-pc-linux-gnu-dbg/dist/include/mozilla/ErrorResult.h:434)
 0:26.12 GECKO(199110) #05: mozilla::dom::DOMLocalization::ApplyTranslations(nsTArray<nsCOMPtr<mozilla::dom::Element> >&, nsTArray<mozilla::dom::Nullable<mozilla::dom::L10nMessage> >&, nsXULPrototypeDocument*, mozilla::ErrorResult&) (/projects/mozilla-unified/obj-x86_64-pc-linux-gnu-dbg/dist/include/mozilla/ErrorResult.h:434)
 0:26.13 GECKO(199110) #06: mozilla::dom::DOMLocalization::TranslateElements(mozilla::dom::Sequence<mozilla::OwningNonNull<mozilla::dom::Element> > const&, nsXULPrototypeDocument*, mozilla::ErrorResult&) (/projects/mozilla-unified/dom/l10n/DOMLocalization.cpp:324)
 0:26.13 GECKO(199110) #07: mozilla::dom::DOMLocalization::TranslateFragment(nsINode&, mozilla::ErrorResult&) (/projects/mozilla-unified/dom/l10n/DOMLocalization.cpp:155)
 0:26.13 GECKO(199110) #08: mozilla::dom::DOMLocalization::TranslateRoots(mozilla::ErrorResult&) (/projects/mozilla-unified/obj-x86_64-pc-linux-gnu-dbg/dist/include/mozilla/AlreadyAddRefed.h:146)
 0:26.13 GECKO(199110) #09: mozilla::dom::DOMLocalization::OnChange() (/projects/mozilla-unified/obj-x86_64-pc-linux-gnu-dbg/dist/include/mozilla/AlreadyAddRefed.h:146)
 0:26.13 GECKO(199110) #10: mozilla::intl::Localization::AddResourceId(nsTSubstring<char16_t> const&) (/projects/mozilla-unified/intl/l10n/Localization.cpp:184)
 0:26.13 GECKO(199110) #11: mozilla::dom::Document::LocalizationLinkAdded(mozilla::dom::Element*) (/projects/mozilla-unified/dom/base/Document.cpp:0)
 0:26.13 GECKO(199110) #12: mozilla::dom::HTMLLinkElement::BindToTree(mozilla::dom::BindContext&, nsINode&) (/projects/mozilla-unified/dom/html/HTMLLinkElement.cpp:134)
 0:26.13 GECKO(199110) #13: nsINode::InsertChildBefore(nsIContent*, nsIContent*, bool) (/projects/mozilla-unified/dom/base/nsINode.cpp:1520)
 0:26.13 GECKO(199110) #14: nsINode::ReplaceOrInsertBefore(bool, nsINode*, nsINode*, mozilla::ErrorResult&) (/projects/mozilla-unified/dom/base/nsINode.cpp:2662)
 0:26.13 GECKO(199110) #15: mozilla::dom::Node_Binding::appendChild(JSContext*, JS::Handle<JSObject*>, void*, JSJitMethodCallArgs const&) (/projects/mozilla-unified/obj-x86_64-pc-linux-gnu-dbg/dom/bindings/NodeBinding.cpp:990)
 0:26.13 GECKO(199110) #16: bool mozilla::dom::binding_detail::GenericMethod<mozilla::dom::binding_detail::NormalThisPolicy, mozilla::dom::binding_detail::ThrowExceptions>(JSContext*, unsigned int, JS::Value*) (/projects/mozilla-unified/dom/bindings/BindingUtils.cpp:3194)
 0:26.14 GECKO(199110) #17: CallJSNative(JSContext*, bool (*)(JSContext*, unsigned int, JS::Value*), js::CallReason, JS::CallArgs const&) (/projects/mozilla-unified/js/src/vm/Interpreter.cpp:493)
 0:26.14 GECKO(199110) #18: js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) (/projects/mozilla-unified/js/src/vm/Interpreter.cpp:585)
 0:26.14 GECKO(199110) #19: js::jit::DoCallFallback(JSContext*, js::jit::BaselineFrame*, js::jit::ICCall_Fallback*, unsigned int, JS::Value*, JS::MutableHandle<JS::Value>) (/projects/mozilla-unified/js/src/jit/BaselineIC.cpp:2990)
 0:26.14 GECKO(199110) #20: ??? (???:???)

So my hope is that both oranges are the same issue - the change to OnChange which can be reproduced with the minimal testacase, makes us fire the translation via JSM's OnChange at a different moment. The first failure in browser_windowopen.js is reproducible with the minimized two-line testcase,
the latter requires the patch-stack but shows that with the patchset we fire OnChange at the time when we shouldn't.

I don't understand how the two-liner causes such a drastic change in the even loop.

That assertion shows a rather bad issue. Something trying to modify DOM inside BindToTree. Need to move the code doing the modification to happen later, possibly using AddScriptRunner.

Flags: needinfo?(bugs)

That assertion shows a rather bad issue. Something trying to modify DOM inside BindToTree. Need to move the code doing the modification to happen later, possibly using AddScriptRunner.

I understand that, but can you tell me why this two-line patch https://bugzilla.mozilla.org/attachment.cgi?id=9152286 applied on master causes that?
It's very confusing to me because I don't think it should change anything. And if it changes some timing, then I don't understand why, which makes it very hard to work with this code.

Why did we not need the AddScriptRunner before? Is it possible that this patch regresses that, or is it more likely that the bug existed before but with this change it makes it be reported?
I'm afraid of playing with event loop and the microtasks between JS and CPP not understanding why my changes affect that.

Flags: needinfo?(bugs)

I found it. Both bugs were caused by the OnChange change which made me actually call DOMLocalization::OnChange rather than Localization::OnChange from Add/RemoveResourceId(s).

Switching OnChange to Localization::OnChange fixed both.

I'm sorry for bothering you Peter and Olli so much, I was assuming it's some unknown in the microtask system around DOM and XPCOM, but it's just my subclassing. I triggered new try https://treeherder.mozilla.org/#/jobs?repo=try&revision=c87c07a228e6a4b1ebf521293a889c231815d82f

If it comes green, I'll be able to land.

Thank you!

Flags: needinfo?(peterv)
Flags: needinfo?(bugs)

The remaining orange:

TEST-UNEXPECTED-FAIL | browser/components/syncedtabs/test/browser/browser_sidebar_syncedtabslist.js | A promise chain failed to handle a rejection: Encountered unsupported value type writing stack-scoped structured clone - stack: (No stack available.)

was caused by us calling Destroy on DocumentL10n which set mBundles to undefined, but didn't cancel the mutation observer, so mutations that happened after the document started being destroyed were triggering localization with empty bundles and since I didn't think its possible I removed the if (!bundles) check in formatWithFallback and the error was from for async (let bundle of bundles) on undefined which cannot produces an error that is not supported.

I added that check back so that the error is a string and added DOMLocalization::Destroy which removed the mutation observer.

Pushed by zbraniecki@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ea777b1a9380 Remove unused observers from JSM. r=stas https://hg.mozilla.org/integration/autoland/rev/90bde53721ac Move resourceIds to Localization C++. r=stas,jfkthame https://hg.mozilla.org/integration/autoland/rev/77ff0aee23f6 Move aIsSync to Localization C++. r=stas,jfkthame https://hg.mozilla.org/integration/autoland/rev/7fbc8e2c4a4a Move generateBundles to be stored on Localization C++. r=smaug https://hg.mozilla.org/integration/autoland/rev/74b6fa2a1409 Cache bundles in Localization C++. r=jfkthame,smaug https://hg.mozilla.org/integration/autoland/rev/f176af77900a Lazify JSM initialization. r=smaug

The issue is that aboutNetError.js uses async localization to translate some of the strings, and those may end up applying the translations later.

It doesn't affect visuals, because the strings are hidden by default, but it may affect the tests because the tests don't wait for the async functions to fire the AboutNetErrorLoad event.
Some of those async calls can be avoided and I filed bug 1642725 for that, but to unblock this patchset, I'd like to make the tests wait for l10n to be applied just like we do in L10n tests - https://searchfox.org/mozilla-central/search?q=verifyL10n&case=true&path=

Flags: needinfo?(gandalf)
Attachment #9153686 - Attachment description: Bug 1631593 - Wait for async localization before testing the value in aboutCertError_exception test. → Bug 1631593 - Wait for async localization in aboutNetError tests.
Pushed by zbraniecki@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/af4747c3599d Remove unused observers from JSM. r=stas https://hg.mozilla.org/integration/autoland/rev/b2be75c1d585 Move resourceIds to Localization C++. r=stas,jfkthame https://hg.mozilla.org/integration/autoland/rev/c2e9a178e0a3 Move aIsSync to Localization C++. r=stas,jfkthame https://hg.mozilla.org/integration/autoland/rev/6a395d69e5d2 Move generateBundles to be stored on Localization C++. r=smaug https://hg.mozilla.org/integration/autoland/rev/de37ec69840f Cache bundles in Localization C++. r=jfkthame,smaug https://hg.mozilla.org/integration/autoland/rev/4697d1ca4081 Lazify JSM initialization. r=smaug https://hg.mozilla.org/integration/autoland/rev/186da53817cc Wait for async localization in aboutNetError tests. r=johannh
See Also: → 1639354
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: