Bug 1586105 (Open) · Opened 5 months ago · Updated 3 months ago

Intermittent fission tools/profiler/tests/browser/browser_test_profile_single_frame_page_info.js | This test exceeded the timeout threshold. It should be rewritten or split up. If that's not possible, use requestLongerTimeout(N), but only as a last resort

Categories

(Core :: Gecko Profiler, defect, P5)


Tracking


Fission Milestone M4.1

People

(Reporter: intermittent-bug-filer, Assigned: julienw)

References

(Blocks 2 open bugs)

Details

(Keywords: intermittent-failure, regression, Whiteboard: [comment 28][Comment 33][stockwell fixed:patch])

Attachments

(2 files, 1 obsolete file)

Filed by: csabou [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=269659405&repo=autoland
Full log: https://queue.taskcluster.net/v1/task/LbeYed7QQQCtber6mDqmXg/runs/0/artifacts/public/logs/live_backing.log


[task 2019-10-03T18:54:42.268Z] 18:54:42 INFO - TEST-OK | tools/profiler/tests/browser/browser_test_profile_replacestate_page_info.js | took 45421ms
[task 2019-10-03T18:54:42.288Z] 18:54:42 INFO - GECKO(10690) | ++DOCSHELL 0x7fb3f535d800 == 1 [pid = 11163] [id = {241310fd-fac5-41a8-9706-63ef55bdaaa7}]
[task 2019-10-03T18:54:42.344Z] 18:54:42 INFO - checking window state
[task 2019-10-03T18:54:42.385Z] 18:54:42 INFO - TEST-START | tools/profiler/tests/browser/browser_test_profile_single_frame_page_info.js
[task 2019-10-03T18:54:42.486Z] 18:54:42 INFO - GECKO(10690) | ++DOMWINDOW == 1 (0x7fb3f5d27f20) [pid = 11163] [serial = 1] [outer = (nil)]
[task 2019-10-03T18:54:42.501Z] 18:54:42 INFO - GECKO(10690) | ++DOMWINDOW == 2 (0x7fb3f52d9000) [pid = 11163] [serial = 2] [outer = 0x7fb3f5d27f20]
[task 2019-10-03T18:54:42.663Z] 18:54:42 INFO - GECKO(10690) | nsStringStats
[task 2019-10-03T18:54:42.664Z] 18:54:42 INFO - GECKO(10690) | => mAllocCount: 57535
[task 2019-10-03T18:54:42.664Z] 18:54:42 INFO - GECKO(10690) | => mReallocCount: 0
[task 2019-10-03T18:54:42.664Z] 18:54:42 INFO - GECKO(10690) | => mFreeCount: 57535
[task 2019-10-03T18:54:42.664Z] 18:54:42 INFO - GECKO(10690) | => mShareCount: 14544
[task 2019-10-03T18:54:42.665Z] 18:54:42 INFO - GECKO(10690) | => mAdoptCount: 325
[task 2019-10-03T18:54:42.666Z] 18:54:42 INFO - GECKO(10690) | => mAdoptFreeCount: 329
[task 2019-10-03T18:54:42.666Z] 18:54:42 INFO - GECKO(10690) | => Process ID: 11134, Thread ID: 139626598516544
[task 2019-10-03T18:54:43.003Z] 18:54:43 INFO - GECKO(10690) | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to /tmp/tmpXNiNGk.mozrunner/runtests_leaks_tab_pid11195.log
[task 2019-10-03T18:54:50.479Z] 18:54:50 INFO - GECKO(10690) | --DOCSHELL 0x7fa6a1e5d000 == 0 [pid = 11100] [id = {082ea190-a6a8-46c5-8b46-a3ff3be5e5cf}] [url = about:blank]
[task 2019-10-03T18:54:54.791Z] 18:54:54 INFO - GECKO(10690) | --DOMWINDOW == 1 (0x7fa6a2824f20) [pid = 11100] [serial = 1] [outer = (nil)] [url = about:blank]
[task 2019-10-03T18:54:57.527Z] 18:54:57 INFO - GECKO(10690) | Couldn't convert chrome URL: chrome://branding/locale/brand.properties
[task 2019-10-03T18:54:58.776Z] 18:54:58 INFO - GECKO(10690) | --DOMWINDOW == 0 (0x7fa6a1dd9000) [pid = 11100] [serial = 2] [outer = (nil)] [url = about:blank]
[task 2019-10-03T18:54:58.993Z] 18:54:58 INFO - GECKO(10690) | ++DOCSHELL 0x7f6367f5a800 == 1 [pid = 11195] [id = {9b92f927-a7c5-478c-b01a-66b18be4b694}]
[task 2019-10-03T18:54:59.814Z] 18:54:59 INFO - GECKO(10690) | ++DOMWINDOW == 1 (0x7f6367f3bf20) [pid = 11195] [serial = 1] [outer = (nil)]
[task 2019-10-03T18:54:59.830Z] 18:54:59 INFO - GECKO(10690) | ++DOMWINDOW == 2 (0x7f6367ed9000) [pid = 11195] [serial = 2] [outer = 0x7f6367f3bf20]
[task 2019-10-03T18:55:02.734Z] 18:55:02 INFO - GECKO(10690) | ++DOMWINDOW == 3 (0x7f63689af800) [pid = 11195] [serial = 3] [outer = 0x7f6367f3bf20]
[task 2019-10-03T18:55:22.233Z] 18:55:22 INFO - GECKO(10690) | --DOCSHELL 0x7f6367f5a800 == 0 [pid = 11195] [id = {9b92f927-a7c5-478c-b01a-66b18be4b694}] [url = http://example.com/browser/tools/profiler/tests/browser/single_frame.html]
[task 2019-10-03T18:55:48.860Z] 18:55:48 INFO - GECKO(10690) | --DOMWINDOW == 2 (0x7f6367ed9000) [pid = 11195] [serial = 2] [outer = (nil)] [url = about:blank]
[task 2019-10-03T18:55:48.961Z] 18:55:48 INFO - GECKO(10690) | --DOMWINDOW == 1 (0x7f6367f3bf20) [pid = 11195] [serial = 1] [outer = (nil)] [url = http://example.com/browser/tools/profiler/tests/browser/single_frame.html]
[task 2019-10-03T18:55:52.952Z] 18:55:52 INFO - GECKO(10690) | --DOMWINDOW == 0 (0x7f63689af800) [pid = 11195] [serial = 3] [outer = (nil)] [url = http://example.com/browser/tools/profiler/tests/browser/single_frame.html]
[task 2019-10-03T18:56:22.815Z] 18:56:22 INFO - GECKO(10690) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x7A0003,name=PProfiler::Msg_Stop) Channel closing: too late to send/recv, messages will be lost
[task 2019-10-03T18:56:22.836Z] 18:56:22 INFO - GECKO(10690) | [Child 11195, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/workspace/build/src/ipc/glue/ProtocolUtils.cpp, line 481
[task 2019-10-03T18:56:22.902Z] 18:56:22 INFO - TEST-INFO | started process screentopng
[task 2019-10-03T18:56:23.416Z] 18:56:23 INFO - TEST-INFO | screentopng: exit 0
[task 2019-10-03T18:56:23.416Z] 18:56:23 INFO - Buffered messages logged at 18:54:42
[task 2019-10-03T18:56:23.416Z] 18:56:23 INFO - Entering test bound test_profile_single_frame_page_info
[task 2019-10-03T18:56:23.417Z] 18:56:23 INFO - TEST-PASS | tools/profiler/tests/browser/browser_test_profile_single_frame_page_info.js | true == true -

Flags: needinfo?(canaltinova)

I don't really understand how the change to session history purging would influence the profiler test; it doesn't look like this test involves session history at all. Some of the previous tests in the same directory use page.historyID, which sounds vaguely related, but even so, purging session history only happens through https://searchfox.org/mozilla-central/rev/01d1011ca4a460f751da030d455d35c267c3e210/toolkit/components/cleardata/ClearDataService.jsm#779, I think... My other guess would be that we're simply hitting the timeout threshold, which we could confirm by increasing it.

Oh, and in terms of impact: because the new actor has the allFrames attribute, which the previous one didn't, it might now be getting created for subframes in one of the previous tests in this directory. That could make GC take longer, and that GC might then run during this test and affect the timings. Or something. But my understanding is that JSWindowActors are created lazily, so even this seems implausible. Hopefully Nazım has better ideas.
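
(For context, allFrames is one of the JSWindowActor registration options; a minimal sketch of what such a registration looks like, with a made-up actor name and module URIs:)

  // Hypothetical actor registration, for illustration only.
  ChromeUtils.registerWindowActor("ExampleActor", {
    parent: { moduleURI: "resource://testing-common/ExampleActorParent.jsm" },
    child: {
      moduleURI: "resource://testing-common/ExampleActorChild.jsm",
      events: { DOMWindowCreated: {} },
    },
    // With allFrames: true, actors can be created for subframes as well,
    // not just the top-level frame; instantiation is still lazy.
    allFrames: true,
  });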

Flags: needinfo?(gijskruitbosch+bugs)

FWIW, I'm currently changing the profiler PageInformation keys (docShellId and historyId) and their registration in Bug 1583271 and Bug 1512500. Those patches also change this test. The patches are up for review right now and we will probably land them soon, but I'm not sure whether that's going to fix the problem.

Also, we previously had to split these tests because they hit the timeout threshold. It's possible we are hitting that threshold again, but I'm not sure what's causing it. From the logs, either new tab creation (BrowserTestUtils.withNewTab) or profile capturing (Services.profiler.getProfileDataAsync) is taking most of the time, but I can't tell which one. Since it's a Fission failure, the profiler may be taking more time to capture profiles because there are more processes to capture from.
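
(One way to tell the two apart would be to time each phase separately; a rough sketch, not from the actual test, with the timing lines being my addition:)

  // Sketch: time tab creation and profile capture separately to see which
  // phase dominates. The URL matches the test's page; the timers are mine.
  let start = Date.now();
  await BrowserTestUtils.withNewTab(
    "http://example.com/browser/tools/profiler/tests/browser/single_frame.html",
    async () => {
      info(`opening the tab took ${Date.now() - start}ms`);
      start = Date.now();
      await Services.profiler.getProfileDataAsync();
      info(`getProfileDataAsync took ${Date.now() - start}ms`);
    }
  );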

I just checked a few autoland Treeherder test runs, and it looks like the test takes around 60-80 seconds there. My try run with the patches I mentioned takes around 8 seconds. That's a dramatic change, but it may just be luck. I will trigger more try runs with these patches to see whether we are really reducing it. If so, I think we can wait until those patches land and then see whether this intermittent issue continues.

Flags: needinfo?(canaltinova)

I landed the previously mentioned patches and they didn't quite solve the issue. We also got another intermittent failure recently (see bug 1587907). I created a patch in that bug that changes the test cases; I believe it's going to solve this one as well.

Depends on: 1587907

That's a big spike yesterday!
Hey Gerald, could that be related to your recent changes?
Or maybe Nazım's changes?

Flags: needinfo?(gsquelart)
Flags: needinfo?(canaltinova)

I can't think of a recent bug of mine that could have caused this:

  1. Bug 1540340 should prevent deadlocks with IO Interposer, but I don't think it's enabled in this test(?)
  2. Bug 1579333 removed doAtLeastOnePeriodicSample(), but it was not used in this test.
  3. Bug 1584190 was just a JSON format change.

1 and 2 could still be dangerous though, so if there are no better suspects, we could try reverting or debugging them...

It'd be nice if the timeout could trigger a crash, so we'd get a clue about where it's stuck.

Nazım, what about your recent changes?

Flags: needinfo?(gsquelart)

I'm not sure what's going on here exactly... I landed two changes recently that can affect this.

  1. Bug 1583271 changed the IDs of the page information. But it landed on October 10, and this intermittent failure only started to become more frequent 7 days ago, so I don't think it's related.
  2. Bug 1587907 changed these tests to make them faster. But maybe it failed to do that? That change landed on October 22, and from the graph it looks like this failure started to increase on October 31 (though we started to see a little of it on October 24).

So it might be because of my second bug. But I can't tell what's happening from the logs; everything seems normal, it's just taking more time. Maybe we should try requestLongerTimeout?
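
(For reference, that would be a one-liner at the top of the test file; the factor is arbitrary:)

  // Multiply this test's timeout budget by 2; requestLongerTimeout() is
  // available as a global in browser-chrome mochitests.
  requestLongerTimeout(2);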

Flags: needinfo?(canaltinova)
Pushed by canaltinova@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/d30903688da6
Requesting the complete log to be able to debug intermittent timeout. r=julienw

After chatting with Julien, we wanted to request the complete log so we can debug those failures. The landed patch doesn't fix the problem itself, so I'm adding the leave-open keyword to prevent the bug from being closed.

Keywords: leave-open
Whiteboard: [stockwell disable-recommended]

It's weird: in browser_test_profile_multi_frame_page_info it takes about 40 seconds to capture the profile, but in browser_test_profile_single_frame_page_info it takes about 1 min 15 s. This isn't normal.
Also, I think we're only missing a few seconds for that test to succeed, because we see all the checks and even "Leaving test bound test_profile_single_frame_page_info" before the failure.

Ideally I'd like to understand why we take significantly longer to capture the profile. But we can also use requestLongerTimeout as a quick stopgap if this is needed.

I have the same problem locally.
In browser_test_profile_multi_frame_page_info it takes 120 ms, while in browser_test_profile_single_frame_page_info it takes 1.7 s.
I don't have the problem if I only run browser_test_profile_single_frame_page_info, only when they're run in sequence.
I couldn't get mach test to run them in the other order to see whether that produces a different result.

I see the tests aren't pausing the profiler before capturing; should we do that to make things faster?

Gerald, should the starting/capturing/stopping sequence take a constant time, or is it possible that doing it twice results in the second capture taking more time? Maybe the internal buffer is in a different state in that case?

I'll try to dump the profile we get during this capture phase and see if there's a clue in there. I'm not optimistic though, because the capture itself takes time and I doubt we'll see that in the profile. Maybe using Linux perf...

Flags: needinfo?(gsquelart)
Whiteboard: [stockwell disable-recommended]

This is the profile we capture in this file:
https://perfht.ml/2WPnefi

Not sure what we can derive from it, but we can definitely see that it takes time. Also, why do we have data for a new process (web content 6) but not for the existing one?

Should we disable this test on Linux debug in the meantime?

Flags: needinfo?(felash)

Note that this is debug + fission, not just debug. I would be sad to disable it for all debug setups.

I think we can try SimpleTest.requestLongerTimeout(2), but I won't be able to do it before Tuesday next week. Happy to review a patch here.

Flags: needinfo?(felash)

Hi Julien, since this has reached the disable-recommended queue and it's failing on Fission, I disabled it on all Fission configurations (not all debug) to reduce the failures on the trees. I added you as the reviewer.

The test can be enabled on Fission again when a fix is ready.

Thanks.

Whiteboard: [stockwell disable-recommended] → [comment 28]

(In reply to Julien Wajsberg [:julienw] from comment #22)

> I have the same problem locally.
> In browser_test_profile_multi_frame_page_info it takes 120 ms, while in browser_test_profile_single_frame_page_info it takes 1.7 s.
> I don't have the problem if I only run browser_test_profile_single_frame_page_info, only when they're run in sequence.
> I couldn't get mach test to run them in the other order to see whether that produces a different result.

> I see the tests aren't pausing the profiler before capturing; should we do that to make things faster?

Capturing the profile should block the sampler, so pausing wouldn't help.
But markers would still get written, possibly causing minor blockages, so pausing may be beneficial; that remains to be tested. But then it would have the same impact after each test.
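
(If pausing does turn out to help, the test-side change would presumably look something like this sketch, assuming nsIProfiler's Pause()/Resume() and that capturing while paused is supported:)

  // To be validated: pause sampling before the capture so the sampler and
  // marker writes don't compete with it, then resume afterwards.
  Services.profiler.Pause();
  let profile = await Services.profiler.getProfileDataAsync();
  Services.profiler.Resume();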

> Gerald, should the starting/capturing/stopping sequence take a constant time, or is it possible that doing it twice results in the second capture taking more time? Maybe the internal buffer is in a different state in that case?

I can't think why a 2nd profiler run should be slower: when the profiler is stopped, the buffer and sampler should be destroyed, so there "should" be no effect on the next run.

> I'll try to dump the profile we get during this capture phase and see if there's a clue in there. I'm not optimistic though, because the capture itself takes time and I doubt we'll see that in the profile. Maybe using Linux perf...

Yes, using an external profiler can be a good way to profile the profiler!
Or as a first step we could add some logging during the capture, to verify that the slow-down is actually there and measure it.
Since we need to disable tests here for now, I'll open a follow-up bug to explore further...

Flags: needinfo?(gsquelart)
Assignee: nobody → felash
Blocks: 1543575
Pushed by jwajsberg@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a46ef03516e6
Request longer timeout for page info tests r=canaltinova
Whiteboard: [comment 28][stockwell disable-recommended] → [comment 28][Comment 33]

From the graph, it looks like the intermittent failure is fixed, but let's wait one more day to be sure.

Whiteboard: [comment 28][Comment 33][stockwell disable-recommended] → [comment 28][Comment 33][stockwell fixed:patch]

There's been one occurrence since then, but it seems different, and from the messages it even looks like it comes from code predating Nazım's patch (comment 17). So it's weird.

But anyway the "big" issue has been fixed.

Keywords: leave-open

Tentatively moving all bugs whose summaries mention "Fission" (or other Fission-related keywords) but are not assigned to a Fission Milestone to the "?" triage milestone.

This will generate a lot of bugmail, so you can filter your bugmail for the following UUID and delete them en masse:

0ee3c76a-bc79-4eb2-8d12-05dc0b68e732

Fission Milestone: --- → ?
Attachment #9107693 - Attachment is obsolete: true
Fission Milestone: ? → M4.1