1532775 - windows/aarch64 - browser/components/places/tests/chrome/test_bug1163447_selectItems_through_shortcut.xul | uncaught exception - NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE)

Treeherder Bug Filer

Reporter

Description

•

6 years ago

treeherder

#[markdown(off)]
Filed by: egao [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=231972516&repo=try

https://queue.taskcluster.net/v1/task/eFQBBUKGQK6UbkWQWapknw/runs/0/artifacts/public/logs/live_backing.log

Observed during windows10-aarch64/opt-mochitest-chrome-1.

Context (error repeats four times):
20:26:27 INFO - TEST-UNEXPECTED-FAIL | browser/components/places/tests/chrome/test_bug1163447_selectItems_through_shortcut.xul | uncaught exception - NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsINavHistoryContainerResultNode.getChild] at PTV__getNodeForRow@chrome://browser/content/places/treeView.js:257:36
20:26:27 INFO - PTV_getCellText@chrome://browser/content/places/treeView.js:1411:21
20:26:27 INFO -
20:26:27 INFO - simpletestOnerror@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:1665:24
20:26:27 INFO - OnErrorEventHandlerNonNull*@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:1645:1
20:26:27 INFO - GECKO(6020) | JavaScript error: chrome://browser/content/places/treeView.js, line 257: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsINavHistoryContainerResultNode.getChild]
20:26:27 INFO - Not taking screenshot here: see the one that was previously logged

Edwin Takahashi (:egao | infrequent contributor)

Updated

•

6 years ago

Keywords: regression

Edwin Takahashi (:egao | infrequent contributor)

Comment 1

•

6 years ago

Following the NS_ERROR_ILLEGAL_VALUE occurrences this is also repeated four times:

20:26:27     INFO - TEST-UNEXPECTED-ERROR | browser/components/places/tests/chrome/test_bug1163447_selectItems_through_shortcut.xul | called finish() multiple times
20:26:27     INFO - TEST-INFO took 192ms
20:26:27    ERROR - [SimpleTest.finish()] this test already called finish!
20:26:27     INFO - TEST-UNEXPECTED-ERROR | browser/components/places/tests/chrome/test_bug1163447_selectItems_through_shortcut.xul | called finish() multiple times
20:26:27     INFO - TEST-INFO 
20:26:27    ERROR - [SimpleTest.finish()] this test already called finish!
20:26:27     INFO - TEST-UNEXPECTED-ERROR | browser/components/places/tests/chrome/test_bug1163447_selectItems_through_shortcut.xul | called finish() multiple times
20:26:27     INFO - TEST-INFO 
20:26:27    ERROR - [SimpleTest.finish()] this test already called finish!
20:26:27     INFO - TEST-UNEXPECTED-ERROR | browser/components/places/tests/chrome/test_bug1163447_selectItems_through_shortcut.xul | called finish() multiple times

Comment hidden (Intermittent Failures Robot)

Edwin Takahashi (:egao | infrequent contributor)

Comment 4

•

6 years ago

Test disabled for windows10-aarch64 in https://phabricator.services.mozilla.com/D24309 due to consistent failure.

Edwin Takahashi (:egao | infrequent contributor)

Updated

•

6 years ago

Blocks: win10-aarch64-disabled-mochitest

Mark Banner (:standard8)

Comment 5

•

6 years ago

Edwin, shouldn't these also block bug 1515519 so that they get fixed or at least investigated before we ship aarch64? For instance, I don't have an obvious idea why this one test would fail, but the other tests would pass.

Who's on point for investigating and fixing these?

Flags: needinfo?(egao)

Joel Maher ( :jmaher ) (UTC -8)

Comment 6

•

6 years ago

In short we have 100+ tests to green up and make sure we don't over disable- we do count on test owners to chime in. When there are obvious cases of a test that should pass or we need more information like this bug, lets bring in more people.

:dmajor - is this something you can look at?

Flags: needinfo?(egao) → needinfo?(dmajor)

(Away)

Comment 7

•

6 years ago

I can reproduce the NS_ERROR_ILLEGAL_VALUE and called finish() multiple times on my arm machine but I haven't yet had time to debug.

(Away)

Comment 8

•

6 years ago

I put some debugger logging in nsNavHistoryContainerResultNode::GetChild.

On a working x86_64 build, I got:

GetChild: this=1d486dffd80, aIndex=0 nChildren=3  ->OK
GetChild: this=1d486dffd80, aIndex=1 nChildren=3  ->OK
GetChild: this=1d486dffd80, aIndex=2 nChildren=3  ->OK
GetChild: this=1d486dffd80, aIndex=0 nChildren=3  ->OK
GetChild: this=1d486dffd80, aIndex=1 nChildren=3  ->OK
GetChild: this=1d486dffd80, aIndex=2 nChildren=3  ->OK
GetChild: this=1d4fc67f480, aIndex=0 nChildren=1  ->OK
GetChild: this=1d4fc67f480, aIndex=0 nChildren=1  ->OK

On the arm64 build, I got:

GetChild: this=196c926d1c0, aIndex=0 nChildren=3  ->OK
GetChild: this=196c926d1c0, aIndex=1 nChildren=3  ->OK
GetChild: this=196c926d1c0, aIndex=2 nChildren=3  ->OK
GetChild: this=196c926d1c0, aIndex=0 nChildren=3  ->OK
GetChild: this=196c926d1c0, aIndex=1 nChildren=3  ->OK
GetChild: this=196c926d1c0, aIndex=2 nChildren=3  ->OK
GetChild: this=196c926d580, aIndex=0 nChildren=1  ->OK
GetChild: this=196c926d580, aIndex=0 nChildren=1  ->OK
GetChild: this=196c926d1c0, aIndex=2 nChildren=0  ->INVALIDARG
GetChild: this=196c926d1c0, aIndex=2 nChildren=0  ->INVALIDARG
GetChild: this=196c926d1c0, aIndex=2 nChildren=0  ->INVALIDARG
GetChild: this=196c926d1c0, aIndex=1 nChildren=0  ->INVALIDARG
GetChild: this=196c926d1c0, aIndex=1 nChildren=0  ->INVALIDARG

In other words, the same pattern of successful calls, and then some bad ones. So the next question is why are we calling GetChild again on the first object after its children have been removed? This may be related to the "called finish() multiple times" problem. Are we kicking off more tasks than we should?

It's looking like C++ is the wrong level to debug this at. This could use some help from someone who can look at the browser JS to see why these calls are being made. Gijs, I believe you've worked on arm64 bugs before, would you be able to help with this one?

Flags: needinfo?(dmajor) → needinfo?(gijskruitbosch+bugs)

:Gijs (out for now; he/him)

Comment 9

•

6 years ago

(Apologies for the delay, I hope to get to this tuesday morning)

:Gijs (out for now; he/him)

Comment 10

•

6 years ago

Attached patch Patch that unbreaks my test on my arm device — Details — Splinter Review

I'm not actually getting much further with JS debugging. As far as I can tell the root cause here is basically that we have a <tree> in the document that is based on places data, and then while still rendering the tree we kill off the places data, and then things are very sad.

The simple workaround is to make sure the tree gets removed before we wipe all the places data.

That said, it's not clear why this doesn't happen everywhere. Just adding a sleep timeout either before or after clearing the DB doesn't make things break on my non-arm windows machine in the same way, so it's not a straight race condition AFAICT. Something is triggering these getCellText calls on arm that isn't doing so on other Windows machines.

The problem is that the callsites for getCellText are... in C++ land.
Marco or Mark, can you elaborate on how this is supposed to work (ie, is it just wrong we're getting the getCellText calls at all, or is something supposed to be cleaning up the treeview internals that isn't, or...?) or how to figure out what's tripping the getCellText calls?

Flags: needinfo?(standard8)

Flags: needinfo?(mak77)

Flags: needinfo?(gijskruitbosch+bugs)

Mark Banner (:standard8)

Comment 11

•

6 years ago

I think Marco is probably in a better place to answer the getCellText calls.

I've had a think around this, and I don't really understand why it is only this test. I just looked back at bug 1163447 and whilst there is some c++ code handling that, it all looks sane from the 32 vs 64 bit point of view, though I doubt that'd affect it because otherwise we'd be seeing an issue on Linux as well.

Although it might be worth checking the itemIds look sane anyway.

I think the reason we get "called finish() multiple times" is simply because the multiple throws are caught and logged. I don't think we're running the test multiple times.

Flags: needinfo?(standard8)

Marco Bonardo [:mak]

Comment 12

•

6 years ago

(In reply to :Gijs (he/him) from comment #10)

I'm not actually getting much further with JS debugging. As far as I can tell the root cause here is basically that we have a <tree> in the document that is based on places data, and then while still rendering the tree we kill off the places data, and then things are very sad.

The tree should keep its data alive, and those are in a cycle until the tree destructor runs, thus I'm not sure I follow.

The simple workaround is to make sure the tree gets removed before we wipe all the places data.

What do you mean by "wiping all the places data", are you referring to the test calling bookmarks.eraseEverything(), or to some garbage collection killing the result?

That said, it's not clear why this doesn't happen everywhere. Just adding a sleep timeout either before or after clearing the DB doesn't make things break on my non-arm windows machine in the same way, so it's not a straight race condition AFAICT. Something is triggering these getCellText calls on arm that isn't doing so on other Windows machines.

Maybe it's just a timing difference, or a problem with cycle collection, or just the order in which things are cycle collected.

eraseEverything is just removing all the bookmarks, that causes bookmark notifications, that are catched by a Places result (nsINavHistoryResult) that notifies the view (treeView.js) through (likely) a nodeRemoved() call. This invalidates the tree row, and that invalidation may invoke getCellText.
Though, we await for it, so notifications should have happened already. I wonder if other notifications fire, so it may be worth, as a first step, to check which bookmark notifications are fired and when (registering a bookmarks observer).
It's possible a notification causes others, or some notification is delayed, and thus handled after the await.

Once the tree goes away, it should break the cycle and things should start being cycle collected. Maybe on ARM we collect things in a different order? One critical step here is that when the tree is going away, we ask the result to stop observing changes, here https://searchfox.org/mozilla-central/rev/b756e6d00728dda4121f8278a744381d8643317a/browser/components/places/content/treeView.js#1468
If we don't immediately detach the result from a dying tree, then surely we may get notifications on a zombie tree.

I'd like to be able to debug this, but I don't have an arm device at hand :(

Marco Bonardo [:mak]

Comment 13

•

5 years ago

Not sure I can help further.

Flags: needinfo?(mak77)

BMO Automation

Updated

•

2 years ago

Severity: normal → S3

Bugzilla

windows/aarch64 - browser/components/places/tests/chrome/test_bug1163447_selectItems_through_shortcut.xul | uncaught exception - NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE)

Categories

(Firefox :: Bookmarks & History, defect, P5)

Tracking

()

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Blocks 1 open bug)

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Updated

Comment 1

Comment 3

Comment 4

Updated

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Updated

Attachment

General

Description

File Name

Content Type