Open Bug 1532775 Opened 4 years ago Updated 2 months ago

windows/aarch64 - browser/components/places/tests/chrome/test_bug1163447_selectItems_through_shortcut.xul | uncaught exception - NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE)

Categories

(Firefox :: Bookmarks & History, defect, P5)

defect

Tracking

()

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

#[markdown(off)]
Filed by: egao [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=231972516&repo=try

https://queue.taskcluster.net/v1/task/eFQBBUKGQK6UbkWQWapknw/runs/0/artifacts/public/logs/live_backing.log

Observed during windows10-aarch64/opt-mochitest-chrome-1.

Context (error repeats four times):
20:26:27 INFO - TEST-UNEXPECTED-FAIL | browser/components/places/tests/chrome/test_bug1163447_selectItems_through_shortcut.xul | uncaught exception - NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsINavHistoryContainerResultNode.getChild] at PTV__getNodeForRow@chrome://browser/content/places/treeView.js:257:36
20:26:27 INFO - PTV_getCellText@chrome://browser/content/places/treeView.js:1411:21
20:26:27 INFO -
20:26:27 INFO - simpletestOnerror@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:1665:24
20:26:27 INFO - OnErrorEventHandlerNonNull*@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:1645:1
20:26:27 INFO - GECKO(6020) | JavaScript error: chrome://browser/content/places/treeView.js, line 257: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsINavHistoryContainerResultNode.getChild]
20:26:27 INFO - Not taking screenshot here: see the one that was previously logged

Following the NS_ERROR_ILLEGAL_VALUE occurrences this is also repeated four times:

20:26:27     INFO - TEST-UNEXPECTED-ERROR | browser/components/places/tests/chrome/test_bug1163447_selectItems_through_shortcut.xul | called finish() multiple times
20:26:27     INFO - TEST-INFO took 192ms
20:26:27    ERROR - [SimpleTest.finish()] this test already called finish!
20:26:27     INFO - TEST-UNEXPECTED-ERROR | browser/components/places/tests/chrome/test_bug1163447_selectItems_through_shortcut.xul | called finish() multiple times
20:26:27     INFO - TEST-INFO 
20:26:27    ERROR - [SimpleTest.finish()] this test already called finish!
20:26:27     INFO - TEST-UNEXPECTED-ERROR | browser/components/places/tests/chrome/test_bug1163447_selectItems_through_shortcut.xul | called finish() multiple times
20:26:27     INFO - TEST-INFO 
20:26:27    ERROR - [SimpleTest.finish()] this test already called finish!
20:26:27     INFO - TEST-UNEXPECTED-ERROR | browser/components/places/tests/chrome/test_bug1163447_selectItems_through_shortcut.xul | called finish() multiple times

Test disabled for windows10-aarch64 in https://phabricator.services.mozilla.com/D24309 due to consistent failure.

Edwin, shouldn't these also block bug 1515519 so that they get fixed or at least investigated before we ship aarch64? For instance, I don't have an obvious idea why this one test would fail, but the other tests would pass.

Who's on point for investigating and fixing these?

Flags: needinfo?(egao)

In short we have 100+ tests to green up and make sure we don't over disable- we do count on test owners to chime in. When there are obvious cases of a test that should pass or we need more information like this bug, lets bring in more people.

:dmajor - is this something you can look at?

Flags: needinfo?(egao) → needinfo?(dmajor)

I can reproduce the NS_ERROR_ILLEGAL_VALUE and called finish() multiple times on my arm machine but I haven't yet had time to debug.

I put some debugger logging in nsNavHistoryContainerResultNode::GetChild.

On a working x86_64 build, I got:

GetChild: this=1d486dffd80, aIndex=0 nChildren=3  ->OK
GetChild: this=1d486dffd80, aIndex=1 nChildren=3  ->OK
GetChild: this=1d486dffd80, aIndex=2 nChildren=3  ->OK
GetChild: this=1d486dffd80, aIndex=0 nChildren=3  ->OK
GetChild: this=1d486dffd80, aIndex=1 nChildren=3  ->OK
GetChild: this=1d486dffd80, aIndex=2 nChildren=3  ->OK
GetChild: this=1d4fc67f480, aIndex=0 nChildren=1  ->OK
GetChild: this=1d4fc67f480, aIndex=0 nChildren=1  ->OK

On the arm64 build, I got:

GetChild: this=196c926d1c0, aIndex=0 nChildren=3  ->OK
GetChild: this=196c926d1c0, aIndex=1 nChildren=3  ->OK
GetChild: this=196c926d1c0, aIndex=2 nChildren=3  ->OK
GetChild: this=196c926d1c0, aIndex=0 nChildren=3  ->OK
GetChild: this=196c926d1c0, aIndex=1 nChildren=3  ->OK
GetChild: this=196c926d1c0, aIndex=2 nChildren=3  ->OK
GetChild: this=196c926d580, aIndex=0 nChildren=1  ->OK
GetChild: this=196c926d580, aIndex=0 nChildren=1  ->OK
GetChild: this=196c926d1c0, aIndex=2 nChildren=0  ->INVALIDARG
GetChild: this=196c926d1c0, aIndex=2 nChildren=0  ->INVALIDARG
GetChild: this=196c926d1c0, aIndex=2 nChildren=0  ->INVALIDARG
GetChild: this=196c926d1c0, aIndex=1 nChildren=0  ->INVALIDARG
GetChild: this=196c926d1c0, aIndex=1 nChildren=0  ->INVALIDARG

In other words, the same pattern of successful calls, and then some bad ones. So the next question is why are we calling GetChild again on the first object after its children have been removed? This may be related to the "called finish() multiple times" problem. Are we kicking off more tasks than we should?

It's looking like C++ is the wrong level to debug this at. This could use some help from someone who can look at the browser JS to see why these calls are being made. Gijs, I believe you've worked on arm64 bugs before, would you be able to help with this one?

Flags: needinfo?(dmajor) → needinfo?(gijskruitbosch+bugs)

(Apologies for the delay, I hope to get to this tuesday morning)

I'm not actually getting much further with JS debugging. As far as I can tell the root cause here is basically that we have a <tree> in the document that is based on places data, and then while still rendering the tree we kill off the places data, and then things are very sad.

The simple workaround is to make sure the tree gets removed before we wipe all the places data.

That said, it's not clear why this doesn't happen everywhere. Just adding a sleep timeout either before or after clearing the DB doesn't make things break on my non-arm windows machine in the same way, so it's not a straight race condition AFAICT. Something is triggering these getCellText calls on arm that isn't doing so on other Windows machines.

The problem is that the callsites for getCellText are... in C++ land.
Marco or Mark, can you elaborate on how this is supposed to work (ie, is it just wrong we're getting the getCellText calls at all, or is something supposed to be cleaning up the treeview internals that isn't, or...?) or how to figure out what's tripping the getCellText calls?

Flags: needinfo?(standard8)
Flags: needinfo?(mak77)
Flags: needinfo?(gijskruitbosch+bugs)

I think Marco is probably in a better place to answer the getCellText calls.

I've had a think around this, and I don't really understand why it is only this test. I just looked back at bug 1163447 and whilst there is some c++ code handling that, it all looks sane from the 32 vs 64 bit point of view, though I doubt that'd affect it because otherwise we'd be seeing an issue on Linux as well.

Although it might be worth checking the itemIds look sane anyway.

I think the reason we get "called finish() multiple times" is simply because the multiple throws are caught and logged. I don't think we're running the test multiple times.

Flags: needinfo?(standard8)

(In reply to :Gijs (he/him) from comment #10)

I'm not actually getting much further with JS debugging. As far as I can tell the root cause here is basically that we have a <tree> in the document that is based on places data, and then while still rendering the tree we kill off the places data, and then things are very sad.

The tree should keep its data alive, and those are in a cycle until the tree destructor runs, thus I'm not sure I follow.

The simple workaround is to make sure the tree gets removed before we wipe all the places data.

What do you mean by "wiping all the places data", are you referring to the test calling bookmarks.eraseEverything(), or to some garbage collection killing the result?

That said, it's not clear why this doesn't happen everywhere. Just adding a sleep timeout either before or after clearing the DB doesn't make things break on my non-arm windows machine in the same way, so it's not a straight race condition AFAICT. Something is triggering these getCellText calls on arm that isn't doing so on other Windows machines.

Maybe it's just a timing difference, or a problem with cycle collection, or just the order in which things are cycle collected.

eraseEverything is just removing all the bookmarks, that causes bookmark notifications, that are catched by a Places result (nsINavHistoryResult) that notifies the view (treeView.js) through (likely) a nodeRemoved() call. This invalidates the tree row, and that invalidation may invoke getCellText.
Though, we await for it, so notifications should have happened already. I wonder if other notifications fire, so it may be worth, as a first step, to check which bookmark notifications are fired and when (registering a bookmarks observer).
It's possible a notification causes others, or some notification is delayed, and thus handled after the await.

Once the tree goes away, it should break the cycle and things should start being cycle collected. Maybe on ARM we collect things in a different order? One critical step here is that when the tree is going away, we ask the result to stop observing changes, here https://searchfox.org/mozilla-central/rev/b756e6d00728dda4121f8278a744381d8643317a/browser/components/places/content/treeView.js#1468
If we don't immediately detach the result from a dying tree, then surely we may get notifications on a zombie tree.

I'd like to be able to debug this, but I don't have an arm device at hand :(

Not sure I can help further.

Flags: needinfo?(mak77)
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.