Closed Bug 1615265 Opened 1 year ago Closed 8 months ago

Perma ThreadSanitizer: data race /builds/worker/workspace/build/src/js/xpconnect/loader/ScriptPreloader.cpp:930:17 in mozilla::ScriptPreloader::OffThreadDecodeCallback(JS::OffThreadToken*, void*) when Gecko 75 merges to Beta on 2020-03-09

Categories

(Core :: XPConnect, defect, P3)

defect

Tracking

()

RESOLVED FIXED
Tracking Status
firefox-esr68 --- unaffected
firefox73 --- unaffected
firefox74 --- unaffected
firefox75 + verified

People

(Reporter: noemi_erli, Assigned: kmag)

References

(Blocks 1 open bug)

Details

(Keywords: regression)

Attachments

(3 files, 1 obsolete file)

[Tracking Requested - why for this release]:

Central as beta: https://treeherder.mozilla.org/#/jobs?repo=try&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunnable&revision=048c63ff0b3cb2c6aa74490654b411862b07e03d&selectedJob=288692316

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288692316&repo=try&lineNumber=2196

[task 2020-02-13T11:20:50.946Z] 11:20:50 INFO - Buffered messages finished
[task 2020-02-13T11:20:50.947Z] 11:20:50 ERROR - TEST-UNEXPECTED-FAIL | automation.py | application terminated with exit code -6
[task 2020-02-13T11:20:50.948Z] 11:20:50 INFO - runtests.py | Application ran for: 0:03:00.081722
[task 2020-02-13T11:20:50.948Z] 11:20:50 INFO - zombiecheck | Reading PID log: /tmp/tmpbS_28opidlog
[task 2020-02-13T11:20:50.949Z] 11:20:50 INFO - ==> process 1949 launched child process 1966
[task 2020-02-13T11:20:50.950Z] 11:20:50 INFO - zombiecheck | Checking for orphan process with PID: 1966
[task 2020-02-13T11:20:50.950Z] 11:20:50 INFO - Traceback (most recent call last):
[task 2020-02-13T11:20:50.951Z] 11:20:50 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2852, in doTests
[task 2020-02-13T11:20:50.951Z] 11:20:50 INFO - e10s=options.e10s
[task 2020-02-13T11:20:50.952Z] 11:20:50 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2371, in runApp
[task 2020-02-13T11:20:50.952Z] 11:20:50 INFO - raise exc(value).with_traceback(tb)
[task 2020-02-13T11:20:50.953Z] 11:20:50 INFO - AttributeError: 'timeout' object has no attribute 'with_traceback'
[task 2020-02-13T11:20:50.954Z] 11:20:50 ERROR - Automation Error: Received unexpected exception while running application
[task 2020-02-13T11:20:50.954Z] 11:20:50 ERROR -
[task 2020-02-13T11:20:50.955Z] 11:20:50 INFO - Stopping web server
[task 2020-02-13T11:20:50.962Z] 11:20:50 INFO - Stopping web socket server
[task 2020-02-13T11:20:50.984Z] 11:20:50 INFO - Stopping ssltunnel
[task 2020-02-13T11:20:51.003Z] 11:20:51 WARNING - leakcheck | refcount logging is off, so leaks can't be detected!

Summary: Perma ThreadSanitizer: data race /builds/worker/workspace/build/src/js/xpconnect/loader/ScriptPreloader.cpp:930:17 in mozilla::ScriptPreloader::OffThreadDecodeCallback(JS::OffThreadToken*, void*) → Perma ThreadSanitizer: data race /builds/worker/workspace/build/src/js/xpconnect/loader/ScriptPreloader.cpp:930:17 in mozilla::ScriptPreloader::OffThreadDecodeCallback(JS::OffThreadToken*, void*) when Gecko 75 merges to Beta on 2020-03-09
Blocks: tsan

I'm not too familiar with TSan reports, but it looks like we're accessing mToken in MaybeFinishOffThreadDecode() without holding the lock? I don't know why this would only be failing on beta.

This race sounds familiar, we likely fixed this on central in the last cycle but didn't uplift (I can't find the bug right now, can look later again). In general, there might be quite a few races that have been fixed on central only without an uplift. So we probably need a general strategy here. Maybe for now, we shouldn't run TSan on beta at all until this has settled a bit.

(In reply to Christian Holler (:decoder) from comment #2)

This race sounds familiar, we likely fixed this on central in the last cycle but didn't uplift (I can't find the bug right now, can look later again). In general, there might be quite a few races that have been fixed on central only without an uplift. So we probably need a general strategy here. Maybe for now, we shouldn't run TSan on beta at all until this has settled a bit.

Isn't this for Gecko-75-as-beta? So it should contain all of the fixes on m-c.

Bug 1608462 is in the same callback, but a different error.

See Also: → 1608462

(In reply to Andrew McCreight [:mccr8] from comment #3)

(In reply to Christian Holler (:decoder) from comment #2)

This race sounds familiar, we likely fixed this on central in the last cycle but didn't uplift (I can't find the bug right now, can look later again). In general, there might be quite a few races that have been fixed on central only without an uplift. So we probably need a general strategy here. Maybe for now, we shouldn't run TSan on beta at all until this has settled a bit.

Isn't this for Gecko-75-as-beta? So it should contain all of the fixes on m-c.

Yes, it is.

Attachment #9126938 - Attachment description: TSan Race Information → TSan Race Information (autoland 45a6cdb1ddce5b86ed5360b078fa1c696f0693cc)

This also appeared intermittently on autoland now although I am not sure it is exactly the same issue. The problem here is that one stack is consistently missing (most likely caused by a suppression).

We can suppress ScriptPreloader::OffThreadDecodeCallback to make this work on beta, but I suspect we are hiding one or more bugs here that we are not aware of. It could be masked by suppressions for some of the other outstanding JS bugs.

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression

(In reply to Christian Holler (:decoder) from comment #7)

This also appeared intermittently on autoland now although I am not sure it is exactly the same issue. The problem here is that one stack is consistently missing (most likely caused by a suppression).

We can suppress ScriptPreloader::OffThreadDecodeCallback to make this work on beta, but I suspect we are hiding one or more bugs here that we are not aware of. It could be masked by suppressions for some of the other outstanding JS bugs.

I've used this for classifying on autoland too. Should we file another bug for that?

Flags: needinfo?(choller)

No, I can add the suppression in this bug. But I am afraid of adding such a generic suppression and if we can resolve this bug in time, I would prefer doing that.

Kris, I've looked into the one stack we have here and this is part of the code involved in the race:

  if (cache->mToken && !cache->mFinishDecodeRunnablePending) {
    cache->mFinishDecodeRunnablePending = true; // <--- Race Report here
    NS_DispatchToMainThread(
        NewRunnableMethod("ScriptPreloader::DoFinishOffThreadDecode", cache,
                          &ScriptPreloader::DoFinishOffThreadDecode));
  }

The only other location where we write to mFinishDecodeRunnablePending is in ScriptPreloader::DoFinishOffThreadDecode() where we set it to false and that method is not guarded by any locks.

When those two methods are racing, a runnable can be dispatched and mFinishDecodeRunnablePending be immediately set to false afterwards, which is probably undesired.

Flags: needinfo?(choller) → needinfo?(kmaglione+bmo)
Assignee: nobody → choller
Status: NEW → ASSIGNED
Pushed by choller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0f1a8e4c6a76
Suppress data race in OffThreadDecodeCallback. r=nbp

Christian,
Should we mark this as verify fixed since this is not happening anymore in beta simulations?

Flags: needinfo?(choller)
Status: NEW → ASSIGNED

kmag says he will make the racing field atomic. The race should be harmless because it is just an optimization.

P3

Fission Milestone: ? → ---
Priority: -- → P3

Tentatively assigning to kmag because he said he will make the racing field atomic.

Assignee: nobody → kmaglione+bmo

The leave-open keyword is there and there is no activity for 6 months.
:peterv, maybe it's time to close this bug?

Flags: needinfo?(peterv)
Flags: needinfo?(peterv)
Pushed by abeingessner@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6e297cea9592
Remove supression for seemingly fixed issue. r=decoder

oops, didn't need to be leave-open anymore

Status: NEW → RESOLVED
Closed: 8 months ago
Flags: needinfo?(kmaglione+bmo)
Resolution: --- → FIXED
Pushed by abeingessner@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e251b635944c
Backed out changeset 6e297cea9592 and applied proper change. r=decoder
Attachment #9184020 - Attachment is obsolete: true
See Also: → 1677913
You need to log in before you can comment on or make changes to this bug.