Closed Bug 1291954 Opened 9 years ago Closed 7 years ago

Make SM(tsan) a tier 1 build

Tracking

()

Status:

VERIFIED FIXED

Milestone:

mozilla62

Tracking Flags:

Tracking

Status

firefox62

---

verified

People

(Reporter: sfink, Assigned: sfink)

References

(Blocks 1 open bug)

Details

Attachments

(8 files)

Report added tsan failures 8 years ago Steve Fink [:sfink] [:s:] 9.23 KB, patch	jseward : review+	Details \| Diff \| Splinter Review
Turn tsan into a canary 8 years ago Steve Fink [:sfink] [:s:] 1.07 KB, patch	jseward : review+	Details \| Diff \| Splinter Review
Avoid problems with timeouts when running under tsan 8 years ago Steve Fink [:sfink] [:s:] 5.63 KB, patch	jonco : review+	Details \| Diff \| Splinter Review
Skip wasm/asm.js for tsan test 8 years ago Steve Fink [:sfink] [:s:] 2.89 KB, patch	jonco : review+	Details \| Diff \| Splinter Review
Skip tests that depend on the wasm signal handler, since tsan interferes with it 8 years ago Steve Fink [:sfink] [:s:] 4.83 KB, patch	jonco : review+	Details \| Diff \| Splinter Review
Update tsan failures to current set and move SM(tsan) to tier 1 8 years ago Steve Fink [:sfink] [:s:] 3.94 KB, patch	jonco : review+	Details \| Diff \| Splinter Review
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3553 7 years ago GitHub Bugzilla PR Linker 47 bytes, text/x-github-pull-request		Details \| Review
Promote SM(tsan) to tier 1 7 years ago Steve Fink [:sfink] [:s:] 950 bytes, patch	aryx : review+	Details \| Diff \| Splinter Review

Steve Fink [:sfink] [:s:]

Assignee

Description

•

9 years ago

Updated

•

9 years ago

Blocks: 1289646

Steve Fink [:sfink] [:s:]

Assignee

Updated

•

9 years ago

Blocks: 1311733

Steve Fink [:sfink] [:s:]

Assignee

Comment 1

•

8 years ago

Attached patch Report added tsan failures — Details — Splinter Review

I don't even remember what I was waiting on to land tsan as a tier 1 job, but I know I felt it was at least somewhat important to get somewhat decent reporting of added failures. This commit also bumps up the number of expected failures from 14 to 20 unique locations, which is not great.

Attachment #8840728 - Flags: review?(jseward)

Steve Fink [:sfink] [:s:]

Assignee

Updated

•

8 years ago

Assignee: nobody → sphink

Status: NEW → ASSIGNED

Steve Fink [:sfink] [:s:]

Assignee

Comment 2

•

8 years ago

Attached patch Turn tsan into a canary — Details — Splinter Review

Ah! *This* is the important piece. Without this, a tsan error will make the test fail, which will turn the job red even if it is one of the expected errors.

Attachment #8840733 - Flags: review?(jseward)

Steve Fink [:sfink] [:s:]

Assignee

Comment 3

•

8 years ago

Two remaining problems. A bunch of tests are timing out when run under tsan. Unfortunately, they're jit-tests, which means we don't have a nice mechanism for logging all of the tests that take longer than n seconds so we can skip them. Hopefully the log will have enough info, or I can add in that option there too. (It's really annoying having 2 very similar test harnesses.) A bunch of new races have been added, probably from the new thread work bhackett has been doing. See bug 1342473. Currently, it looks like it's probably just that the JS test shell is using things in a way not yet supported. (The setup is for cooperative threads, and it looks like shell workers run concurrently on the same runtime.)

Steve Fink [:sfink] [:s:]

Assignee

Updated

•

8 years ago

Depends on: 1342473

Julian Seward [:jseward]

Comment 4

•

8 years ago

Comment on attachment 8840728 [details] [diff] [review] Report added tsan failures Review of attachment 8840728 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/devtools/automation/variants/tsan @@ +7,5 @@ > "JITTEST_EXTRA_ARGS": "--jitflags=debug --ignore-timeouts={DIR}/cgc-jittest-timeouts.txt", > "JSTESTS_EXTRA_ARGS": "--exclude-file={DIR}/cgc-jstests-slow.txt", > "TSAN_OPTIONS": "log_path={OUTDIR}/sanitize_log" > }, > + "expect-errors": [ It would be nice to add a comment here explaining why some of these entries are duplicated. If not here, maybe in the Python part?

Attachment #8840728 - Flags: review?(jseward) → review+

Julian Seward [:jseward]

Updated

•

8 years ago

Attachment #8840733 - Flags: review?(jseward) → review+

Steve Fink [:sfink] [:s:]

Assignee

Updated

•

8 years ago

Keywords: leave-open

Pulsebot

Comment 5

•

8 years ago

Pushed by sfink@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/dccd52fb0a8d Turn tsan into a canary, r=jseward https://hg.mozilla.org/integration/mozilla-inbound/rev/5806128b981e Report added tsan failures, r=jseward

Carsten Book [:Tomcat]

Comment 6

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/dccd52fb0a8d https://hg.mozilla.org/mozilla-central/rev/5806128b981e

Steve Fink [:sfink] [:s:]

Assignee

Updated

•

8 years ago

Depends on: 1362239

Steve Fink [:sfink] [:s:]

Assignee

Comment 7

•

8 years ago

Attached patch Avoid problems with timeouts when running under tsan — Details — Splinter Review

Message in the commit message: Several jit-tests expect to timeout, and are annotated with an expected status code. Currently, we have to force tsan to report a zero status if it finds an error, since otherwise it will cause lots of tests to fail (due to hitting a tsan-detectable problem.) But those zero exit statuses cause the test to fail. Add --unusable-error-status to treat those as passing. Now, it turns out that this was *not* the reason for those tests to be failing. They didn't have any tsan errors, so did not trigger the forced-exitstatus setting. But this patch still seems worthwhile if we *do* run into that situation (with any expected error test, not just timeouts.)

Attachment #8865618 - Flags: review?(jcoppeard)

Steve Fink [:sfink] [:s:]

Assignee

Comment 8

•

8 years ago

Attached patch Skip wasm/asm.js for tsan test — Details — Splinter Review

After the next patch I'm going to post, the main purpose of this patch is to tune the set of flags that tsan is running with.

Attachment #8865619 - Flags: review?(jcoppeard)

Steve Fink [:sfink] [:s:]

Assignee

Comment 9

•

8 years ago

Attached patch Skip tests that depend on the wasm signal handler, since tsan interferes with it — Details — Splinter Review

It turns out that it's some sort of subtle interaction with how tsan manages signal handlers and the signal handler we use for wasm. Just skip these tests for now.

Attachment #8865621 - Flags: review?(jcoppeard)

Jon Coppeard (:jonco)

Comment 10

•

8 years ago

Comment on attachment 8865618 [details] [diff] [review] Avoid problems with timeouts when running under tsan Review of attachment 8865618 [details] [diff] [review]: ----------------------------------------------------------------- Yes, seems sensible.

Attachment #8865618 - Flags: review?(jcoppeard) → review+

Jon Coppeard (:jonco)

Updated

•

8 years ago

Attachment #8865619 - Flags: review?(jcoppeard) → review+

Jon Coppeard (:jonco)

Updated

•

8 years ago

Attachment #8865621 - Flags: review?(jcoppeard) → review+

Pulsebot

Comment 11

•

8 years ago

Pushed by sfink@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/3b3067e901d4 Avoid problems with timeouts when running under tsan, r=jonco https://hg.mozilla.org/integration/mozilla-inbound/rev/e03be5e59295 Skip wasm/asm.js for tsan test, r=jonco https://hg.mozilla.org/integration/mozilla-inbound/rev/3a43c13ce065 Skip tests that depend on the wasm signal handler, since tsan interferes with it, r=jonco

Steve Fink [:sfink] [:s:]

Assignee

Comment 12

•

8 years ago

Attached patch Update tsan failures to current set and move SM(tsan) to tier 1 — Details — Splinter Review

jonco, nothing much here to review. This moves SM(tsan) to tier 1. I can't think of any remaining reason not to now that it's green. I'll try to fix some of the failures in fast followups.

Attachment #8867031 - Flags: review?(jcoppeard)

Steve Fink [:sfink] [:s:]

Assignee

Updated

•

8 years ago

Blocks: 1364287

Steve Fink [:sfink] [:s:]

Assignee

Updated

•

8 years ago

Depends on: 1364288

Jon Coppeard (:jonco)

Comment 13

•

8 years ago

Comment on attachment 8867031 [details] [diff] [review] Update tsan failures to current set and move SM(tsan) to tier 1 Review of attachment 8867031 [details] [diff] [review]: ----------------------------------------------------------------- \o/

Attachment #8867031 - Flags: review?(jcoppeard) → review+

Carsten Book [:Tomcat]

Comment 14

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/3b3067e901d4 https://hg.mozilla.org/mozilla-central/rev/e03be5e59295 https://hg.mozilla.org/mozilla-central/rev/3a43c13ce065

Steve Fink [:sfink] [:s:]

Assignee

Comment 15

•

8 years ago

I don't think this verification approach is going to work. tsan is a dynamic checker, so the error count is random. Also, the reports of "expected error XXX not seen" are noise. I'm thinking a workable approach would be to run a bunch of times, collect all failures seen, and add them to the expected set. Then make the job fail whenever any new error appears, regardless of the error count. We will just have to accept that low-probability errors will appear occasionally, and we'll need to add them to the expected set. Distinguishing these from legitimate new errors will be the tough part. Also, this raises the question of when an error can be removed from the expected set. I think for now, we'll have to rely on people to remove errors when they believe they are fixing them, and allowing some number of stale errors to pile up over time. We can periodically purge. A better approach would be to report the results of every run to a common place, and notice when an error hasn't appeared in N runs. That smells like future work. Unfortunately, all of this does mean that new errors can hide on a per-file basis. Since line numbers are unstable, the expectation is a mapping of filenames to error counts. So if you introduce a new error in foo.cpp that already expects 3 errors, and you don't happen to hit one of the known errors, then the job will pass and some random later patch will be blamed when it reports 4 errors in that file. Hopefully this shouldn't come up often, and the best fix would be getting down to zero errors anyway.

Pulsebot

Comment 16

•

8 years ago

Pushed by sfink@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/2db29775e18f Update tsan failures to current set and move SM(tsan) to tier 1, r=jonco

Pulsebot

Comment 17

•

8 years ago

Pushed by philringnalda@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/64e6fe94b7b0 followup, revert the part of changeset 2db29775e18f which prematurely declared SM(tsan) tier-1

Phil Ringnalda (:philor)

Comment 18

•

8 years ago

No, absolutely not. https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Usable_job_logs - a TinderboxPrint which appears in the never-seen Job Details panel is not enough, we never should have let you get away with the way hazards are displayed, but at least they are better than this. It's not even tier-2 until it displays failure messages in a treeherder-parsable way in the Failure Summary panel. https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Runs_on_mozilla-central_and_all_trees_that_merge_into_it - I believe that it's the "run-on-projects: []", which my vague memory says is how you say that you only want a job to run on Try and only when someone explicitly asks for it, but it doesn't run on any trees other than Try when you ask for it, making it the weirdest tier-1 ever. https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Low_intermittent_failure_rate - five runs to declare victory was far too few, so I triggered another 45 on https://treeherder.mozilla.org/#/jobs?repo=try&revision=4c58a67710e6ffc72c8ebe049ab9425fe62c4f9a&group_state=expanded&noautoclassify and that's a 22% failure rate, not the barely tolerable for a new suite 5%.

Steve Fink [:sfink] [:s:]

Assignee

Updated

•

8 years ago

Depends on: 1366522

Firefox Bug Husbandry Bot

Updated

•

8 years ago

Keywords: triage-deferred

Priority: -- → P3

Ted Campbell [:tcampbell]

Updated

•

7 years ago

Depends on: 1458240

Jan de Mooij [:jandem]

Comment 19

•

7 years ago

This seems fine now: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=1f6d4b0df2226e001496060c4f4ba439e2b1f143&filter-tier=1&filter-tier=2&filter-tier=3&filter-searchStr=tsan&group_state=expanded We also dump errors to stderr now.

Flags: needinfo?(sphink)

Steve Fink [:sfink] [:s:]

Assignee

Comment 20

•

7 years ago

sheriffs - anything else you'd want us to do before making this tier 1? The known issues have been miraculously fixed, and all of the fancy output processing I had has been ripped out so that I believe this should report in the same way as eg asan builds. (Also, re: comment 18: I hadn't realized the hazard error reporting was a problem, so when philor complained I immediately jumped up, paused for a year, then filed bug 1459378. Time is slippery.)

Flags: needinfo?(sphink)

Flags: needinfo?(philringnalda)

Flags: needinfo?(aryx.bugmail)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 21

•

7 years ago

Looks good to me.

Flags: needinfo?(aryx.bugmail)

Phil Ringnalda (:philor)

Comment 22

•

7 years ago

Dunno, I guess maybe? There have been zero failures since last Friday, which is nice but makes it impossible to assess whether or not it actually displays its errors reasonably.

Flags: needinfo?(philringnalda)

Jan de Mooij [:jandem]

Comment 23

•

7 years ago

NI to make sure we don't forget about this (I'm worried it will regress quickly if we don't make it tier 1).

Flags: needinfo?(sphink)

GitHub Bugzilla PR Linker

Comment 24

•

7 years ago

Attached file Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3553 — Details

Steve Fink [:sfink] [:s:]

Assignee

Comment 25

•

7 years ago

I wanted to check out what failures look like, so I introduced a bunch of races (it didn't complain about my first attempt!). I can't say it looks all that great: https://treeherder.mozilla.org/#/jobs?repo=try&author=sfink@mozilla.com&fromchange=02a45e75c05f98df89f87e1171e47b64f6309020&selectedJob=178122029&filter-tier=1&filter-tier=2&filter-tier=3 The number of errors is deceptive because this is failing on many, many tests. I scanned through try for a recent tsan failure. Here's one: https://treeherder.mozilla.org/#/jobs?repo=try&revision=89fdae9f4f5e6dbcf0abb820fd04b6cdaf3c953f&filter-searchStr=tsan&filter-tier=1&filter-tier=2&filter-tier=3&selectedJob=178305107 Its Failure Summary, in its entirety, is: TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/gc/bug-1136597.js | ================== (code 66, args "") [1.1 s] make[1]: *** [check-jit-test] Error 2 make: *** [check-jit-test] Error 2 Yuck. One thing that would make this much better, good enough for landing, is to just do what AddressSanitizer does and highlight the SUMMARY: line on failures. I've opened the PR https://github.com/mozilla/treeherder/pull/3553 for that. So I think I'll go ahead and land this tier-1 now.

Steve Fink [:sfink] [:s:]

Assignee

Comment 26

•

7 years ago

Attached patch Promote SM(tsan) to tier 1 — Details — Splinter Review

Ok, maybe I'll get sheriff review first, just to cover my argyle socks.

Attachment #8976025 - Flags: review?(aryx.bugmail)

Steve Fink [:sfink] [:s:]

Assignee

Updated

•

7 years ago

Flags: needinfo?(sphink)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

7 years ago

Attachment #8976025 - Flags: review?(aryx.bugmail) → review+

Treeherder GitHub Bugbot

Comment 27

•

7 years ago

Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/ffdb06fc37b414cf77e03b532b7bcfe46f3d0ef1 Bug 1291954 - Highlight tsan errors (#3553)

Pulsebot

Comment 28

•

7 years ago

Pushed by sfink@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/a4257fbc42c8 Promote SM(tsan) to tier 1, r=Aryx

Pulsebot

Comment 29

•

7 years ago

Backout by sfink@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/88d835e400ea Backed out changeset a4257fbc42c8 because tsan is permafail right now

Steve Fink [:sfink] [:s:]

Assignee

Updated

•

7 years ago

Depends on: 1459761

Pulsebot

Comment 30

•

7 years ago

Pushed by sfink@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/4481e5d95c62 Promote SM(tsan) to tier 1, r=Aryx

Jan de Mooij [:jandem]

Comment 31

•

7 years ago

(In reply to Pulsebot from comment #30) > Pushed by sfink@mozilla.com: > https://hg.mozilla.org/integration/mozilla-inbound/rev/4481e5d95c62 > Promote SM(tsan) to tier 1, r=Aryx \o/

Dorel Luca [:dluca]

Comment 32

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/4481e5d95c62

Jon Coppeard (:jonco)

Comment 33

•

7 years ago

Nice!

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

status-firefox62: --- → fixed

Keywords: leave-open, triage-deferred

Resolution: --- → FIXED

Target Milestone: --- → mozilla62

Ryan VanderMeulen [:RyanVM]

Comment 34

•

7 years ago

Confirmed running and visible on Treeherder :)

Status: RESOLVED → VERIFIED

status-firefox62: fixed → verified

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 35

•

7 years ago

Please add documentation for this test suite to https://developer.mozilla.org/en-US/docs/Mozilla/QA/Automated_testing and also explain what kind of failures shall be reported as restricted security bugs and which ones not. Thank you.

Flags: needinfo?(sphink)

Steve Fink [:sfink] [:s:]

Assignee

Comment 36

•

7 years ago

(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #35) > Please add documentation for this test suite to > https://developer.mozilla.org/en-US/docs/Mozilla/QA/Automated_testing and It is already covered under the SM(...) row, which links to <https://wiki.mozilla.org/Javascript:Automation_Builds>. I have replaced the contents of the tsan section on that page to remove the outdated whitelist info. Is that adequate? I could add a row specifically for this job, but it would turn out to be a complete duplicate of the SM(...) row. > also explain what kind of failures shall be reported as restricted security > bugs and which ones not. Thank you. I put in some weaselly language for that "ThreadSanitizer failures are generally not security-critical, though they can be." But my confidence in that assessment is not high. Mostly, if this build fails on the initial push of a problem, then it's somewhat likely that it could be a security issue -- but it doesn't really matter with this tier 1, since it'll immediately get backed out anyway. The intermittents are more problematic, and while I could imagine them indeed being security sensitive, I'm not sure if they are any more so than other intermittents? I dunno, other people may feel wildly differently.

Flags: needinfo?(sphink) → needinfo?(aryx.bugmail)

Steve Fink [:sfink] [:s:]

Assignee

Comment 37

•

7 years ago

Ok, that seemed a little too glib, so I updated to "ThreadSanitizer failures are generally not security sensitive, though they can be. The usual rules apply -- eg if it's in GC code, it should be assumed to be sensitive until investigated."

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 38

•

7 years ago

Thank you, that works.

Flags: needinfo?(aryx.bugmail)

Nicholas Nethercote [inactive]

Comment 39

•

7 years ago

Sounds reasonable to me too. I would say that TSAN errors aren't security sensitive the way ASAN errors are, for example.

You need to log in before you can comment on or make changes to this bug.