Open Bug 1172048 Opened 9 years ago Updated 2 years ago

add additional data to SETA to denote which test files fail in addition to the job


(Testing :: General, defect)



(firefox41 affected)

Tracking Status
firefox41 --- affected


(Reporter: jmaher, Unassigned)



(2 files)

Having more raw data about what is failing will allow us to make better decisions as we change chunks.

My proposal: for each new regression, we analyze the log/failure/etc. and come up with the root cause.  We store this in a new field in the DB.  In fact, I want 2 new fields: test file and test dir.

I am not sure if we parse the logs, or if we do other things like finding it in Treeherder, Bugzilla, etc.

Ideally we will be able to produce a list of directories which we could run for the root cause.
Kyle, do you know if we could query this info from activedata?
Flags: needinfo?(klahnakoski)
Probably.  ActiveData now has subtest-level detail, with messages, and includes the full path of the test, so I think what you need is in there.  I suggest we start with a specific failure so I can show you what the data looks like and understand how you want to use it.
Flags: needinfo?(klahnakoski)
here is a commit:

there are 4 failures found, which are:
fixed by commit 5e803cc779dc:
linux64 debug M-dt2

and fixed by commit 17eb57fa30c7:
linux32 pgo m-bc1
linux32 pgo m-e10s-bc1
windows xp debug crashtest

For the 4 failures, I want to know the failing test case and test dir:
linux64 debug M-dt2:
* browser/devtools/projecteditor/test/browser_projecteditor_contextmenu_02.js
* browser/devtools/projecteditor/test

linux32 m-[e10s-]bc1:
* browser/base/content/test/general/browser_parsable_script.js
* browser/base/content/test/general

windows xp debug crashtest:
* tests/dom/media/test/crashtests/868504.html (crash then leak)

this isn't perfect, but I think getting something will go a long way towards helping us understand issues better.
Attached file Failures seen
Here is a list of the test failures I have seen.  I apologize for the format of the `result.subtests`:  I could not get the JSON of those nested documents to flatten better [2].   The JSON format is fine [1], and may be well suited to automations, but is harder for humans to cross-reference.

> {
> 	"from":"unittest",
> 	"where":{"and":[{"eq":{"build.revision12":"7380457b8ba0","result.ok":false}}]},
> 	"limit":1000,
> 	"format":"list"
> }
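A query like the one above is just JSON POSTed to the ActiveData query service; here is a minimal sketch of building it in Python (the endpoint URL and the helper name `build_failure_query` are my own, not part of ActiveData):

```python
import json

# Assumed endpoint; adjust if the service moves.
ACTIVEDATA_URL = "https://activedata.allizom.org/query"

def build_failure_query(revision12, limit=1000):
    """Build an ActiveData query for all failing results on a revision."""
    return {
        "from": "unittest",
        "where": {"and": [
            {"eq": {"build.revision12": revision12}},
            {"eq": {"result.ok": False}},
        ]},
        "limit": limit,
        "format": "list",
    }

# The query dict would be POSTed as JSON, e.g.:
#   urllib.request.urlopen(ACTIVEDATA_URL, json.dumps(query).encode())
query = build_failure_query("7380457b8ba0")
print(json.dumps(query, indent=2))
```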

thanks for this info Kyle!

I cannot find the windows xp debug crashtest issue in the list, but I do see the other ones.  I assume that since reftests are not using structured logs, that explains why we don't have it.

Also, there is no correlation to the chunk number; right now that is the only way we can determine how the job was run.  Instead of mochitest-devtools-chrome-2, I see mochitest-devtools-chrome.

Lastly, there are quite a few other errors in the list.  Even after narrowing it down to the platforms we care about, it is difficult to figure out which are caused by a real failure vs. an intermittent.  Do we have Treeherder failure classifications yet?  I could have overlooked them.

We might be close to a solution here.
On reftests:  Yes, ActiveData only ingests structured logs, so if reftests do not generate them, then we are out of luck.

On chunks:  My apologies, I must have accidentally deleted them from the spreadsheet.  Looking at the raw JSON [1]: The chunk number is there.

I believe the next step is to show the history of each of these tests, which is intended to give an idea of how often each is failing, and to include the bug number (with description and status) being used to track each.

FWIW reftest structured logging is bug 1034290. I think we should pick that up in Q3.
Pulling in Treeherder classifications would be good.  Cam, is there a way Kyle can query the Treeherder API for all of the classifications on a given tree and revision?
Flags: needinfo?(cdawson)
I met with Kyle on Vidyo and we went over how I get the failure classification (fc) for a given revision:

1) get result_set_id:
** look for "id" at the end
2) get jobs list using result_set_id:

now look for all jobs that have a status != 'success', and then you can take the jobid for the failing job and get the failure_classification as well as notes:

We would have to apply the failure_classification for a job to all failures inside that job.  There are some edge cases here, but it moves the needle.
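The steps above can be sketched roughly as follows; the endpoint paths are my best recollection of the Treeherder REST API and should be treated as assumptions, as should the helper names:

```python
import json
from urllib.request import urlopen

TREEHERDER = "https://treeherder.mozilla.org/api"  # assumed base URL

def get_json(url):
    """Hypothetical helper: fetch a URL and decode its JSON body."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode())

def failed_jobs(jobs):
    """Keep only jobs whose result is not 'success'."""
    return [j for j in jobs if j.get("result") != "success"]

def classifications_for_revision(repo, revision):
    # 1) resolve the revision to a result_set_id (the "id" in the response)
    rs = get_json(f"{TREEHERDER}/project/{repo}/resultset/?revision={revision}")
    result_set_id = rs["results"][0]["id"]
    # 2) list jobs for that result set
    jobs = get_json(
        f"{TREEHERDER}/project/{repo}/jobs/?result_set_id={result_set_id}"
    )["results"]
    # 3) for each failing job, fetch its failure_classification and notes
    out = []
    for job in failed_jobs(jobs):
        notes = get_json(f"{TREEHERDER}/project/{repo}/note/?job_id={job['id']}")
        out.append((job["id"], job.get("failure_classification_id"), notes))
    return out
```

Usage would be something like `classifications_for_revision("mozilla-inbound", "7380457b8ba0")`.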
For querying the Treeherder API, there is a good thin wrapper library called th-client:

The two APIs that Joel mentioned above are present there [1], [2].

[1] -
[2] -
Joel's urls are good ones there.  Though you can also do filtering on some of them that will help a bit.  For instance, to find only the classified jobs for the resultset, you can use:


failed (classified or not),testfailed,exception

If you needed the content of what the classification was, you'd need to query the /notes/ endpoint, as Joel says.  

That being said, if we want to do this a lot, we can create an endpoint to give you this data without having to do all these calls.  Please file a bug with the requirements and we'll make it for you.  :)
Flags: needinfo?(cdawson)
Also: These are the fields in various tables that support filtering:

And here are the filter types:

Hmm..  Seems like we should document this, huh?  :)
I am attempting to make a script that will take failures in ActiveData, and find the matching notes in TreeHerder.  I am having difficulty finding the note "fixed by commit 5e803cc779dc from fx-team"  for Joel's example in comment 3.

In the meantime, for my own notes, here is the query to get some failures:
> {
> 	"from":"unittest",
> 	"select":[
> 		{"name":"branch","value":"build.branch"},
> 		{"name":"revision","value":"build.revision12"},
> 		{"name":"suite","value":"run.suite"},
> 		{"name":"chunk","value":"run.chunk"},
> 		{"name":"test","value":"result.test"}
> 	],
> 	"where":{"and":[
> 		{"eq":{"result.ok":false}},
> 		{"gt":{"run.timestamp":"{{today-day}}"}}
> 	]},
> 	"format":"list",
> 	"limit":10
> }

and here is the (probably wrong) code that attempts to pull the notes:
Not sure I understand the code to get the notes, but it looks like it is working.  As a reminder, it won't say "fixed by commit <rev>"; the note will just be the <rev>.

Keep in mind that sometimes the notes can contain more than one revision, so treat the notes field as a comma-separated list of revisions.  A few instances show a bug number, random text (e.g. various backouts, an upstream l10n issue, etc.), or a full link to a revision.  Even random things show up; it is lovely, but 95% of the time it is just a single <rev>.
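Since the notes field is free-form, pulling revisions out of it takes a regex; here is a small sketch (the 12-to-40-character hex heuristic for hg changeset hashes is my assumption):

```python
import re

# Short hg changeset hashes are 12 hex chars; full hashes are 40.
REV_RE = re.compile(r"\b[0-9a-f]{12,40}\b")

def revisions_from_note(note):
    """Extract candidate revision hashes from a free-form note string.

    Treats the note as loose text: the regex also catches revisions
    buried inside URLs or surrounding prose.
    """
    return REV_RE.findall(note.lower())

print(revisions_from_note("5e803cc779dc, 17eb57fa30c7"))
# Bug numbers and random text produce no matches:
print(revisions_from_note("upstream l10n issue, bug 1034290"))
```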
This sounds like a feature request for treeherder to ensure that there's unambiguous data we can use here?

Thanks for this attachment:  I was not sure what I was seeing in the notes until you explained it further.

Could you show me the code you used to pull that list of notes?
Flags: needinfo?(jmaher)
Kyle, the code is in this file:

It is fairly easy to follow, but then again I am sort of an expert on that file; maybe it could use more explanation.
Flags: needinfo?(jmaher)
Just to confirm, you got the list of notes after visiting many results, not by some neato web API.  Yes?
Flags: needinfo?(jmaher)
Correct: whenever I have a job result with a non-success status, I query for the notes.
Flags: needinfo?(jmaher)

Some work has been done on collecting the "notes" (aka causes) and "stars" (intermittents) from Treeherder [1], and then marking up the tests in ActiveData [2].  Roughly, the steps are:

* look up the revision to find the result_set_id
* look up the result_set_id to get the list of job_ids
* use the bug-job-map to find intermittent markup [3] by bug number
* look up the notes for the list of job_ids
* use a regex to pull the revision from the notes
* use the revision to pull the hg log
* use a regex to inspect the hg log message for the bug number
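The final regex step, pulling a bug number out of an hg log message, might look like this sketch (the `Bug NNNNNN - ...` commit-message convention is the assumption here):

```python
import re

# Matches "Bug 1034290" or "bug 1034290" anywhere in a commit message.
BUG_RE = re.compile(r"\bbug\s+(\d+)", re.IGNORECASE)

def bug_from_hg_log(message):
    """Return the first bug number mentioned in a commit message, or None."""
    m = BUG_RE.search(message)
    return int(m.group(1)) if m else None

print(bug_from_hg_log("Bug 1034290 - Use structured logging in reftest; r=jgraham"))
```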

These steps are compounded by the fact that they can be done "late" by the sheriffs, which requires multiple passes to capture.

Even at this point, the match is incomplete.  We know the markup the sheriffs have applied to the job, but not the specific test in that job.  Inside a job there can be multiple test failures and multiple reasons for failure.  We should assign the markup to the one test failure being blamed, both for the sake of accuracy (avoiding double counting) and to verify that the ActiveData data path and the Treeherder data path match well.

In general, the complexity of this problem is borne of comparing two sophisticated systems and understanding their differences well enough to perform a reconciliation, even though they talk at different levels of granularity.

There is also an issue of scale: with over 300K test failures a month, the above process will put (possibly significant) load on the Treeherder service.

I require guidance on deciding the next direction: 

1) Stop work on this feature for now: wait for Treeherder enhancements that will allow it to "talk" about individual test failures and that formalize the "causes" found in the `notes`.
2) Mark up ActiveData with the slightly mismatched Treeherder markup and get a feel for how accurate it already is.  It may be usable despite having known problems.  This runs the risk of resulting in pure junk.
3) Continue with the reconciliation and find the best individual test failure that deserves the Treeherder markup.  Then mark up ActiveData and write queries to understand the impact.  This work will be redone once Treeherder can talk in terms of individual test failures.

[3] The bug-job-map does not contain the specific comment generated; we must still load the Bugzilla comments and find the matching revision to establish the exact test/reason in the job that is getting the blame.
I think this risks overlapping the work we intend to do this quarter with automatic starring.  For Joel's purposes, it seems like it might be enough to identify the first test file which fails in a job, and its directory; I'm not sure whether we actually need more detailed information here.  Maybe Joel can comment.
There is a danger of overlap with automatic starring.  I need to fully understand its scope (almost there) and work towards what we can do in the next 3-6 months to ensure SETA sets us up for success going forward.

If this is too complex a system, I could easily take my data and, when we find a root-cause failure (we know the test job that failed and caused a backout), just download that log and parse it to collect all related failures, assuming that every failure in there is an issue.  In fact, I could probably build something on top of structured logs and do this quickly.  There will be the same issues that Kyle mentions above.
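Parsing a structured (mozlog) log for failures could be sketched like this; the field names follow the mozlog JSON format as I understand it (unexpected results carry an `expected` key), so treat the details as assumptions:

```python
import json

def failures_from_structured_log(lines):
    """Collect (test, status) pairs for unexpected results in a mozlog log.

    Each line is a JSON object; 'test_status' and 'test_end' records carry
    an 'expected' key only when the result differed from what was expected.
    """
    failures = []
    for line in lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # skip non-JSON noise in the log
        if rec.get("action") in ("test_status", "test_end") and "expected" in rec:
            failures.append((rec.get("test"), rec.get("status")))
    return failures

log = [
    '{"action": "test_start", "test": "868504.html"}',
    '{"action": "test_end", "test": "868504.html", "status": "CRASH", "expected": "PASS"}',
]
print(failures_from_structured_log(log))
```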

The main difference of SETA vs failures in general is that SETA only deals with failures that are regressions linked directly to a push in the recent pushlog.  

Let me work with jgraham this week to understand auto-starring in more detail (it sounds like it has a few stages) so I can understand what pieces would remain to do here.
Severity: normal → S3