Closed Bug 1139664 Opened 9 years ago Closed 9 years ago

Treeherder shows 3 jobs when 4 were run

Categories

(Tree Management :: Treeherder: Data Ingestion, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1093743

People

(Reporter: armenzg, Unassigned)

Details

If we load this we get 3 jobs:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=7ad48d9fb454&tochange=7ad48d9fb454&filter-searchStr=Windows XP 32-bit mozilla-inbound talos g1&filter-resultStatus=success&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=coalesced

If we look for "Windows XP 32-bit mozilla-inbound talos g1" in buildapi we will see 4 jobs (3 triggered and 1 run naturally):
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/7ad48d9fb454

In ci tools I wrote a script that shows the logic for finding the buildapi scheduling information and getting from there to the status info, which leads me to the logs [1].
It seems that two of the jobs shared the same buildid (?) in the ftp url: 1425491297.

Here's the commit:
https://github.com/armenzg/mozilla_ci_tools/commit/e7caaadc46a6483ef663fbd521c2ad30fd9875a9

If we would like to verify the integrity of the treeherder data we could work on a script for Q2.

[1]
armenzg@armenzg-thinkpad:~/repos/ci_tools$ python scripts/misc/find_logs_for_jobs.py -b "Windows XP 32-bit mozilla-inbound talos g1" -r 7ad48d9fb454
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1425491299/mozilla-inbound_xp-ix_test-g1-bm111-tests1-windows-build107.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1425491297/mozilla-inbound_xp-ix_test-g1-bm110-tests1-windows-build184.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1425491297/mozilla-inbound_xp-ix_test-g1-bm112-tests1-windows-build139.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1425401506/mozilla-inbound_xp-ix_test-g1-bm110-tests1-windows-build178.txt.gz
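
A quick way to spot which jobs share a directory on ftp is to group the log urls above by the numeric path component that precedes the log name. This is only a sketch: `group_by_build_dir` is a hypothetical helper, not part of ci tools, and it assumes the directory is always the second-to-last path segment, as it is in the four urls above.

```python
from collections import defaultdict
from urllib.parse import urlparse

# The four log urls printed by find_logs_for_jobs.py above.
BASE = ("http://ftp.mozilla.org/pub/mozilla.org/firefox/"
        "tinderbox-builds/mozilla-inbound-win32/")
LOG_URLS = [
    BASE + "1425491299/mozilla-inbound_xp-ix_test-g1-bm111-tests1-windows-build107.txt.gz",
    BASE + "1425491297/mozilla-inbound_xp-ix_test-g1-bm110-tests1-windows-build184.txt.gz",
    BASE + "1425491297/mozilla-inbound_xp-ix_test-g1-bm112-tests1-windows-build139.txt.gz",
    BASE + "1425401506/mozilla-inbound_xp-ix_test-g1-bm110-tests1-windows-build178.txt.gz",
]


def group_by_build_dir(urls):
    """Map each build directory to the log file names found under it.

    Hypothetical helper: assumes the path ends in <build dir>/<log name>.
    """
    groups = defaultdict(list)
    for url in urls:
        parts = urlparse(url).path.strip("/").split("/")
        build_dir, log_name = parts[-2], parts[-1]
        groups[build_dir].append(log_name)
    return dict(groups)


groups = group_by_build_dir(LOG_URLS)
# Keep only the directories that more than one job uploaded to.
shared = {d: logs for d, logs in groups.items() if len(logs) > 1}
for build_dir, logs in shared.items():
    print(build_dir, len(logs))
```

With the urls above, only 1425491297 holds more than one log.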
the treeherder view:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=7ad48d9fb454

I had triggered some talos g1 jobs for winxp, but if you look for winxp in general, there are no other test results showing at all. However, we have a bunch of 11 minute jobs in:
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/7ad48d9fb454

I suspect those 11 minute jobs are coalesced, but I am not sure.
self-serve does indeed show coalesced jobs as though they ran on all the revisions which were coalesced, so that it can Do The Right Thing if you retrigger them, and run them on the skipped revisions.

The only way I know of to determine that a job shown in self-serve was actually coalesced (beyond the obvious way: trusting that treeherder is telling the truth when it does not show it) is to look at the start time for the job, then look in treeherder for the next revision above the one you are interested in that actually shows the job, and check that that job started at the start time self-serve claims for the earlier revision. And then, if you really don't trust anything, look at the log for that job and verify which build and which tests.zip it downloaded, since you wouldn't want to just trust that treeherder is showing it in the right place.
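
The start-time check described above could be sketched like this. It is only an illustration under assumptions: `find_real_revision` is a hypothetical helper, and the caller is assumed to have already collected, by hand or otherwise, the (revision, start time) pairs for the same builder on the pushes above the one being checked, ordered oldest first. Nothing here queries self-serve or treeherder.

```python
def find_real_revision(claimed_start, later_jobs):
    """Return the revision the job really ran on, or None.

    claimed_start: the start time (epoch seconds) self-serve reports for
        the job on the earlier revision.
    later_jobs: (revision, start_time) pairs for the same builder on later
        pushes, oldest first (assumed to be gathered by the caller).
    """
    for revision, start_time in later_jobs:
        if start_time == claimed_start:
            # Same start time on a later push: the earlier request was
            # coalesced into this job.
            return revision
    return None


# Sample values: a start time of 1425406414 claimed on 7ad48d9fb454, and a
# job with that same start time shown on 9daf2b5ad7f8.
print(find_real_revision(1425406414, [("9daf2b5ad7f8", 1425406414)]))
```

A None result only means no later job matched in the list that was passed in; it does not prove the job ran on the earlier revision.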
I believe we should:
(a) make buildapi continue to show these jobs, but actually display them as result=coalesced, or similar, since at the moment it flat out lies.
(b) fix bug 1093743, since there are a couple of issues with our logic there (though that leads to duplicate jobs in treeherder, not fewer)
(c) fix the _major_ builds-4hr inconsistencies I've found during analysis of the file recently (need to file a releng bug and write that up)
I think I figured out how to tackle this (at least in mozci).

Taking the buildapi data, I can cross-reference it with the buildjson data.
If a request is fulfilled by a job that ran on a different revision, that is all it takes to determine that it was coalesced. The same logic applies to builds-4hr.

This job is the one that was coalesced:
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/build/64476102

The request id and revision are:
(63329168, 7ad48d9fb454b0fbe503b38683b6bca9c955f34a)
The buildjson entry that fulfils it is composed of:
([63329168, 63329766], 9daf2b5ad7f8e43ebb74ce3cc695b6210c1d0eae)

I should write a script tomorrow to test this theory.
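
A minimal sketch of that theory, assuming the buildjson entry has the shape shown in [1] below (a top-level `request_ids` list and a `properties.revision` string). This is not the mozci implementation; `is_coalesced` is a hypothetical name, and the prefix comparison is an assumption made so that 12-character and 40-character hashes can be mixed.

```python
def is_coalesced(request_id, requested_revision, job):
    """Return True if `job` fulfilled the request on a different revision.

    `job` is assumed to be a buildjson-style dict with `request_ids` and
    `properties["revision"]`; only those two keys are read.
    """
    if request_id not in job["request_ids"]:
        raise ValueError("job does not fulfil request %s" % request_id)
    job_revision = job["properties"]["revision"]
    # Compare on the shared prefix so short and long hashes both work.
    n = min(len(requested_revision), len(job_revision))
    return requested_revision[:n] != job_revision[:n]


# Trimmed-down copy of the buildjson entry, keeping only the keys used here.
job = {
    "request_ids": [63329168, 63329766],
    "properties": {"revision": "9daf2b5ad7f8e43ebb74ce3cc695b6210c1d0eae"},
}

print(is_coalesced(63329168, "7ad48d9fb454b0fbe503b38683b6bca9c955f34a", job))
```

A request made for the same revision the job ran on would come back False, which is the non-coalesced case.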

[1]
        "request_ids": [
          63329168, 
          63329766
        ], 
        "request_times": {
          "63329168": 1425405222, 
          "63329766": 1425405494
        }, 
        "revision": "9daf2b5ad7f8e43ebb74ce3cc695b6210c1d0eae", 
        "scheduler": "tests-mozilla-inbound-win32-talos", 
        "script_repo_revision": "21f00dd7bda6", 
        "script_repo_url": "https://hg.mozilla.org/build/mozharness", 
        "slavebuilddir": "test", 
        "slavename": "t-xp32-ix-015", 
        "stage_platform": "win32"
      }, 
      "reason": "scheduler", 
      "request_ids": [
        63329168, 
        63329766
      ], 
      "requesttime": 1425405222,
In mozci, I can now differentiate the coalesced jobs from successful jobs:
https://github.com/armenzg/mozilla_ci_tools/commit/fed7a85b660169e9b6cb82513fc87874c520beca

I've written a script that properly counts how many jobs were coalesced. [1]

BTW, is treeherder showing the coalesced job on the wrong revision?
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=7ad48d9fb454&tochange=9daf2b5ad7f8&filter-searchStr=Windows+XP+32-bit+mozilla-inbound+talos+g1&filter-resultStatus=success&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=coalesced

It is showing the coalesced job for 9daf2b5ad7f8 instead of 7ad48d9fb454.


[1]
armenzg@armenzg-thinkpad:~/repos/ci_tools$ python scripts/misc/find_status_for_jobs.py -b "Windows XP 32-bit mozilla-inbound talos g1" -r 7ad48d9fb454
63329168 coalesced https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/build/64476102
{u'builder_id': 372220,
 u'buildnumber': 178,
 u'endtime': 1425408532,
 u'id': 60290835,
 u'master_id': 181,
 u'properties': {u'basedir': u'C:\\slave\\test',
                 u'branch': u'mozilla-inbound',
                 u'build_url': u'https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1425401506/firefox-39.0a1.en-US.win32.zip',
                 u'builddir': u'mozilla-inbound_xp-ix_test-g1',
                 u'buildername': u'Windows XP 32-bit mozilla-inbound talos g1',
                 u'buildid': u'20150303085146',
                 u'buildnumber': 178,
                 u'builduid': u'80df32318ed949a997c970b04bc6d878',
                 u'log_url': u'http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1425401506/mozilla-inbound_xp-ix_test-g1-bm110-tests1-windows-build178.txt.gz',
                 u'master': u'http://buildbot-master110.bb.releng.scl3.mozilla.com:8201/',
                 u'pgo_build': u'False',
                 u'platform': u'xp-ix',
                 u'product': u'firefox',
                 u'project': u'',
                 u'repo_path': u'integration/mozilla-inbound',
                 u'repository': u'',
                 u'request_ids': [63329168, 63329766],
                 u'request_times': {u'63329168': 1425405222,
                                    u'63329766': 1425405494},
                 u'revision': u'9daf2b5ad7f8e43ebb74ce3cc695b6210c1d0eae',
                 u'scheduler': u'tests-mozilla-inbound-win32-talos',
                 u'script_repo_revision': u'21f00dd7bda6',
                 u'script_repo_url': u'https://hg.mozilla.org/build/mozharness',
                 u'slavebuilddir': u'test',
                 u'slavename': u't-xp32-ix-015',
                 u'stage_platform': u'win32'},
 u'reason': u'scheduler',
 u'request_ids': [63329168, 63329766],
 u'requesttime': 1425405222,
 u'result': 0,
 u'slave_id': 4622,
 u'starttime': 1425406414}
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=7ad48d9fb454&tochange=9daf2b5ad7f8&filter-searchStr=Windows+XP+32-bit+mozilla-inbound+talos+g1&filter-resultStatus=success&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=coalesced
Status of all jobs (success, pending, running, coalesced)
(4, 0, 0, 1)
Buildapi says that there are 4 jobs [1]
treeherder believes there are 6 jobs (2 coalesced) [2]
mozci believes there are 4 jobs [3]

[1]
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/9daf2b5ad7f8
[2]
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=7ad48d9fb454&tochange=9daf2b5ad7f8&filter-searchStr=Windows+XP+32-bit+mozilla-inbound+talos+g1&filter-resultStatus=success&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=coalesced
[3]
armenzg@armenzg-thinkpad:~/repos/ci_tools$ python scripts/misc/find_status_for_jobs.py -b "Windows XP 32-bit mozilla-inbound talos g1" -r 9daf2b5ad7f8
Status of all jobs (success, pending, running, coalesced)
(4, 0, 0, 0)
If anyone wants, I can modify the script to find out which revisions those 2 jobs should belong to.
It's expected some coalesced job data is incorrect at the moment, since we had to back out bug 1093743 for now.
Component: Treeherder → Treeherder: Data Ingestion
OS: Linux → All
Hardware: x86_64 → All
Should we dupe this?

Also catlee landed bug 1140612 today. Does it help?
(In reply to Armen Zambrano - Automation & Tools Engineer (:armenzg) from comment #9)
> Also catlee landed bug 1140612 today. Does it help?

I'll re-run my analysis script and see if it does.
Assignee: nobody → emorley
Let's handle this in bug 1093743.
Assignee: emorley → nobody
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE