Closed Bug 942616 Opened 11 years ago Closed 9 years ago

builds4hr.js.gz is missing attributes for some jobs (eg log_url)

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: catlee)

References

(Blocks 1 open bug)

Details

(Whiteboard: [treeherder][tbpl])

Attachments

(1 file)

That error is thrown here:
https://hg.mozilla.org/webtools/tbpl/file/1d44aff28ead/php/inc/RawGzLogDownloader.php#l23

$logURL comes from $run['log']:
https://hg.mozilla.org/webtools/tbpl/file/1d44aff28ead/php/inc/ParsedLogGenerator.php#l31

$run['log'] comes from:
https://hg.mozilla.org/webtools/tbpl/file/1d44aff28ead/php/inc/RunForLog.php#l13

Which is populated in the DB by:
https://hg.mozilla.org/webtools/tbpl/file/1d44aff28ead/dataimport/import-buildbot-data.py#l119

It looks like we check for the presence of 'log_url' in builds-4hr.js.gz and if not found, fall back to:
    log_path_try = Template("http://ftp.mozilla.org/pub/mozilla.org/$product/try-builds/$pusher-$rev/$branch_platform/$builder-build$buildnumber.txt.gz")
    log_path_other = Template("http://ftp.mozilla.org/pub/mozilla.org/$product/tinderbox-builds/$branch_platform/$buildid/$builder-build$buildnumber.txt.gz")

These URIs are now incorrect, so if builds-4hr.js.gz is missing log_url we fall back to an incorrect log location.

1) Why is builds4hr.js.gz missing log_url for this job / why do we have the fall-back in the first place?
2) If the fall-back is really necessary, we should fix log_path_other (and likely also log_path_try) to include the buildbot master (though that property is currently not handled by the script).
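For reference, a rough sketch of what that fallback amounts to (the Template string is the log_path_other quoted above; the helper function and its parameters are assumptions for illustration, not the script's actual code):

from string import Template

# Copied from the excerpt above: the guessed log location used when a job in
# builds-4hr has no 'log_url' property.
log_path_other = Template("http://ftp.mozilla.org/pub/mozilla.org/$product/tinderbox-builds/$branch_platform/$buildid/$builder-build$buildnumber.txt.gz")

def guess_log_url(props, branch_platform, builder, buildnumber):
    # Hypothetical helper: prefer the real log_url, otherwise fall back to the
    # guessed path -- which is the part that is now wrong, since it lacks the
    # buildbot master component mentioned in point 2 above.
    if 'log_url' in props:
        return props['log_url']
    return log_path_other.substitute(
        product=props['product'],
        branch_platform=branch_platform,
        buildid=props['buildid'],
        builder=builder,
        buildnumber=buildnumber,
    )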
TBPL link for the instance in comment 0:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=cf058c8a3b63&jobname=10.6.*debug.*mochitest-4

The builds json archive for the day that job ran:
https://secure.pub.build.mozilla.org/builddata/buildjson/builds-2013-11-24.js.gz

Searching for this job id (31019861) gives:
{
  "builder_id": 119768,
  "buildnumber": 1683,
  "endtime": 1385324440,
  "id": 31019861,
  "master_id": 88,
  "properties": {
    "basedir": "/builds/slave/talos-slave/test",
    "branch": "mozilla-inbound",
    "build_url": "http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1385320806/firefox-28.0a1.en-US.mac64.dmg",
    "buildername": "Rev4 MacOSX Snow Leopard 10.6 mozilla-inbound debug test mochitest-4",
    "buildid": "20131124112006",
    "buildnumber": 1683,
    "builduid": "324227a28313426b8f046a208c3ab023",
    "master": "http://buildbot-master76.srv.releng.use1.mozilla.com:8201/",
    "pgo_build": "False",
    "platform": "macosx64",
    "product": "firefox",
    "project": "",
    "repo_path": "integration/mozilla-inbound",
    "repository": "",
    "revision": "cf058c8a3b63",
    "scheduler": "tests-mozilla-inbound-snowleopard-debug-unittest",
    "script_repo_revision": "dcb709d56ac2"
  },
  "reason": "scheduler",
  "request_ids": [
    32732112
  ],
  "requesttime": 1385322126,
  "result": 1,
  "slave_id": 1509,
  "starttime": 1385322159
}

Compare this with another "Rev4 MacOSX Snow Leopard 10.6 mozilla-inbound debug test mochitest-4" job:
{
  "builder_id": 119768,
  "buildnumber": 1684,
  "endtime": 1385323955,
  "id": 31019624,
  "master_id": 88,
  "properties": {
    "basedir": "/builds/slave/talos-slave/test",
    "branch": "mozilla-inbound",
    "build_url": "http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1385321466/firefox-28.0a1.en-US.mac64.dmg",
    "buildername": "Rev4 MacOSX Snow Leopard 10.6 mozilla-inbound debug test mochitest-4",
    "buildid": "20131124113106",
    "buildnumber": 1684,
    "builduid": "b8227145ed0847fb9c72c812156355ec",
    "log_url": "http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1385321466/mozilla-inbound_snowleopard-debug_test-mochitest-4-bm76-tests1-macosx-build1684.txt.gz",
    "master": "http://buildbot-master76.srv.releng.use1.mozilla.com:8201/",
    "pgo_build": "False",
    "platform": "macosx64",
    "product": "firefox",
    "project": "",
    "repo_path": "integration/mozilla-inbound",
    "repository": "",
    "request_ids": [
      32732448
    ],
    "request_times": {
      "32732448": 1385323147
    },
    "revision": "185c12e7ba93",
    "scheduler": "tests-mozilla-inbound-snowleopard-debug-unittest",
    "script_repo_revision": "dcb709d56ac2",
    "slavebuilddir": "test",
    "slavename": "talos-r4-snow-036",
    "stage_platform": "macosx64",
    "symbols_url": "http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1385321466/firefox-28.0a1.en-US.mac64.crashreporter-symbols.zip"
  },
  "reason": "scheduler",
  "request_ids": [
    32732448
  ],
  "requesttime": 1385323147,
  "result": 0,
  "slave_id": 1507,
  "starttime": 1385323191
}

The following properties are missing:
* log_url
* request_ids
* request_times
* slavebuilddir
* slavename
* stage_platform
* symbols_url

Several of these appear to be set here:
http://mxr.mozilla.org/build-central/source/buildbotcustom/bin/postrun.py#305

Are we not hitting the |elif not options.statusdb_id| case when we should be?
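That list can be reproduced by diffing the two property sets; a throwaway sketch, with bad_job and good_job standing in for the two JSON objects quoted earlier:

def missing_properties(bad_job, good_job):
    # Keys present in the good job's properties but absent from the bad one.
    return sorted(set(good_job['properties']) - set(bad_job['properties']))

# For ids 31019861 (bad) and 31019624 (good) this returns:
# ['log_url', 'request_ids', 'request_times', 'slavebuilddir',
#  'slavename', 'stage_platform', 'symbols_url']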
Summary: Log URL guessing is no longer correct → import-buildbot-data.py's fallback for log_url being missing from builds4hr.js.gz is broken
Catlee, do you have any idea about this?

We've found jobs in builds4hr that are missing certain properties that are present in other jobs of the same type and tree...

(In reply to Ed Morley [:edmorley UTC+0] from comment #2)
> The following properties are missing:
> * log_url
> * request_ids
> * request_times
> * slavebuilddir
> * slavename
> * stage_platform
> * symbols_url


Possible cause..?

> Several of these appear to be set here:
> http://mxr.mozilla.org/build-central/source/buildbotcustom/bin/postrun.py#305
> 
> Are we not hitting the |elif not options.statusdb_id| case when we should be?

(In the past TBPL's tried to cope with a few inconsistencies itself, but many of those are now out of date - and I'd like to fix upstream if possible, to avoid the munging we do in https://hg.mozilla.org/webtools/tbpl/file/1d44aff28ead/dataimport/import-buildbot-data.py).
Flags: needinfo?(catlee)
Markus, I don't suppose you can remember why we have the log_url fallbacks in:
https://hg.mozilla.org/webtools/tbpl/file/1d44aff28ead/dataimport/import-buildbot-data.py#l119

Has builds4hr always not been reliable in terms of missing properties such as log_url?

(Context being that these fallbacks are now using the wrong URL, which I could fix, but I'd rather remove them and fix builds4hr itself, since there are multiple other consumers of it, including treeherder).
Flags: needinfo?(mstange)
Mauro, we're seeing some inconsistencies/missing properties in builds4hr.js (see comment 3 and comment 4) - have you seen any of these whilst using builds4hr for treeherder? TBPL does a bunch of munging of the builds4hr json; I'm guessing treeherder has to as well? If so, it would be good to upstream fixes to builds4hr...
Flags: needinfo?(mdoglio)
(In reply to Ed Morley [:edmorley UTC+0] from comment #4)
> Markus, I don't suppose you can remember why we have the log_url fallbacks
> in:
> https://hg.mozilla.org/webtools/tbpl/file/1d44aff28ead/dataimport/import-
> buildbot-data.py#l119

I know that I added the fallback code before there was a log_url field. When the log_url field was introduced, I kept the fallback, but I don't know why. It's possible that I've seen some runs without a log_url field at that time. But even if that was the case, this has probably changed since then... I recommend checking whether all runs on https://secure.pub.build.mozilla.org/builddata/buildjson/builds-2013-11-25.js.gz have a log_url field, and if they do, to get rid of the fallback path.
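A minimal sketch of that check (Python 2, matching the era of the TBPL scripts; it assumes the daily archive decodes to an object with a top-level "builds" list whose entries carry a "properties" dict, as in the excerpts above):

import gzip
import json
import urllib2
from StringIO import StringIO

URL = "https://secure.pub.build.mozilla.org/builddata/buildjson/builds-2013-11-25.js.gz"

data = urllib2.urlopen(URL).read()
builds = json.load(gzip.GzipFile(fileobj=StringIO(data)))["builds"]

missing = [b["id"] for b in builds if "log_url" not in b.get("properties", {})]
print "%d of %d builds have no log_url" % (len(missing), len(builds))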
Flags: needinfo?(mstange)
(In reply to Markus Stange [:mstange] from comment #6)
> I recommend checking whether all runs on
> https://secure.pub.build.mozilla.org/builddata/buildjson/builds-2013-11-25.
> js.gz have a log_url field, and if they do, to get rid of the fallback path.

Sadly some jobs are missing the field, which is why we're occasionally using the (now broken) fall-back. I was just wondering if the unreliability of builds4hr was new or not :-)
Oh, I see. I don't know.
From #ateam:

mdoglio: I know jeads created a few months ago some statistics on what we see missing on build4hr
mdoglio: http://people.mozilla.org/~jeads/missing_attributes_hist.txt
(In reply to Ed Morley [:edmorley UTC+0] from comment #5)
> Mauro, we're seeing some inconsistencies/missing properties in builds4hr.js
> (see comment 3 and comment 4) - have you seen any of these whilst using
> builds4hr for treeherder? TBPL does a bunch of munging of the builds4hr
> json, guessing treeherder might as well? If so, would be good to upstream
> fixes to builds4hr...

Please don’t tell me that treeherder is actually polling those json files.
(In reply to Arpad Borsos (Swatinem) from comment #10)
> Please don’t tell me that treeherder is actually polling those json files.

Sadly the release engineering changes that would have been needed to use pulse never saw any traction (off the top of my head https://bugzil.la/862595,862612,862630,862633), so sadly they do :-(
I also found this file that :jeads created, which is focused on the missing log_url attributes:
http://people.mozilla.org/~jeads/missing_attributes_build4h.js
Unfortunately I don't remember much context about how this data was collected; better to ask him.
Flags: needinfo?(mdoglio)
Flags: needinfo?(jeads)
Making this about fixing builds4hr - depending on the outcome of this, we can either remove the TBPL fall-backs or else tweak them.
Component: Tinderboxpushlog → General Automation
Product: Webtools → Release Engineering
QA Contact: catlee
Summary: import-buildbot-data.py's fallback for log_url being missing from builds4hr.js.gz is broken → builds4hr.js.gz is missing attributes for some jobs (eg log_url)
Version: Trunk → unspecified
Whiteboard: [treeherder][tbpl
Whiteboard: [treeherder][tbpl → [treeherder][tbpl]
(In reply to Ed Morley [:edmorley UTC+0] from comment #11)
> (In reply to Arpad Borsos (Swatinem) from comment #10)
> > Please don’t tell me that treeherder is actually polling those json files.
> 
> Sadly the release engineering changes that would have been needed to use
> pulse never saw any traction (off the top of my head
> https://bugzil.la/862595,862612,862630,862633), so sadly they do :-(

In that case, please don’t be offended by my obligatory facepalm and “I told you so”
(In reply to Ed Morley [:edmorley UTC+0] from comment #2)
> The following properties are missing:
> * log_url
> * request_ids
> * request_times
> * slavebuilddir
> * slavename
> * stage_platform
> * symbols_url

If I query the statusdb (which is where finished build data comes from) for build id 31019861 then I get:

log_url: "http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1385320806/mozilla-inbound_snowleopard-debug_test-mochitest-4-bm76-tests1-macosx-build1683.txt.gz"
request_ids: [32732112]
request_times: {"32732112": 1385322126}
slavebuilddir: "test"
slavename: "talos-r4-snow-034"
stage_platform: "macosx64"
symbols_url: "http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1385320806/firefox-28.0a1.en-US.mac64.crashreporter-symbols.zip"

We have redis as a cache between buildapi and the statusdb, and have seen redis get incomplete build data before, which it then holds on to. We don't directly know the cause of the issue, but network connectivity in AWS could be a factor (e.g. the build finishes and the initial insert into statusdb happens, then log upload and other post_run tasks get stalled, and in the meantime builds-4hr.js.gz is generated, populating the redis cache).

Perhaps the post_run tasks should be deleting the builds:<build_id> key in redis after they update the DB. I don't know if that will help if tbpl has already inserted the job into the tbpl db, and then ignores that build on future crons.
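A rough sketch of that idea, assuming the redis-py client and the builds:<build_id> key naming used above (where exactly this would hook into post_run, and the redis connection details, are assumptions):

import redis

def expire_cached_build(build_id, host='localhost', port=6379):
    # Hypothetical post_run hook: drop the cached entry once the statusdb row
    # has its full set of properties, so the next builds-4hr.js.gz generation
    # re-reads the build from the database instead of the stale cache.
    conn = redis.StrictRedis(host=host, port=port)
    conn.delete('builds:%d' % build_id)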
Flags: needinfo?(catlee)
(In reply to Nick Thomas [:nthomas] from comment #15)
> Perhaps the post_run tasks should be deleting the builds:<build_id> key in
> redis after they update the DB. I don't know if that will help if tbpl has
> already inserted the job in to the tbpl db, and then ignores that build on
> future crons.

We could make TBPL reject jobs for which attributes are missing, and thus when it does the next import, they won't be present in the TBPL DB, so they'll still get inserted.

However the root cause here seems to be that we're populating builds-4hr.js with incomplete data, which harms not only TBPL but other consumers. We should ideally fix the redis cache issues or failing that, make the script/query that generates the builds-4hr.js output exclude incomplete jobs :-)
That seems a little heavy; I'd still want to know the result asap even if the log isn't available immediately, rather than builds disappearing into the void. It would be more graceful if we fixed postrun to expire the redis key when it updates the statusdb; then builds-4hr.js.gz gets the new properties the next minute (recaching in redis), and tbpl picks that up.

In the cron for import-buildbot-data.py, do we pass -f so that the tbpl db is updated? If not, are there any performance implications in turning it on?
Flags: needinfo?(jeads)
(In reply to Ed Morley [:edmorley UTC+0] from comment #11)
> (In reply to Arpad Borsos (Swatinem) from comment #10)
> > Please don’t tell me that treeherder is actually polling those json files.
> 
> Sadly the release engineering changes that would have been needed to use
> pulse never saw any traction (off the top of my head
> https://bugzil.la/862595,862612,862630,862633), so sadly they do :-(

Doesn't the pulse data contain the same info as the build json? The bugs listed are enhancements that would certainly make things easier, but can't you use similar logic to handle the pulse data as you currently do for build json?

I suspect the root issue here is what Nick was describing: it's a race between when the build row is first inserted, when the build json is generated (and cached), and when the build properties are done being imported.
One idea Nick and I just discussed is to insert the initial row in the builds table with no start/end time, and only update those columns after all the properties have been inserted. This should prevent the build json report from seeing these builds until they're complete. Unknown what impact this will have on other consumers.
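A sketch of that flow against a generic DB-API cursor; the builds table and the starttime/endtime column names mirror the fields in the JSON above, but the exact schema and where these calls sit in postrun are assumptions:

def insert_build_placeholder(cursor, build_id, builder_id):
    # Insert the row with NULL start/end times so that report generators which
    # filter on those columns (builds-4hr.js.gz etc.) skip the build for now.
    cursor.execute(
        "INSERT INTO builds (id, builder_id, starttime, endtime) "
        "VALUES (%s, %s, NULL, NULL)",
        (build_id, builder_id))

def finalize_build(cursor, build_id, starttime, endtime):
    # Called only after all properties and steps have been inserted; from this
    # point on the build becomes visible to the report.
    cursor.execute(
        "UPDATE builds SET starttime = %s, endtime = %s WHERE id = %s",
        (starttime, endtime, build_id))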
(In reply to Chris AtLee [:catlee] from comment #19)
> One idea Nick and I just discussed is to insert the initial row in the
> builds table with no start/end time, and only update those columns after all
> the properties have been inserted. This should prevent the build json report
> from seeing these builds until they're complete. Unknown what impact this
> will have on other consumers.

If we can do that, sounds great to me :-)
Depends on: 1079867
Would bug 1096877 make this occur more often? Or is this issue purely due to the redis caching, rather than a race with report generation?

Would it be possible to find someone to work on this? It means that for some jobs, people are unable to view the jobs on Treeherder/TBPL.
Flags: needinfo?(catlee)
Yes, I think generating the report more often is likely to increase the frequency of these missing properties.
Flags: needinfo?(catlee)
Blocks: 1096877
No longer blocks: 1096877
This seems to work on a small test case. The idea is that we override the build start/end time to NULL so the builds don't get picked up by builds-4hr, etc. Once all the properties and steps are added, we update the build with the proper times.
Attachment #8601838 - Flags: review?(nthomas)
Comment on attachment 8601838 [details] [diff] [review]
properties-buildbotcustom.diff

Seems reasonable to me.
Attachment #8601838 - Flags: review?(nthomas) → review+
Attachment #8601838 - Flags: checked-in+
Ed, does this look good?
Flags: needinfo?(emorley)
It's hard to prove a negative, but from a quick glance it looks fixed :-)
Assignee: nobody → catlee
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(emorley)
Resolution: --- → FIXED
Component: General Automation → General