Closed Bug 1207428 Opened 4 years ago Closed 4 years ago

Intermittent B2G device builds cache issue

Categories

(Taskcluster :: General, defect)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: nigelb, Assigned: wcosta)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 2 obsolete files)

We've been seeing these errors constantly for a while now and it's probably worth investigating what's causing them.
Dave, I'm told you might know something about the device builds. We see this and other spurious failures on the device builds. A restart usually makes it come back green. Any idea what we could do to chase these down?
Flags: needinfo?(dave.hunt)
From a conversation with wcosta

11:42 < wcosta> nigelb: it feels like it is cache bustage
11:42 < nigelb> wcosta: ouch. Anything we can do about it?
11:42 < wcosta> we recently fixed this for gecko objdir, but out directory is more complicated, because the "out" dir is hardcoded in the build system
11:43 < wcosta> we would need to make per branch caches of the entire B2G repo, but it is too big for that
11:44 < nigelb> ouch :(
11:44 < wcosta> hrm
11:45 < wcosta> nigelb: we currently don't have a solution for this. garndt and I were discussing this a few weeks ago
11:46 < nigelb> so the big question is, shall I open a bug to star these failures against
11:46 < wcosta> nigelb: the only solution is not caching B2G at all
11:46 < nigelb> that way we have an idea of how many failures we have.
11:46 < nigelb> Otherwise, it's going to get starred as infra and go into a black hole of us not realizing what's going on.
11:46 < wcosta> nigelb: well, all of them fall under the same root cause
11:47 < wcosta> it is a shared TC/B2G bug for sure
11:47 < nigelb> Right, I can repurpose the existing bug for the other failures.
Flags: needinfo?(dave.hunt)
Summary: Intermittent make[6]: *** [/home/worker/objdir-gecko/objdir/addon-sdk/source/test/addons/.mkdir.done] Error 1 → Intermittent B2G device builds cache issue
Component: General Automation → General
Product: Release Engineering → Taskcluster
QA Contact: catlee
Duplicate of this bug: 1208729
Assignee: nobody → wcosta
Android binaries are stored in the "out" object directory. We need to
cache this directory per gecko branch. As the path to this directory is
hardcoded in the build system, we create a symbolic link inside B2G repo
to the volume mounted "out" dir.
Comment on attachment 8671515 [details] [diff] [review]
configure generic out dir per branch. r=garndt

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=531c3e28e11d
Attachment #8671515 - Flags: review?(garndt)
Status: NEW → ASSIGNED
Comment on attachment 8671515 [details] [diff] [review]
configure generic out dir per branch. r=garndt

Review of attachment 8671515 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/taskcluster/scripts/builder/build-emulator-x86.sh
@@ +34,5 @@
>  fi
>  
>  rm -rf $WORKSPACE/B2G/out/target/product/generic_x86/tests/
>  
> +gecko_objdir=/home/worker/objdir/gecko

I see that now /home/worker/objdir/ is volume mounted into the container.  Do we need to tell mozharness or the build system where to locate the objdir/gecko and objdir/out?

@@ +37,5 @@
>  
> +gecko_objdir=/home/worker/objdir/gecko
> +out_objdir=/home/worker/objdir/out
> +
> +if ! test -d out_objdir; then

Could you just call mkdir -p?  I don't think it errors out if the directory is already there.

@@ +43,5 @@
> +fi
> +
> +# Remove out/ dir from old builds
> +if [ -d $WORKSPACE/B2G/out -a ! -h $WORKSPACE/B2G/out ]; then
> +  rm -rf out

Should this be rm -rf $WORKSPACE/B2G/out?
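Taking the review points above together (mkdir -p instead of a directory test, and a full path for the rm), the symlink setup could look roughly like the sketch below. This is not the patch under review: the function name link_out_dir and its two arguments are hypothetical, standing in for $WORKSPACE and the volume-mounted out directory.

```shell
# Sketch only: link_out_dir and its arguments are hypothetical names,
# not taken from the attached patch.
#   $1: workspace root (plays the role of $WORKSPACE)
#   $2: volume-mounted, per-branch "out" directory
link_out_dir() {
  # mkdir -p succeeds whether or not the directory already exists.
  mkdir -p "$2"
  mkdir -p "$1/B2G"

  # Remove an out/ directory left by an old build, but only when it is
  # a real directory and not already a symlink.
  if [ -d "$1/B2G/out" ] && [ ! -L "$1/B2G/out" ]; then
    rm -rf "$1/B2G/out"
  fi

  # The build system hardcodes "out", so point it at the cached volume.
  if [ ! -L "$1/B2G/out" ]; then
    ln -s "$2" "$1/B2G/out"
  fi
}
```

Called as link_out_dir "$WORKSPACE" /home/worker/objdir/out, the function is safe to run repeatedly: a second run finds the symlink in place and changes nothing.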
Attachment #8671515 - Attachment is obsolete: true
Attachment #8671515 - Flags: review?(garndt)
Android binaries are stored in the "out" object directory. We need to
cache this directory per gecko branch. As the path to this directory is
hardcoded in the build system, we create a symbolic link inside B2G repo
to the volume mounted "out" dir.
Comment on attachment 8671843 [details] [diff] [review]
configure generic out dir per branch. r=garndt

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9e1e0eaccfa1
Attachment #8671843 - Flags: review?(garndt)
Blocks: 1144808
can you submit a graph for b2g-inbound or something other than try to see it work when it has caches?  Try doesn't cache any of the workspace or object directories.
Comment on attachment 8671843 [details] [diff] [review]
configure generic out dir per branch. r=garndt

Review of attachment 8671843 [details] [diff] [review]:
-----------------------------------------------------------------

You can flag me again for review once a graph has been submitted for a non-try branch, so we can see how it behaves.
Attachment #8671843 - Flags: review?(garndt)
(In reply to Greg Arndt [:garndt] from comment #34)
> can you submit a graph for b2g-inbound or something other than try to see it
> work when it has caches?  Try doesn't cache any of the workspace or object
> directories.

Hrm, this fails:

$ ./mach taskcluster-graph --head-repository=http://github.com/walac/gecko-dev --head-rev=bugz/1207428-out-cache --owner=wcosta@mozilla.com --project=b2g-inbound --message='try: -p all' | taskcluster run-graph
Taskgraph creation error Error: Internal Server Error
{
  "message": "Internal Server Error",
  "error": {
    "info": "Ask administrator to lookup incidentId in log-file",
    "incidentId": "3ab74467-a367-478f-8975-73737e2de504"
  }
}
Flags: needinfo?(garndt)
(In reply to Wander Lairson Costa [:wcosta] from comment #37)
> (In reply to Greg Arndt [:garndt] from comment #34)
> > can you submit a graph for b2g-inbound or something other than try to see it
> > work when it has caches?  Try doesn't cache any of the workspace or object
> > directories.
> 
> Hrm, this fails:
> 
> $ ./mach taskcluster-graph
> --head-repository=http://github.com/walac/gecko-dev
> --head-rev=bugz/1207428-out-cache --owner=wcosta@mozilla.com
> --project=b2g-inbound --message='try: -p all' | taskcluster run-graph
> Taskgraph creation error Error: Internal Server Error
> {
>   "message": "Internal Server Error",
>   "error": {
>     "info": "Ask administrator to lookup incidentId in log-file",
>     "incidentId": "3ab74467-a367-478f-8975-73737e2de504"
>   }
> }

Hasn't this always failed?  I seem to recall this is an ongoing issue.
Flags: needinfo?(garndt)
As we discussed on IRC, I am posting only phone builds.

https://tools.taskcluster.net/task-graph-inspector/#hB57K3EaSwOz5wTjbEV76w/
Attachment #8671843 - Flags: review?(garndt)
Attachment #8671843 - Attachment is obsolete: true
Attachment #8671843 - Flags: review?(garndt)
Per bug 1212587 comment 4, we will need to make the source and object directories branch specific.
See Also: → 1212587
We need to revisit the size of the entire workspace over time.  As it stands now, the flame-kk worker averages 14GB free and the emulators usually have 50GB free.  When I was last looking into this, an entire workspace could average around 20GB, so that bounds how many branches we might be able to have per machine.  Another solution is to just bite the bullet and make the emulator and phone workers branch specific: flame-kk-b2g-inbound, flame-kk-mozilla-central, etc.  Nasty, I know, but we're going to be pushing the limit of how much we can cache.  If we're removing caches a lot, then we lose all the benefit.
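To make the capacity concern concrete, here is a back-of-the-envelope sketch using the figures quoted above. It assumes the quoted free space is the whole budget available for additional per-branch workspaces and that every workspace is about 20GB; both are simplifications, and the function name is hypothetical.

```shell
# Sketch only: integer division of free space by average workspace size.
#   $1: free disk space in GB
#   $2: average workspace size in GB
workspaces_that_fit() {
  echo $(( $1 / $2 ))
}

workspaces_that_fit 50 20   # emulator workers: prints 2
workspaces_that_fit 14 20   # flame-kk workers: prints 0
```

Under these assumptions an emulator worker has room for about two more branch workspaces, and a flame-kk worker has room for none, which is why branch-specific worker types come up as an alternative.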
No longer blocks: 1144808
Comment on attachment 8687285 [details] [diff] [review]
Configure per branch object directory for nexus builds. r=garndt

https://treeherder.mozilla.org/#/jobs?repo=try&revision=72966d26c3f1
Attachment #8687285 - Flags: review?(garndt)
Keywords: leave-open
Comment on attachment 8687285 [details] [diff] [review]
Configure per branch object directory for nexus builds. r=garndt

Review of attachment 8687285 [details] [diff] [review]:
-----------------------------------------------------------------

This seems similar to what we did for other device builds.  I'm not sure how it knows to use the object directory cache now though.

Also, just to note, this isn't a complete fix but rather a bandaid that could hurt a lot later on.  We need to ultimately move to something that can do proper workspace caching.

::: testing/taskcluster/tasks/builds/b2g_nexus_4_eng.yml
@@ +6,5 @@
>  task:
>    workerType: flame-kk
>    scopes:
>      - 'docker-worker:cache:build-nexus-4-eng'
> +    - 'docker-worker:cache:build-nexus-4-eng-objdir-gecko-{{project}}'

So is there anything special to make use of this new objdir? I can't remember what we had to do for the other devices.  Also, shouldn't the "build-nexus-4-eng" cache point to "/home/worker/workspace" like aries does?
Attachment #8687285 - Flags: review?(garndt) → review+
(In reply to Greg Arndt [:garndt] from comment #47)
> Comment on attachment 8687285 [details] [diff] [review]
> Configure per branch object directory for nexus builds. r=garndt
> 
> Review of attachment 8687285 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This seems similar to what we did for other device builds.  I'm not sure how
> it knows to use the object directory cache now though.
> 

This is already configured in the build scripts.

> Also, just to note, this isn't a complete fix but rather a bandaid that
> could hurt a lot later on.  We need to ultimately move to something that can
> do proper workspace caching.
> 

I am leaving the bug open until we have the final fix.

> ::: testing/taskcluster/tasks/builds/b2g_nexus_4_eng.yml
> @@ +6,5 @@
> >  task:
> >    workerType: flame-kk
> >    scopes:
> >      - 'docker-worker:cache:build-nexus-4-eng'
> > +    - 'docker-worker:cache:build-nexus-4-eng-objdir-gecko-{{project}}'
> 
> So is there anything special to make use of this new objdir? I can't
> remember what we had to do for the other devices.  Also, shouldn't the
> "build-nexus-4-eng" cache point to "/home/worker/workspace" like aries does?

The directories are configured in the cache section https://dxr.mozilla.org/mozilla-central/source/testing/taskcluster/scripts/phone-builder/pre-build.sh#9
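For readers following the scope change quoted above, the {{project}} placeholder is what makes the objdir cache branch specific. A minimal sketch of the resulting names, assuming plain string substitution (the function name is hypothetical and this is not the task-graph templating code itself):

```shell
# Sketch only: mimics how the {{project}} placeholder in the scope
# 'docker-worker:cache:build-nexus-4-eng-objdir-gecko-{{project}}'
# yields a distinct cache name per branch.
objdir_cache_name() {
  printf 'build-nexus-4-eng-objdir-gecko-%s\n' "$1"
}

objdir_cache_name b2g-inbound       # build-nexus-4-eng-objdir-gecko-b2g-inbound
objdir_cache_name mozilla-central   # build-nexus-4-eng-objdir-gecko-mozilla-central
```

Because each branch resolves to a different cache name, an object directory populated by one branch is never reused by a build from another.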
Spoke to Wander about this on IRC.  So far it looks like the trend went down, but that could be our normal pattern leading into the weekend.  We're keeping track of these failures this week to hopefully see a permanent decrease.
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open