Closed Bug 1241624 Opened 10 years ago Closed 8 years ago

Failed Taskcluster task should not send updates to Balrog.

Categories

(Taskcluster :: Services, defect)

Platform: ARM
OS: Gonk (Firefox OS)
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: marcia, Unassigned)

References

Details

Attachments

(1 file)

Using yesterday's build, I am unable to perform an update to the next build.

Coming from this build:
Build ID 20160120143452
Gaia Revision efd70ba6a54849dcef696abf1652cf74daa07899
Gaia Date 2016-01-20 05:12:25
Gecko Revision https://hg.mozilla.org/mozilla-central/rev/6764bc656c1d146962d53710d734c2ac87c2306f
Gecko Version 46.0a1
Device Name aries
Firmware(Release) 4.4.2
Firmware(Incremental) eng.naoki.20151007.074137
Firmware Date Wed Oct 7 07:41:46 PDT 2015
Bootloader s1

Relevant part of the log:
I/Gecko ( 1866): ######################## extensions.js loaded
I/Gecko ( 317): *** AUS:SVC Creating Downloader
I/Gecko ( 317): *** AUS:SVC UpdateService:_downloadUpdate
I/Gecko ( 317): *** AUS:SVC readStringFromFile - file doesn't exist: /data/local/updates/0/update.status
I/Gecko ( 317): *** AUS:SVC readStatusFile - status: null, path: /data/local/updates/0/update.status
I/Gecko ( 317): *** AUS:SVC readStringFromFile - file doesn't exist: /data/local/updates/0/update.status
I/Gecko ( 317): *** AUS:SVC readStatusFile - status: null, path: /data/local/updates/0/update.status
I/Gecko:DumpUtils( 317): nsVolumeMountLock created for 'sdcard'
I/Gecko:DumpUtils( 317): nsVolume: sdcard state Mounted @ '/mnt/shell/emulated/0' gen 1 locked 1 fake 1 media 1 sharing 0 formatting 0 unmounting 0 removable 0 hotswappable 0
I/Gecko:DumpUtils( 317): nsVolumeMountLock acquired for 'sdcard' gen 1
I/AutoMounter( 317): UpdateState: ums:A1C0E0 mtp:A1C0E1 mode:3 usb:0 tryToShare:0 state:IDLE
I/AutoMounter( 317): UpdateState: Volume sdcard1 is Mounted and inserted @ /storage/sdcard1 gen 2 locked 0 sharing en-n
I/AutoMounter( 317): UpdateState: Volume sdcard is Mounted and inserted @ /mnt/shell/emulated/0 gen 1 locked 1 sharing x
I/Gecko ( 317): *** AUS:SVC Downloader:downloadUpdate - downloading from https://queue.taskcluster.net/v1/task/UC4qXxUZSnyaI07X__sXaA/runs/0/artifacts/public/build/b2g-aries-gecko-update.mar to /storage/emulated/legacy/updates/0/update.mar
I/GeckoDump( 317): XXX FIXME : Got a mozContentEvent: update-available-result
I/Gecko ( 317): *** AUS:SVC Downloader:onStartRequest - original URI spec: https://queue.taskcluster.net/v1/task/UC4qXxUZSnyaI07X__sXaA/runs/0/artifacts/public/build/b2g-aries-gecko-update.mar, final URI spec: https://queue.taskcluster.net/v1/task/UC4qXxUZSnyaI07X__sXaA/runs/0/artifacts/public/build/b2g-aries-gecko-update.mar
I/Gecko ( 317): *** AUS:SVC Downloader:onStopRequest - original URI spec: https://queue.taskcluster.net/v1/task/UC4qXxUZSnyaI07X__sXaA/runs/0/artifacts/public/build/b2g-aries-gecko-update.mar, final URI spec: https://queue.taskcluster.net/v1/task/UC4qXxUZSnyaI07X__sXaA/runs/0/artifacts/public/build/b2g-aries-gecko-update.mar, status: 2147549183
I/Gecko ( 317): *** AUS:SVC Downloader:onStopRequest - status: 2147549183, current fail: 0, max fail: 20, retryTimeout: 30000
I/Gecko ( 317): *** AUS:SVC Downloader:onStopRequest - non-verification failure
I/Gecko ( 317): *** AUS:SVC getStatusTextFromCode - transfer error: Failed (unknown reason), default code: 2152398849
I/Gecko ( 317): *** AUS:SVC getFileFromUpdateLink linkFile.path: /data/local/updates/0/update.link, link: /storage/emulated/legacy/updates/0/update.mar
I/Gecko:DumpUtils( 317): nsVolumeMountLock released for 'sdcard'
I/Gecko ( 317): *** AUS:SVC Downloader:onStopRequest - setting state to: download-failed
I/Gecko ( 317): XXX FIXME : Dispatch a mozChromeEvent: update-download-stopped
I/Gecko ( 317): *** AUS:SVC Downloader:onStopRequest - all update patch downloads failed
I/Gecko ( 317): UpdatePrompt: Update error, state: , errorCode: 0
I/Gecko ( 317): XXX FIXME : Dispatch a mozChromeEvent: update-error
I/Gecko ( 317): UpdatePrompt: Setting gecko.updateStatus: Failed (unknown reason)
I/Gecko:DumpUtils( 317): nsVolume: sdcard state Mounted @ '/mnt/shell/emulated/0' gen 1 locked 0 fake 1 media 1 sharing 0 formatting 0 unmounting 0 removable 0 hotswappable 0
I/AutoMounter( 317): UpdateState: ums:A1C0E0 mtp:A1C0E1 mode:3 usb:0 tryToShare:0 state:IDLE
https://tools.taskcluster.net/task-inspector/#UC4qXxUZSnyaI07X__sXaA/ is a nightly build that failed. I think we need to change things so that a failed build on Taskcluster does not report to Balrog.
Hey Wander, I think I'll need you to help point me in the right direction to fix this...
Flags: needinfo?(wcosta)
How do I figure out the result from taskcluster before it gets pushed to balrog? Can we do something like this? I haven't looked at that part of the code, so I'm not sure...
Oops. Redirecting to the more appropriate person. Jlund, how does taskcluster interact with balrog? Could I get a short synopsis please? I basically want to do an equivalent to: if(build_success)then{balrog};
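For illustration only, a minimal sketch (in Python, since the mozharness side is Python) of the gating I have in mind; run_build, submit_to_balrog and build.sh are hypothetical placeholders, not actual mozharness APIs:

import subprocess
import sys


def run_build():
    # Hypothetical build step: run the build command and return its exit code.
    return subprocess.call(["./build.sh"])


def submit_to_balrog():
    # Hypothetical Balrog submission step.
    print("submitting update metadata to Balrog")


if __name__ == "__main__":
    build_status = run_build()
    if build_status == 0:
        # Only tell Balrog about the build if it actually succeeded.
        submit_to_balrog()
    else:
        print("build failed with exit code %d; skipping Balrog submission"
              % build_status)
        sys.exit(build_status)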
Flags: needinfo?(wcosta) → needinfo?(jlund)
Sorry, I was redirected to jlund. I think this should be asked to garndt; looking at https://github.com/taskcluster/docker-worker/pull/77
Flags: needinfo?(garndt)
Product: Firefox OS → Taskcluster
Summary: [Aries]Update fails from 20160120143452 → Failed Taskcluster task should not send updates to Balrog.
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #5)
> Sorry, I was redirected to jlund. I think this should be asked to garndt;
> looking at https://github.com/taskcluster/docker-worker/pull/77

All that happens when that feature is enabled is that a container with a VPN connection to the datacenter is started, with a network flow opened to the Balrog server. Decisions about what should get updated, and when, are not made by that feature. It seems that whatever does the build should not perform the Balrog step if the build failed; that is handled within the task itself.
Flags: needinfo?(garndt)
Flags: needinfo?(jlund) → needinfo?(garndt)
No. The code you're looking at does not work that way. There are two stages to this "feature" in the worker: link and killed. In the link stage, this code does nothing more than check that your task is allowed to use the feature, then start a particular docker container and link it to your task container under the name "balrog". The killed stage just stops the container. There is no logic in this code that handles updating Balrog; it only provides a network link. This code makes no judgement about what is running inside the task container. By the time the worker knows that the task did not complete successfully, it is already too late: the code running inside the container will already have contacted Balrog. The harness that builds and sends updates should handle not updating Balrog when the build step fails. That is all within the task container and the harness being used, not the proxy.
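To make that concrete, here is a minimal sketch of how task-side code might reach Balrog through the linked container. The hostname "balrog" is the link name described above; the endpoint path and payload are made-up placeholders, not Balrog's real API:

import json
import urllib.request


def submit_release(payload):
    # Runs inside the task container; the proxy link only provides the route,
    # so this request goes out regardless of how the task later finishes.
    req = urllib.request.Request(
        "http://balrog/api/releases",  # hypothetical path, reached via the proxy link
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.getcode()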
Flags: needinfo?(garndt)
Oh. I see what you're saying now. :| Ok. So I need to figure out how to get the task to know if the task itself failed before contacting balrog...
Still not sure where the magic happens; I think I need to know where balrog/docker-worker.py is?
Flags: needinfo?(wcosta)
wcosta replied: gecko-dev/testing/mozharness/configs/balrog. Not sure why mxr didn't find it...
Flags: needinfo?(wcosta)
gecko-dev/testing/mozharness/scripts/b2g-build.py seems to have some magic to it.
Attached patch: Github patch (Splinter Review)
Not 100% sure if this will work at all; looking for feedback first before I fix up the script's syntax. Basically I want a build error code returned from the Python script and passed to the post-build script, with the push out to Balrog happening afterwards in the post-build script.
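A rough sketch of the shape I'm aiming for, assuming a hypothetical post-build wrapper; the script names here are placeholders, not the real mozharness entry points:

# post_build.py -- hypothetical wrapper, not the real mozharness script; it
# only illustrates "push to Balrog in the post-build step only when the build
# script exited 0".
import subprocess
import sys


def main():
    # Run the (placeholder) build script and capture its exit code.
    build_rc = subprocess.call([sys.executable, "b2g_build.py"])
    if build_rc != 0:
        print("build step failed with exit code %d; not pushing to Balrog" % build_rc)
        return build_rc
    # Only reached when the build succeeded.
    return subprocess.call([sys.executable, "submit_to_balrog.py"])


if __name__ == "__main__":
    sys.exit(main())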
Attachment #8712444 - Flags: feedback?(wcosta)
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #13)
> Created attachment 8712444 [details] [diff] [review]
> Github patch
>
> Not 100% sure if this will work at all; looking for feedback first before I
> fix up the script's syntax.
>
> Basically I want a build error code returned from the Python script and
> passed to the post-build script, with the push out to Balrog happening
> afterwards in the post-build script.

This is not going to fix the bug either. This is what happened:

1) The task started and called mozharness.
2) Mozharness built the image and uploaded update data to Balrog.
3) *After* the build finished and the files were copied to the artifacts directories, docker-worker tried to upload the mar file to S3 but got a connection reset failure.
4) The phone issued an update request to Balrog.
5) Balrog sent back the update information with the link to the mar file.
6) The phone tried to download the mar file, but because the artifact upload had failed earlier, the phone probably got a 404.
7) The update failed.

There is only one way to prevent this from happening: building the image and updating Balrog should be two different tasks. This has been discussed in releng in the past, but I have no idea what the state of it is now; catlee probably knows better.

Greg, is it possible to check whether retry upload has been deployed on this worker? If so, I wonder whether the retry failed every time, or whether there is a bug where the upload doesn't retry on a connection reset error.
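On the retry question: docker-worker itself isn't written in Python, so the following is only a language-neutral sketch of retrying an upload with backoff on connection errors; upload_with_retry and upload_fn are hypothetical names, not docker-worker code:

import time


def upload_with_retry(upload_fn, max_attempts=5, initial_delay=2.0):
    # upload_fn is a placeholder for whatever actually pushes the artifact to
    # S3; it is expected to raise OSError (e.g. ConnectionResetError) on
    # network failures. Retry with exponential backoff, re-raising the error
    # on the final attempt.
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return upload_fn()
        except OSError as exc:
            if attempt == max_attempts:
                raise
            print("upload attempt %d failed (%s); retrying in %.0fs"
                  % (attempt, exc, delay))
            time.sleep(delay)
            delay *= 2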
Flags: needinfo?(garndt)
Attachment #8712444 - Flags: feedback?(wcosta) → feedback-
Hrm, I don't think retry upload has been deployed to the balrog workers; there was an issue with scope checking for the balrog feature that I need to look into before we upgrade those workers. (I'm trying to remember what the issue was that prevented it.)
Flags: needinfo?(garndt)
Component: General → Integration
Component: Integration → Platform and Services
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Component: Platform and Services → Services