Closed Bug 1345071 Opened 8 years ago Closed 8 years ago

CLOSED TREES for Taskcluster Problems on integration and aurora trees

Categories

(Taskcluster :: General, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Unassigned)

References

Details

Taskcluster seems to be running into problems currently. We first encountered problems in nightlies with Bug 1345066, and now:

a) Decision tasks on autoland and inbound are started twice.

b) Live log URLs go to a local IP like https://gzbsawyaaaavvkdmqrgveamgolg26yxji4tqrw3y5zvkrb5e.taskcluster-worker.net:32805/log/bOXwwPYrTwGDU4R-LClhVQ

jhford also mentioned issues he is seeing, so something is going on. Closing the integration + try trees for investigation of all this.
Seems we have stabilized, so reopening the closed trees but leaving this bug open for investigation.
Severity: blocker → normal
From jhford (posting here since he has Bugzilla problems):

The issue seems to be that there was an SSL problem in the uploads. The failure occurred while doing uploads. The SSL cert for the queue works in Firefox. I verified that cloud-mirror has a valid cert even though it is not involved in the upload chain at all. The error messages that I was seeing:

> [taskcluster:error] Error uploading "public/build/host/bin/mar" artifact. timeout of 30000ms exceeded
> [taskcluster:error] Error uploading "public/build/target.cppunittest.tests.zip" artifact. timeout of 30000ms exceeded
> [taskcluster:error] Error calling 'stopped' for artifactHandler : Encountered error when uploading artifact(s)
> [taskcluster 2017-03-07 11:09:02.665Z] Unsuccessful task run with exit code: -1 completed in 2141.379 seconds

Without more information in the debugging logs for the job, I can't tell exactly what timed out. I suspect it was the S3 endpoint. Given that it seems to be working again, I think this was a momentary outage that wasn't big enough to trigger a warning on the AWS status site.
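The comment above reasons from the 30-second artifact-upload timeouts and the apparently valid queue certificate. Below is a minimal diagnostic sketch along those lines, assuming Python, the public taskcluster.net hostnames of the time, and a hypothetical pre-signed S3 URL; this is not the worker's actual upload code.

# Diagnostic sketch, not the worker's upload code. Hostnames are assumed to be the
# public TaskCluster endpoints of the time; the signed PUT URL is hypothetical.
import socket
import ssl
import time
import urllib.request

def cert_not_after(host, port=443, timeout=10):
    """TLS-handshake to the host and return the peer certificate's notAfter date."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert().get("notAfter")

print("queue cert expires:", cert_not_after("queue.taskcluster.net"))
print("cloud-mirror cert expires:", cert_not_after("cloud-mirror.taskcluster.net"))

# Time an upload with the same 30 s budget the worker reported, so a slow S3
# endpoint shows up as a clear client-side timeout.
signed_put_url = "https://example-bucket.s3.amazonaws.com/placeholder"  # hypothetical
req = urllib.request.Request(signed_put_url, data=b"probe", method="PUT")
start = time.time()
try:
    urllib.request.urlopen(req, timeout=30)
    print("upload probe succeeded in %.1fs" % (time.time() - start))
except Exception as exc:
    print("upload probe failed after %.1fs: %s" % (time.time() - start, exc))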
(In reply to Carsten Book [:Tomcat] from comment #0)
> Taskcluster seems to run into problems currently.
>
> We first encountered problems in nightlys with Bug 1345066 and now
>
> a) Decision tasks on autoland and inbound are started twice

They are being restarted due to failure, I think (at least from the rev in that bug).

> b) live log urls go to a local ip like
> https://gzbsawyaaaavvkdmqrgveamgolg26yxji4tqrw3y5zvkrb5e.taskcluster-worker.net:32805/log/bOXwwPYrTwGDU4R-LClhVQ

This is normal for running tasks.

Looking at the decision task failures:

[task 2017-03-07T08:50:05.897192Z] "PUT /queue/v1/task/dvkrOhASSAK0RYfeS6s7PQ HTTP/1.1" 200 378
[task 2017-03-07T09:04:30.756152Z] "PUT /queue/v1/task/Y2iohALXQBudfZttuNIVVQ HTTP/1.1" 400 10310

and that 400 was due, basically, to the PUT taking too long (more than 15 minutes). This operation does not depend on S3, but does depend on the network between the Heroku ELBs and the workers. The result is the same, though: something failed in one of our providers' systems, and is now resolved.
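For the decision-task side, here is a small sketch of how a client might guard the slow createTask PUT, assuming Python, the public queue endpoint of the era, and that createTask is idempotent for an identical taskId/payload; auth headers are omitted, so this is illustrative only, not the decision task's actual code.

# Sketch only: retry the createTask PUT with a client deadline well under the
# ~15-minute limit described above, so a hung connection fails fast and is retried.
# Assumes createTask is idempotent for an identical taskId/payload; auth omitted.
import json
import time
import urllib.request

QUEUE = "https://queue.taskcluster.net/v1"  # public queue endpoint at the time (assumed)

def create_task(task_id, task_def, attempts=3, deadline=120):
    body = json.dumps(task_def).encode()
    for attempt in range(1, attempts + 1):
        req = urllib.request.Request(
            "%s/task/%s" % (QUEUE, task_id),
            data=body,
            method="PUT",
            headers={"Content-Type": "application/json"},
        )
        start = time.time()
        try:
            with urllib.request.urlopen(req, timeout=deadline) as resp:
                return resp.status
        except Exception as exc:
            print("attempt %d failed after %.1fs: %s" % (attempt, time.time() - start, exc))
            time.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError("createTask did not succeed after %d attempts" % attempts)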
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED