Closed Bug 1285127 Opened 9 years ago Closed 9 years ago

Crash summary job for 2nd July is failing

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: rvitillo)

Details

User Story

It seems that any Scala-based job run against the data for 2nd July eventually times out. I looked at the logs and found the following:

- A bunch of warnings like the following: WARN heka.Dataset$: Failure to read file telemetry-2/20160702/telemetry/4/main/Firefox/nightly/41.0a1/20150603030208/20160702112128.029_ip-172-31-5-69: Unable to execute HTTP request: Timeout waiting for connection from pool

- The thread dump of a running executor shows that it's waiting (TIMED_WAITING) at com.mozilla.telemetry.utils.S3Store$.getKey(S3.scala:43)

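Apparently Bucket.getObject takes a connection handler from a pool and doesn't release it, even when the returned object is garbage collected. Once the pool is exhausted, every further read has to wait for a free connection, which in turn causes tasks to wait for a very long time.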
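
To illustrate the failure mode described above, here is a minimal sketch in plain Java using only the standard library. It is not the code in S3.scala: Pool and Handle are hypothetical stand-ins for the HTTP connection pool and for the object returned by getObject, and the pool size of 2 and the 50 ms timeout are arbitrary values chosen for the demo. It shows that when a borrowed handle is dropped without being closed, the pool runs dry and later reads time out, whereas closing each handle lets every read succeed.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

final class PoolLeakSketch {
    /** Hypothetical stand-in for a fixed-size HTTP connection pool. */
    static final class Pool {
        private final Semaphore slots;

        Pool(int size) { slots = new Semaphore(size); }

        /** Borrow a connection; returns null if none frees up within the timeout. */
        Handle acquire(long timeoutMs) {
            try {
                return slots.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS)
                        ? new Handle(slots) : null;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }

        int available() { return slots.availablePermits(); }
    }

    /** Hypothetical stand-in for the fetched object: holds a pool slot until closed. */
    static final class Handle implements AutoCloseable {
        private final Semaphore slots;
        private boolean closed = false;

        Handle(Semaphore slots) { this.slots = slots; }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                slots.release(); // only an explicit close returns the slot
            }
        }
    }

    /** Leaky pattern: the handle is dropped without close(), so its slot never comes back. */
    static int leakyReads(Pool pool, int n) {
        int succeeded = 0;
        for (int i = 0; i < n; i++) {
            Handle h = pool.acquire(50);
            if (h != null) succeeded++;
        }
        return succeeded;
    }

    /** Safe pattern: try-with-resources closes the handle after every read. */
    static int safeReads(Pool pool, int n) {
        int succeeded = 0;
        for (int i = 0; i < n; i++) {
            try (Handle h = pool.acquire(50)) {
                if (h != null) succeeded++;
            }
        }
        return succeeded;
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + leakyReads(new Pool(2), 5) + " of 5 reads succeeded");
        System.out.println("safe:  " + safeReads(new Pool(2), 5) + " of 5 reads succeeded");
    }
}
```

In the real job the equivalent would presumably be to close the returned object (or its content stream) as soon as the read is finished; the attached PR is the authoritative fix.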

Attachments

(1 file)

55 bytes, text/x-github-pull-request
mdoglio: review+
No description provided.
Assignee: nobody → rvitillo
Severity: normal → blocker
Priority: -- → P1
Attached file PR
Attachment #8768724 - Flags: review?(mreid)
Attachment #8768724 - Flags: review?(mreid) → review?(mdoglio)
User Story: (updated)
Attachment #8768724 - Flags: review?(mdoglio) → review+
The missing day has been back-filled successfully.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard