Closed Bug 1090284 Opened 10 years ago Closed 10 years ago

Investigate recent regression in pushlog ingestion performance

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: mdoglio)

References

Details

(Keywords: perf)

Attachments

(1 file)

Mauro mentioned in last week's meeting that there was a regression recently; we should figure out what caused it. I can't remember now whether it was on pushlog or job ingestion (or both?).
Keywords: perf
It's on pushlog ingestion; here is my theory. This is a chart of the SQL volume over the last four weeks: https://rpm.newrelic.com/public/charts/hsSFpmsqWtc I believe the huge increase there corresponds with the day bug 1083305 landed. Some pushes contain thousands of revisions, and the database takes a long time to ingest them. Three different insertions are required to store a push correctly:

1 insertion on the result_set table
N insertions on the revision table
N insertions on the revision_map table

where N is the number of revisions in that push. Before the changes requested in bug 1083305 landed, the ingestion of these kinds of pushes often failed on the second or third step. The push was then ingested again on the next round of data ingestion, but the corresponding result_set was already there, so the push was skipped. This is no longer the case: half-stored pushes keep being submitted, and the queries on revision and revision_map are executed again and again.
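The skip-on-existing-result_set behaviour described above can be sketched roughly as follows. This is an illustrative toy using sqlite3, not Treeherder's actual schema or code; the table and column names (result_set, revision, revision_map, push_id, sha) are assumptions based on the comment.

```python
import sqlite3

# Hypothetical sketch of the three-step push ingestion described above.
# Table/column names are illustrative, not Treeherder's real schema.
def store_push(conn, push_id, revisions):
    cur = conn.cursor()
    # Pre-bug-1083305 behaviour: if the result_set row already exists
    # (even from a half-stored push), skip the whole push, avoiding
    # the N revision/revision_map inserts on every ingestion round.
    cur.execute("SELECT 1 FROM result_set WHERE push_id = ?", (push_id,))
    if cur.fetchone():
        return False  # already (possibly partially) ingested; skip
    # Step 1: one insertion on result_set.
    cur.execute("INSERT INTO result_set (push_id) VALUES (?)", (push_id,))
    # Steps 2 and 3: N insertions each on revision and revision_map.
    for rev in revisions:
        cur.execute("INSERT INTO revision (sha) VALUES (?)", (rev,))
        cur.execute(
            "INSERT INTO revision_map (push_id, sha) VALUES (?, ?)",
            (push_id, rev),
        )
    conn.commit()
    return True
```

With the post-1083305 behaviour the early-return guard is effectively gone, so a push with thousands of revisions re-runs all 2N inserts on every round.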
I noticed yesterday that the production memcached instance doesn't contain a last_push key for all the repositories, and where the key is present, the value is empty. The result is that we keep ingesting data over and over, regardless of whether we already stored it. Still digging into what's causing this.
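The role of the last_push key can be sketched like this. A plain dict stands in for the production memcached instance, and the key format and function name are assumptions for illustration only, not Treeherder's actual code.

```python
# Illustrative sketch of the last_push guard: before ingesting, compare
# against the last push id cached per repository. The cache key format
# and pushes_to_ingest() are hypothetical names, not Treeherder's code.
cache = {}  # stand-in for the production memcached instance

def pushes_to_ingest(repo, available_push_ids):
    last = cache.get(f"{repo}:last_push")
    # A missing or empty key means "ingest everything" -- the symptom
    # observed on production when the key was absent or empty.
    new = [p for p in available_push_ids if last is None or p > last]
    if new:
        cache[f"{repo}:last_push"] = max(new)
    return new
```

When the key is stored correctly, a second ingestion round sees no new pushes; with the key missing on every round, the full list is re-ingested each time.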
Attachment 8514303 [details] [review] verifies that we are correctly storing the last_push key when we ingest a collection of result-sets.
Blocks: 1076750
Summary: Investigate recent regression in ingestion performance → Investigate recent regression in pushlog ingestion performance
Blocks: 1096919
No longer blocks: 1080757
Component: Treeherder → Treeherder: Data Ingestion
The root cause was a missing netflow (network flow rule) between the ETL nodes and the memcached instances. This is fixed now.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee: nobody → mdoglio
