Closed Bug 1072291 Opened 10 years ago Closed 10 years ago

Make pushlog ingestion more robust - round 2

Categories

(Tree Management :: Treeherder, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: mdoglio)

Attachments

(1 file)

12:51 <zac> mdoglio|lunch, treeherder is not picking up new builds on b2g-i
13:50 <zac> https://treeherder.allizom.org/ui/#/jobs?repo=b2g-inbound
13:50 <zac> it's stuck on the 4:32am commit
14:01 <•mdoglio> zac: I'll have a look
14:14 <•mdoglio> zac: it's working now, I don't know yet what happened. I just restarted the ingestion service and everything seems to be working now
14:15 <zac> thanks mdoglio yeah I can see all the results now

We need to figure out the root cause of this, in case it can occur on prod too (and in case it's due to the changes in bug 1071577).
Summary: Investigate the cause of ingestion failing on treeherder-dev → Investigate the cause of ingestion failing on treeherder stage
Blocks: 1072379
Summary: Investigate the cause of ingestion failing on treeherder stage → Investigate the cause of pushlog ingestion failing on treeherder stage
zac reported the same issue today
Assignee: nobody → mdoglio
We recently made some changes that affected network consumption. As a result, some data ingestion tasks can take much longer than before and may be discarded because they exceed the maximum execution time currently set. Increasing that setting should solve this issue.
(In reply to Mauro Doglio [:mdoglio] from comment #2)
> We recently made some changes that affected network consumption. As a
> result, some data ingestion tasks can take much longer than before and may
> be discarded because they exceed the maximum execution time currently set.
> Increasing that setting should solve this issue.

I really really think we should fix bug 1072422; this would solve the increased transfer time as well as the potential data loss by missing anything above the 10th push.
The PR just opened doesn't add back the cache reset functionality, which means we'll have to reset the cache manually in production.

As an alternative to that or to re-adding the Django admin reset, we could just handle the 404 response from json-pushes (since we correctly get one from "fromchange", unlike "startid") and in that case fall back to sending no fromchange param, like we do when the cache is empty.

Sound good? :-)
yeah, I'll add handling for the 404 from json-pushes
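For reference, a minimal sketch of the fallback discussed above, not the code that landed. The names build_url, fetch_pushes and http_get are hypothetical; http_get stands in for whatever HTTP client is used and is assumed to return a (status code, parsed JSON) pair. Only the "fromchange" parameter and the 404 behaviour come from this bug; other error statuses are deliberately left unhandled here.

```python
from urllib.parse import urlencode


def build_url(base_url, fromchange=None):
    """Return the json-pushes URL, adding fromchange only when one is given."""
    if fromchange:
        return base_url + "?" + urlencode({"fromchange": fromchange})
    return base_url


def fetch_pushes(base_url, cached_revision, http_get):
    """Fetch pushes, retrying without fromchange when the server answers 404.

    http_get is a hypothetical callable: url -> (status_code, parsed_json).
    Returns (parsed_json, used_fallback).
    """
    if cached_revision:
        status, body = http_get(build_url(base_url, cached_revision))
        if status != 404:
            return body, False
        # 404: json-pushes does not know the cached revision, so behave
        # exactly as if the cache were empty.
    status, body = http_get(build_url(base_url))
    return body, bool(cached_revision)
```

With this shape a stale cached revision costs one extra request instead of stalling ingestion until someone resets the cache by hand.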
Status: NEW → ASSIGNED
Summary: Investigate the cause of pushlog ingestion failing on treeherder stage → Make pushlog ingestion more robust - round 2
No longer blocks: 1072379
Attachment #8496043 - Flags: review?(jeads)
Attachment #8496043 - Flags: review?(jeads) → review+
Commits pushed to master at https://github.com/mozilla/treeherder-service

https://github.com/mozilla/treeherder-service/commit/8f9a686fde80dbd8a11d60ae4a502a6a20cfdc99
(bug 1072291) revert pushlog caching strategy

The pushlog cache now uses the top revision of the last push.
Also, increase the time limit to fetch the pushlog to 3 minutes

https://github.com/mozilla/treeherder-service/commit/cb3d46df361ce7d03acdbde7d0d8da81aab711e9
Bug 1072291 - handle 404 responses from json-pushes

https://github.com/mozilla/treeherder-service/commit/31bfc111f6760a65bd85567bfa606a6e9d9190f6
Merge pull request #229 from mozilla/bug-1072291-fix-pushlog-ingestion

(bug 1072291) revert pushlog caching strategy
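As an illustration of the caching strategy named in the first commit message (the cache holds the top revision of the last push), here is a rough sketch. It is not taken from the commit: the function names, the cache key format and the plain dict used as a cache are hypothetical, and the assumed input shape (push ids as string keys, each with a "changesets" list whose last entry is the top of the push) should be checked against the actual json-pushes output.

```python
def top_revision_of_last_push(pushes):
    """Return the top changeset of the newest push, or None if there are none.

    Assumes `pushes` maps push ids (strings in JSON) to dicts with a
    "changesets" list whose last entry is the top of the push.
    """
    if not pushes:
        return None
    # Compare ids numerically: as strings, "9" would sort after "10".
    last_push_id = max(pushes, key=int)
    changesets = pushes[last_push_id]["changesets"]
    return changesets[-1] if changesets else None


def update_cache(cache, repository, pushes):
    """Store the new starting point; keep the old value when nothing arrived."""
    key = repository + ":last_push"  # hypothetical key format
    revision = top_revision_of_last_push(pushes)
    if revision is not None:
        cache[key] = revision
    return cache.get(key)
```

The stored value is what a later fetch would pass as "fromchange".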
Depends on: 1074077
We should increase the timeout for pushlog retrieval and add a clear_cache command to the update.py deployment script. That would prevent bugs like bug 1074077 from happening again.
Increasing the timeout is not necessary; I will just update the deploy script to clear the cache as part of every deployment.
Depends on: 1074199
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Blocks: 1076750