Closed Bug 1075176 Opened 10 years ago Closed 10 years ago

Pushlog ingestion sometimes caches the wrong revision as the latest for each repo

Categories

(Tree Management :: Treeherder, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: mdoglio)

References

Details

Attachments

(1 file)

Ryan noticed that two pushes were missing from b2g32 and some also from mozilla-aurora.

It is my understanding that we should be doing something like:

1) Check cached value for last-seen-rev for repo X
2) Poll json-pushes for &fromchange=that-value
3) Attempt to insert those pushes into the DB
4) Only if the insert of all records was successful, update last-seen-rev in the cache to be the latest push revision seen

Given this is failing, we're either:
* Updating the cached value when the insert hadn't completed successfully
* Updating the cached value to the wrong revision (ie: we picked the revision incorrectly from the array of pushes)
jeads think this may be the cause:

https://github.com/mozilla/treeherder-service/blob/master/treeherder/etl/pushlog.py#L93
>    sorted_pushlog = sorted(extracted_content.values(),
>                            key=itemgetter('date'), reverse=True)
>    last_push = sorted_pushlog[0]

I think sorting by date is incorrect - we should sort by pushID (which is actually the original order).
(In reply to Ed Morley [:edmorley] from comment #1)
> I think sorting by date is incorrect - we should sort by pushID (which is
> actually the original order).

Or more:
    last_push = extracted_content.values()[-1]
Assignee: nobody → mdoglio
Status: NEW → ASSIGNED
Attachment #8498047 - Flags: review?(emorley)
Comment on attachment 8498047 [details] [review]
Github PR #235 on treeherder-service

This patch is definitely the right way to do this, so we should take it either way - however it got me thinking:

Surely if we pick the wrong push as the top push (ie like we do prior to this patch), and so had a "wrong" value for top_revision - surely this would just mean on the next poll, we request _more_ pushes than we need (since we'll end up requesting some we've already seen) - and as such, how are we losing data?

eg:
1) We poll json-pushes and get pushes ["A","B","C"].
2) We incorrectly store the rev for push "B" as the top revision.
3) On next poll we call json-pushes?fromchange=REV-OF-B, which returns ["C","D","E"].
4) We then attempt to insert those pushes, notice that "C" is already handled and the revisionhash in the cache, so only insert "D" and "E".

Unless at step#4 we're actually aborting the entire insert after spotting the dupe "C", rather than assessing each push individually?
Attachment #8498047 - Flags: review?(emorley) → review+
(In reply to Ed Morley [:edmorley] from comment #4)
> Unless at step#4 we're actually aborting the entire insert after spotting
> the dupe "C", rather than assessing each push individually?

If this is the case, then whilst the attached patch should definitely fix most cases where this would be exposed, I guess we're still at risk of race conditions between workers, eg:

1) Worker 1 polls json pushes and gets ["A","B","C"], but the polling took ages, so it hasn't yet inserted into the DB before worker 2 is spawned
2) Worker 2 polls json pushes and gets ["A","B","C","D"] since someone has just pushed. However it hasn't yet inserted this into the DB
3) Worker 1 finally is able to insert ["A","B","C"] and updated top_revision to that of push C.
4) Worker 2 tries to insert ["A","B","C","D"], however if the guess from comment 4 is correct, it will fail to insert "D" (due to the dupes of A-C), but it still then updates top_revision to that of push D.
5) Worker 3 is spawned later, sees the top_revision of D, so does a &fromchange=REV-OF-D, which will only return "E" onwards.
-> We've missed "D".
Commits pushed to master at https://github.com/mozilla/treeherder-service

https://github.com/mozilla/treeherder-service/commit/980ccd36b531dee3823ce850669fb5d0c03c3a0d
Bug 1075176 - use pushId instead of commit date

When fetching the pushlog we cache the top revision of the last push fetched
to subsequently request deltas. To determine the last push of a pushlog
response, we need to use the pushId.

https://github.com/mozilla/treeherder-service/commit/43e47cd7d17684a61cfc485baf5c871f38f4f1cb
Merge pull request #235 from mozilla/bug-1075176-use-pushId-to-sort-pushlog

Bug 1075176 - use pushId instead of commit date
The attached PR has been merged (and deployed on dev), but leaving open for comment 4 and comment 5.
jeads, Mauro is PTO today - could you take a look at the remainder of this bug? :-)
Flags: needinfo?(jeads)
In fact, let's make this bug specific to the problem the PR solves, and break out the other issue to a new bug.
Blocks: 1076750
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(jeads)
Resolution: --- → FIXED
Summary: Pushlog ingestion can still miss pushes → Pushlog ingestion sets the cached top_revision to the wrong value
Summary: Pushlog ingestion sets the cached top_revision to the wrong value → Pushlog ingestion sometimes caches the wrong revision as the latest for each repo
Filed bug 1076752.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: