Decide if pushlog data prior to June 2014 is required

RESOLVED WONTFIX

Status

Tree Management
Treeherder
P1
normal
4 years ago
4 years ago

People

(Reporter: emorley, Unassigned)


When importing a new repository for the first time, Treeherder imports the pushlog using json-pushes with no startID parameter, which returns only the last 24 hours' worth of pushes. From then on, only new pushes are imported.

This means that on production, we don't have pushlog history prior to when the repo was added/production went live. For mozilla-central this is June 2014 [1].

My concern with this is that people are going to be encouraged to use Treeherder as a single point of truth for push & job data, and yet:
1) They won't be able to perform pushlog analysis such as "compare backout to landing ratios" or "which person / parts of the tree result in the most backouts?" for pushes earlier than when each repo was added, unless they fall back to using the pushlog directly.
2) The push IDs for pushlog and Treeherder will be different, which might cause confusion. (Though I think this is less of an issue than #1).

AIUI, importing historical data would involve:
* Development time in the pushlog ingestion code to make it request all previous pushes (see also gps' Hg extension that does something similar, using this lib [2]), potentially using pagination to avoid load issues.
* Resetting all Treeherder DBs for Hg projects & starting from scratch, since the result set (push) IDs would be different.
* A fair bit of CPU time spent reimporting these repos, using the updated pushlog ingestion script. (NB: I'm explicitly not saying we should import old job data, just the pushlog).
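
To make the pagination idea above concrete, here is a minimal sketch (not Treeherder's actual ingestion code). The helper names `page_ranges`, `page_url` and `fetch_all_pushes`, and the page size of 500, are hypothetical. It assumes, based on the example URLs later in this bug (startID=0&endID=1 for the first push), that startID is exclusive and endID is inclusive, and that the response is a JSON object keyed by push ID.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# URL taken from this bug; any other repo's json-pushes endpoint would work the same way.
PUSHLOG_URL = "https://hg.mozilla.org/mozilla-central/json-pushes"


def page_ranges(last_push_id, page_size=500):
    """Yield (startID, endID) pairs that together cover pushes 1..last_push_id.

    Assumption: startID is exclusive and endID is inclusive.
    """
    start = 0
    while start < last_push_id:
        end = min(start + page_size, last_push_id)
        yield start, end
        start = end


def page_url(start_id, end_id, base_url=PUSHLOG_URL):
    """Build the json-pushes URL for one page of pushes."""
    query = urlencode({"full": 1, "startID": start_id, "endID": end_id})
    return f"{base_url}?{query}"


def fetch_all_pushes(last_push_id, page_size=500, fetch=None):
    """Fetch every page in order and merge the results into one dict.

    `fetch` is injectable so the paging logic can be exercised offline.
    Assumption: each response is a JSON object keyed by push ID.
    """
    if fetch is None:
        fetch = lambda url: json.load(urlopen(url))
    pushes = {}
    for start, end in page_ranges(last_push_id, page_size):
        pushes.update(fetch(page_url(start, end)))
    return pushes
```

Requesting fixed-size pages in ascending order keeps each request small and means the pushes arrive oldest first, which is what a from-scratch import would want if result set IDs are to follow push order.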

We need to decide ASAP whether this is something we're going to do, since in a few weeks' time it will be too late, due to requiring a DB reset.


[1] https://treeherder.mozilla.org/api/project/mozilla-central/resultset/1/ which corresponds to https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=86aa28ce309e
[2] https://hg.mozilla.org/hgcustom/version-control-tools/file/3a3edd2df65e/pylib/mozautomation/mozautomation/changetracker.py#l92 -> https://hg.mozilla.org/hgcustom/version-control-tools/file/3a3edd2df65e/pylib/mozautomation/mozautomation/repository.py#l246
Summary: Decide if pushlog history prior to June 2014 is required → Decide if pushlog data prior to June 2014 is required
With the following premises:
1- we don't need to store past jobs
2- the data format for the past result sets is the same as the one we usually get for json-pushes
I think this will not require much effort to implement. We should just verify that 2 is correct.
Also, I think we could do a DB reset whenever we want and import the pushlog history at a later time. The only downside would be having older resultsets in the database with a higher id, but I don't think this will be a problem.
Discussed on IRC a bit, but I'm having a hard time justifying this taking priority over other work right now. I'd lean towards doing this as a nice-to-have if the bullet points in comment 0 aren't too much work, but otherwise drop it and just tell people that Treeherder is the single source of truth moving forward, with no guarantees about past data.
(In reply to Mauro Doglio [:mdoglio] from comment #1)
> With the following premise: 
> 1- we don't need to store past jobs

Agreed

> 2- the data format for the past result sets is the same as the one we
> usually get for json-pushes

Recent push:
https://hg.mozilla.org/mozilla-central/json-pushes?full=1&startID=27490&endID=27491

Older push:
https://hg.mozilla.org/mozilla-central/json-pushes?full=1&startID=15000&endID=15001

First push:
https://hg.mozilla.org/mozilla-central/json-pushes?full=1&startID=0&endID=1
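
One way to check premise 2 mechanically is to compare the field names of the pushes returned by the three URLs above. This is only a sketch: the helper `same_format` is hypothetical, and the sample dicts in the usage example are made-up stand-ins for decoded responses, not real pushlog output. It assumes each response is a JSON object mapping push IDs to push objects.

```python
def same_format(*responses):
    """Return True if every push in every response has the same top-level fields.

    Assumption: each response is a dict mapping push IDs to push dicts.
    """
    shapes = {
        frozenset(push)
        for response in responses
        for push in response.values()
    }
    return len(shapes) <= 1


# Illustrative stand-ins only; the field names here are placeholders.
recent = {"27491": {"changesets": [], "date": 0, "user": "a@example.com"}}
older = {"15001": {"changesets": [], "date": 0, "user": "b@example.com"}}
odd = {"1": {"changesets": [], "date": 0}}

print(same_format(recent, older))
print(same_format(recent, older, odd))
```

This only compares top-level keys; nested changeset entries would need the same treatment if the ingestion code depends on their fields.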

> Also, I think we could do a DB reset whenever we want and import the pushlog
> history at a later time. The only downside would be to have older
> resultsets in the database with a higher id, but I don't think this will be
> a problem

Hmm, that isn't ideal (depending on whether downstream consumers make assumptions about a higher result set ID being more recent when analysing etc), but I guess it would be much better than having no data at all :-)
(In reply to Ed Morley [:edmorley] from comment #3)
> Hmm that isn't ideal (depending on whether downstream consumers make
> assumptions about a higher result set ID being more recent when analysing
> etc), but I guess it would be much better than having no data at all :-)

FWIW, I do occasionally make use of the numerical ID for coming up with regression ranges (it makes it easy to get e.g. the 50 csets prior to a rev as well). OTOH, Treeherder already has a better UI for that use case anyway, so maybe it's moot.
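
For illustration, the numeric-ID trick mentioned above might look like the following sketch. The helper `prior_pushes_url` is hypothetical, and it again assumes startID is exclusive and endID is inclusive. Note that it selects the N pushes before a given push ID, and each push can contain more than one changeset.

```python
from urllib.parse import urlencode

# URL taken from this bug.
PUSHLOG_URL = "https://hg.mozilla.org/mozilla-central/json-pushes"


def prior_pushes_url(push_id, count=50, base_url=PUSHLOG_URL):
    """Build a json-pushes URL covering the `count` pushes before `push_id`.

    Assumption: startID is exclusive and endID is inclusive, so the range
    (push_id - 1 - count, push_id - 1] selects exactly `count` pushes.
    """
    end = push_id - 1
    start = max(end - count, 0)  # clamp so the range never goes below the first push
    return f"{base_url}?{urlencode({'startID': start, 'endID': end})}"


print(prior_pushes_url(27491))
```

This arithmetic only works while push IDs increase with push time, which is the assumption that importing history afterwards with higher IDs would break.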
So data expiry came up on IRC - which affects the decision here, since if we're going to expire all data after 6 months, then the concern here is much reduced (since we're not going to be able to do long term analysis anyway).

A few questions:
1) How long is Treeherder going to retain job data? (Mauro mentioned 6 months)
2) If we discard job data after 6 months, should we/do we need to keep pushlog data for longer?

If the answer to #2 is "no, 6 months too" then we can just wontfix this and move on :-)
I don't mind data only coming from June 2014. Going forward I would like to have a decent amount (>= 6 months' worth) of data available so that when I look at moving Futurama (and expanding my reporting) I will be able to get seasonal data.
Having full data for 6 months is great for Talos regression hunting. Some people want a full year. Either way, until we have parity with TBPL for Talos data, Treeherder provides little value. What that means is that, starting from June, we will only be accumulating more data, and will be close to, if not past, the 6 month mark by the time Treeherder is useful for Talos data.

For data older than 6 months, is there value in creating a historical database for data mining purposes?
Wontfixing after chatting more in today's Treeherder progress meeting.

Many use cases for analysis will either require job data too (in which case this bug won't help, since it's only about the pushlog, as it's not practical to import historic job data) or else won't need anything longer than the N months that we set data expiry to (currently 6 months). For anything else, by the time downstream consumers switch to Treeherder, we'll have built up 6 months of historic data, which should be sufficient.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WONTFIX