Open Bug 1654496 Opened 2 months ago Updated 10 hours ago

Update perfherder to ingest and present test information generated by multi commit builds

Categories

(Tree Management :: Perfherder, task, P1)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: Bebe, Assigned: igoldan)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(1 file)

For applink tests we now have multi-commit test runs.

These tests represent a Taskcluster run that generates test data for multiple Fenix builds. The data is presented as multiple PERFHERDER_DATA entries in the log, or as multiple .json files in the artifact. Each entry represents a Fenix commit/build generated by the Fenix project.
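
To illustrate, here is a minimal sketch (not the actual ingestion code) of pulling every PERFHERDER_DATA entry out of such a log; which field inside each entry carries the Fenix revision is an assumption, not a confirmed schema:

import json
import re

# One PERFHERDER_DATA line per Fenix build in a multi-commit run.
PERFHERDER_RE = re.compile(r"PERFHERDER_DATA:\s*(\{.*\})")

def extract_perfherder_entries(log_text):
    return [json.loads(m.group(1)) for m in PERFHERDER_RE.finditer(log_text)]

Each extracted entry should then be tied to its own Fenix revision instead of all of them collapsing onto the single mozilla-central push.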

Currently Perfherder is ingesting this data as 'mozilla-central' branch datapoints and only the last datapoint is available:

Applink tests graph view

The scope of this task is to update Perfherder to correctly ingest this data.

We want each data point to point to the correct Fenix branch commit.
All data from the "multi-commit" build should be shown separately for each contained commit.

This is linked to https://jira.mozilla.com/browse/FXP-1201

Can you please triage and prioritize this task? It's vital for Fenix performance sheriffing.

Flags: needinfo?(ajones)

Sarah, in the last Perfherder meeting you said you could help prioritize and gather more info about this task.
Any updates?

Flags: needinfo?(sclements)

I was told by ajones on Monday that Ionut was going to work on this task.

Flags: needinfo?(sclements)
Flags: needinfo?(ajones)
Assignee: nobody → igoldan
Assignee: igoldan → nobody

(In reply to Sarah Clements [:sclements] from comment #3)

I was told by ajones on Monday that Ionut was going to work on this task.

:igoldan is on PTO until August 10th. My understanding was that :sclements and :igoldan would work together on this.

Yes, I'll provide any support or guidance as best I can :)

For many years now, each perf datum has been associated with a single job; that job is directly associated with a single push (from mozilla-central), which in turn is associated with a single push timestamp (the actual time when the push was made). That is a normal assumption.
For our new use case in particular, each new perf datum is still associated with a single job, but these jobs are not necessarily associated with a single push (from mozilla-central); rather, they map to multiple pushes from the Fenix repo, which in turn are associated with distinct push timestamps.

I looked a bit into the pipeline code (residing in perf.py).

This piece of datum ingestion code is currently shadowing the ingestion of new data points pertaining to the same mozilla-central push.
I'm not entirely sure about this, but it's possible we can fix this shadowing issue by tweaking this datum ingestion line.
Instead of simply fetching & updating perf data based on its job's push's push_timestamp, we could fetch & update it based on its specific Fenix push's push_timestamp, if there is one; if there isn't (the old default behavior), we simply fall back to the old fetching & updating algorithm.

Basically, we would simply need to extend the fetch & update ability we currently have for our perf data.
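
A rough sketch of the idea (hypothetical names, not the actual perf.py code):

def resolve_push_timestamp(submitted_datum, default_push_timestamp):
    # "fenix_push_timestamp" is an assumed field carried by multi-commit
    # submissions; the real payload may expose the Fenix push differently.
    return submitted_datum.get("fenix_push_timestamp") or default_push_timestamp

The existing fetch & update call would then key on this resolved timestamp, so datapoints belonging to the same mozilla-central push no longer shadow one another.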

In my experience, this seems to be the way to do it. Still, I'm not 100% sure about it & we could uncover some unexpected problems along the way (requiring either refactors or a different approach).
Going with the preliminary solution I proposed, the code changes aren't big. The bulk of the effort would be providing proper test coverage & QA on the staging environments before we release a change like this to production. I stress this because the ingestion pipeline is the most important & yet most intricate part of Perfherder/Treeherder.

Assignee: nobody → igoldan
Status: NEW → ASSIGNED
Priority: -- → P1
Depends on: 1658935
Depends on: 1659072
Depends on: 1660779
Blocks: 1660780

The PR is ready & working. Still, I am unable to deploy it on the staging environment (not to mention production), because it seems the disk size allocated to the MySQL database isn't sufficient (more details in the PR's comments).

Unless we increase the storage size for the databases on treeherder-prototype, treeherder-staging & treeherder-production, this bug will remain blocked.

Flags: needinfo?(sclements)

The stage and prototype RDS instances are using db.m5.xlarge while production is using db.m5.2xlarge. I believe this was something Armen did a while ago as part of cost-saving measures (reducing the size of stage and prototype). I took a look in New Relic and there weren't additional details, so it seems we'll be making a guess about how much extra disk space is needed.

Then there's also the fact that we are increasing the RDS instances so we can add an index to an already bloated table that duplicates storage of data (since it's storing a year's worth of performance data rather than utilizing active data). So I want to be sure we're looking at all angles.

Ionut, I'm curious how far along your Jira issue is to look at storing less in the TH performance_datum table and start using active data?

:dividehex, would you be able to offer any insight or suggestions regarding the upgrading of prototype and stage instances and what would be involved?

:camd, do you have any thoughts on this?

Flags: needinfo?(sclements) → needinfo?(jwatkins)

Also, from what I understand from the PR comments Ionut, the only reason we even need a schema change on the performance_datum table is because you want to create a temporary table with a foreign key on performance_datum per https://github.com/mozilla/treeherder/pull/6710#discussion_r481867506, in case the multi-commit perf data changes don't work as expected.

So, we'd be increasing the RDS instances only for this reason. Could we perhaps find a workaround? Do you need to keep the data for a month in that table, or would running your PR on prototype2 for a week or so without that table (and the perf team could use prototype for testing temporarily) be sufficient for you to feel confident in the performance_datum multi-commit changes?

(In reply to Sarah Clements [:sclements] from comment #9)

[...]
Ionut, I'm curious how far along your Jira issue is to look at storing less in the TH performance_datum table and start using active data?

I've only done the breakdown of this effort, to have an idea of what it will involve. But I don't know when I'll pick up this issue.
Dave, could we consider working on improving Perfherder's data retention in the 2020-October sprint?

Flags: needinfo?(dave.hunt)

(In reply to Sarah Clements [:sclements] from comment #10)

Also, from what I understand from the PR comments Ionut, the only reason we even need a schema change on the performance_datum table is because you want to create a temporary table with a foreign key on performance_datum per https://github.com/mozilla/treeherder/pull/6710#discussion_r481867506, in case the multi-commit perf data changes don't work as expected.

So, we'd be increasing the RDS instances only for this reason. Could we perhaps find a workaround? Do you need to keep the data for a month in that table, or would running your PR on prototype2 for a week or so without that table (and the perf team could use prototype for testing temporarily) be sufficient for you to feel confident in the performance_datum multi-commit changes?

This isn't quite accurate. The PR actually does 2 schema migrations: one for the temporary table + another one for extending the unique constraint on the performance_datum table (by adding the push_timestamp field into the equation). The temporary table migration went flawlessly, while the unique constraint change was the one that broke, because of storage limitations.

Thus, the problem here isn't the temporary table migration, but the unique constraint issue (which is mandatory for implementing this feature).
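
For context, a hedged sketch of roughly what that second migration amounts to (assumed app, model & field names; the real migration is in the PR):

from django.db import migrations

class Migration(migrations.Migration):

    dependencies = [
        ("perf", "XXXX_previous_migration"),  # placeholder dependency
    ]

    operations = [
        migrations.AlterUniqueTogether(
            name="performancedatum",
            # push_timestamp added to the previous unique key (assumed shape).
            unique_together={("repository", "signature", "job", "push_timestamp")},
        ),
    ]

On MySQL, rebuilding a unique index on a table this large can require a substantial amount of temporary disk space, which would be consistent with the storage failure above.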

Ah, I see. Thanks for clarifying.

(In reply to Sarah Clements [:sclements] from comment #9)

:dividehex, would you be able to offer any insight or suggestions regarding the upgrading of prototype and stage instances and what would be involved?

I've gone down this road a few times and it plays out like this: request to increase space on RDS, I increase RDS storage size, management sees next month's AWS bill, request to downgrade back to original size, I downgrade size (nervously). In 6 months to a year, rinse and repeat. Sorry if this reply comes across a little rough, I mean no harm. I'm perfectly happy increasing the size of the RDS instance storage BUT... the increased cost estimates must be approved up through management before proceeding.

Do you know what storage size you would like to target? Are you also looking to upgrade the instance type on staging and prototype to match prod?

/cc cshields since we are talking about increasing AWS costs

Flags: needinfo?(jwatkins)

(In reply to Jake Watkins [:dividehex] from comment #14)

(In reply to Sarah Clements [:sclements] from comment #9)

:dividehex, would you be able to offer any insight or suggestions regarding the upgrading of prototype and stage instances and what would be involved?

I've gone down this road a few times and it plays out like this: request to increase space on RDS, I increase RDS storage size, management sees next month's AWS bill, request to downgrade back to original size, I downgrade size (nervously). In 6 months to a year, rinse and repeat. Sorry if this reply comes across a little rough, I mean no harm. I'm perfectly happy increasing the size of the RDS instance storage BUT... the increased cost estimates must be approved up through management before proceeding.

Do you know what storage size you would like to target? Are you also looking to upgrade the instance type on staging and prototype to match prod?

Thanks for the backstory :) Unfortunately I do not know how much more storage we need for those instances, which also makes this not an ideal solution. I mostly wanted to know what's involved, since I haven't been involved in this historically.

My preference at this point would be for the Treeherder and Perfherder teams to come to a revised agreement about the max storage time for the performance_datum table (currently our second largest table at 19.2GB) and for them to use active data as necessary, as Ionut has mentioned above. I don't want to block on this bug, but the data retention for the perf team has been mentioned a few times in the past and never really reconciled. I'm hesitant to go through the whole song and dance of upgrading our RDS instances just for this purpose when I think the data retention issue should be addressed first (unless perhaps an argument can be made that the ROI of increasing the RDS instances is worth the cost?).

As the perf sheriffing team, monitoring Fenix nightly revisions is a top priority for us. This bug is an essential milestone for that.
Thus, we'd like to increase the database size for treeherder-prototype & treeherder-staging (no instance type upgrade needed), while also admitting that the data retention is technical debt which needs to be taken care of ASAP.

As a result, we're prioritizing the data retention issues, such that we can pick them up in the October-2020 sprint.

Thanks for prioritizing the data retention issue. I'll talk to jwalker (interim manager) about getting approval for a temporary/short term increase of prototype and stage to match production. Let's be sure we have a bug filed for decreasing the RDS sizes (do you have a bugzilla meta bug for this task aside from the Jira issue?) - along with increasing the amount of performance data removed from that table (via the cycle_data task) - once the other work that's involved is complete.

Fenix is important, and multi-commit builds do help save a lot of money in machine runtime - so overall this ends up being a win.

I agree we need to rethink our needs for long-term data in Treeherder hot storage. We have a full replica in ActiveData and in BigQuery - so we are paying 3x for this storage (more if you count TH staging/dev and maybe ActiveData staging).

Is there a bug we can link to about fixing the large storage needs of perfherder data? I recall in bug 1649551 asking about adding another field to performance signature table, but then realized we create a new performance signature for every push- it seems as though that table should be 5% the size it currently is and has a lot of redundant data in it.

We have also cut our CI/tooling costs by over 50%; I think spending more with a plan to fix this in the next month (i.e. linked bugs assigned to people) should be a no-brainer.

Alternatively, we could increase the size for a period of time, and then decrease or delete the instance after that time period is up.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #18)

[...]
Is there a bug we can link to about fixing the large storage needs of perfherder data? I recall in bug 1649551 asking about adding another field to performance signature table, but then realized we create a new performance signature for every push- it seems as though that table should be 5% the size it currently is and has a lot of redundant data in it.

There's nothing filed on Bugzilla yet. We filed the FXP-1315 - Address data retention in Perfherder epic on Jira, which we can mirror on Bugzilla.

Update: I just filed bug 1665857 as a meta for this.

See Also: → 1665857
Blocks: 1665859

(In reply to Sarah Clements [:sclements] from comment #17)

Thanks for prioritizing the data retention issue. I'll talk to jwalker (interim manager) about getting approval for a temporary/short term increase of prototype and stage to match production. Let's be sure we have a bug filed for decreasing the RDS sizes (do you have a bugzilla meta bug for this task aside from the Jira issue?)

I filed bug 1665859 for this.

  • along with increasing the amount of performance data removed from that table (via the cycle_data task) - once the other work that's involved is complete.

The meta bug 1665857 I mentioned in comment #19 should take care of that.

(In reply to Ionuț Goldan [:igoldan] from comment #11)

(In reply to Sarah Clements [:sclements] from comment #9)

[...]
Ionut, I'm curious how far along your Jira issue is to look at storing less in the TH performance_datum table and start using active data?

I've only done the breakdown of this effort, to have an idea of what it will involve. But I don't know when I'll pick up this issue.
Dave, could we consider working on improving Perfherder's data retention in the 2020-October sprint?

Yes, we can work on this starting in October, provided we have support from the Treeherder team.

Flags: needinfo?(dave.hunt)