Bug 1956882 (Open), opened 28 days ago, updated 25 days ago

pushflatpak doesn't wait long enough for flathub to process our upload

Categories

(Release Engineering :: Release Automation, defect)

Tracking

(Not tracked)

People

(Reporter: jlorenzo, Unassigned)

References

Details

The apparent problem

Today, it took 6 attempts to turn YkxZ73_eSqeFhZEaHNu70Q green.

First attempt:

[...]
2025-03-27 13:40:59,973 - pushflatpakscript.flathub - INFO - Pushing the flatpak to the associated https://hub.flathub.org/api/v1/build/172951
2025-03-27 13:41:46,421 - pushflatpakscript.flathub - INFO - Commit-ing the flatpak to the associated https://hub.flathub.org/api/v1/build/172951
2025-03-27 13:42:03,525 - pushflatpakscript.flathub - INFO - Publishing the flatpak to the associated https://hub.flathub.org/api/v1/build/172951
Automation Error: python exited with signal -15

6th attempt:

[...]
2025-03-27 15:29:10,778 - pushflatpakscript.flathub - INFO - Pushing the flatpak to the associated https://hub.flathub.org/api/v1/build/172983
2025-03-27 15:29:22,175 - pushflatpakscript.flathub - INFO - Commit-ing the flatpak to the associated https://hub.flathub.org/api/v1/build/172983
2025-03-27 15:29:40,352 - pushflatpakscript.flathub - INFO - Publishing the flatpak to the associated https://hub.flathub.org/api/v1/build/172983
exit code: 0

Python exiting with signal -15 (SIGTERM) means the worker was being shut down. In the case of pushflatpak, we wait 20 minutes before killing the worker, and the first 5 attempts were indeed killed after 20 minutes.

This means we don't wait long enough for the task to complete.
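As an aside, a minimal sketch (assuming pushflatpakscript installed its own handler, which it may not do today) of catching that shutdown signal so the task log states explicitly why the process died rather than just "exited with signal -15":

import logging
import signal
import sys

log = logging.getLogger(__name__)

def _on_sigterm(signum, frame):
    # The worker sends SIGTERM (signal 15) when its shutdown deadline is reached.
    log.error("Received SIGTERM: the worker is shutting down before flathub finished processing the upload")
    sys.exit(1)

signal.signal(signal.SIGTERM, _on_sigterm)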

The potential explanation

I chatted with bbhtt on #flatpaks and he explained that update-repo was holding a lock that prevented our flatpak publications from being acknowledged. This mechanism is something Mozilla cannot control (i.e. it wasn't caused by concurrent tasks pushing several versions of Firefox at the same time). Therefore, we need to make pushflatpak more resilient.
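"More resilient" could be as simple as retrying the publish step from inside the task with a growing delay, instead of burning a whole task attempt every time the lock is held. A minimal sketch, with entirely hypothetical names and retry parameters:

import time

def retry_with_backoff(action, attempts=6, base_delay=60):
    # Retry `action` with an increasing delay between attempts; re-raise on the last one.
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == attempts:
                raise
            delay = base_delay * attempt
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)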

The missing information

Our flatpak jobs are not outputting enough logs. bbhtt pointed me to this example:

$ flat-manager-client -v publish --wait-update "$(cat publish_build.txt)"
Publishing build https://hub.flathub.org/api/v1/build/172122
Waiting for publish job
/ Job was started
| Running publish hook
| Importing build to repo stable
| generating org.freedesktop.Platform.ClInfo.flatpakref
[...]
| generating org.freedesktop.Platform.GlxInfo.flatpakref
| Queued repository update job 323356 in 120 secs
| Removing build 172122
\ Job completed successfully
Queued repo update job 323356

We are missing these logs in our tasks. From what I can tell, these logs are written to stdout, and we should already be displaying stdout, so I'm not sure where they get swallowed. In any case, this is something we should fix, since it matters for the next point.
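If the client is run as a subprocess, one plausible culprit is capturing the output only once the process exits (or buffering). A minimal sketch, assuming we shell out to flat-manager-client as in the example above (the wrapper itself is hypothetical, not the current pushflatpakscript code), that relays every line to the task log as it is produced:

import subprocess

def run_and_stream(cmd):
    # Forward the child's stdout line by line so flat-manager-client progress
    # messages show up in the task log immediately instead of being lost.
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1
    )
    for line in proc.stdout:
        print(line, end="")
    proc.wait()
    if proc.returncode != 0:
        raise RuntimeError(f"{cmd[0]} exited with code {proc.returncode}")

# e.g. run_and_stream(["flat-manager-client", "-v", "publish", build_url])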

The short term fix

Better logs will make the problem easier to diagnose, but we also need to increase the timeout on the pushflatpak workers: with the current retry mechanism, we are submitting more jobs to update-repo than necessary. Maybe we should do what pushmsix does and increase it to an hour. For the record, with 6 attempts it took around 1h45min to get a green job.
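For reference, the arithmetic only; the actual change is a single worker configuration value (I'm borrowing the max_task_timeout name used in the comment below, the real key may differ):

CURRENT_TIMEOUT = 20 * 60   # seconds: today pushflatpak kills the worker after 20 minutes
PROPOSED_TIMEOUT = 60 * 60  # seconds: an hour, like pushmsix
# Today's incident: 6 attempts at up to 20 minutes each, roughly 1h45min of wall time overall.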

The long term fix?

Even once update-repo finally accepted our submission, the latest version of Firefox was still not published: update-repo still has to process it, and that is a long operation. bbhtt told me about --wait-update, which increases the overall duration of the submission but ensures the latest submission has been fully processed. I wonder if we should do something similar to our Apple notarization tasks, where one task submits the binaries and another task monitors when they have been processed. This would give Release Management better clarity on the state of the flatpak.
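For what it's worth, a minimal sketch of a blocking publish from the script, reusing the exact --wait-update invocation from the log excerpt above (the wrapper function itself is hypothetical):

import subprocess

def publish_and_wait(build_url):
    # Publish the build and block until flathub's repo update has processed it,
    # mirroring the `flat-manager-client -v publish --wait-update` example above.
    subprocess.run(
        ["flat-manager-client", "-v", "publish", "--wait-update", build_url],
        check=True,
    )

Splitting this into a submit task plus a separate monitoring task, like the notarization setup, would also need a way for the second task to wait on the queued repo update job; I don't know offhand whether flat-manager-client exposes that separately.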

I'm not sure that the issue is that we're not waiting long enough tbh. The successful task only took 3 min to run, so I'm not sure waiting even longer would have made a difference.

The takeaways for me are:

  1. pushflatpakscript should be able to detect when something has gone wrong and bail out appropriately. I think we need more logging out of flat-manager-client to know what needs to be done here.
  2. The max_task_timeout should be lower than the polling interval in the pre-stop.sh script. I'm a bit fuzzy on how that script works, but task issues shouldn't result in the worker getting shut down.