pushflatpak doesn't wait long enough for flathub to process our upload
Categories
(Release Engineering :: Release Automation, defect)
Tracking
(Not tracked)
People
(Reporter: jlorenzo, Unassigned)
References
Details
The apparent problem
Today, it took 6 attempts to turn YkxZ73_eSqeFhZEaHNu70Q green.
First attempt:
[...]
2025-03-27 13:40:59,973 - pushflatpakscript.flathub - INFO - Pushing the flatpak to the associated https://hub.flathub.org/api/v1/build/172951
2025-03-27 13:41:46,421 - pushflatpakscript.flathub - INFO - Commit-ing the flatpak to the associated https://hub.flathub.org/api/v1/build/172951
2025-03-27 13:42:03,525 - pushflatpakscript.flathub - INFO - Publishing the flatpak to the associated https://hub.flathub.org/api/v1/build/172951
Automation Error: python exited with signal -15
6th attempt:
[...]
2025-03-27 15:29:10,778 - pushflatpakscript.flathub - INFO - Pushing the flatpak to the associated https://hub.flathub.org/api/v1/build/172983
2025-03-27 15:29:22,175 - pushflatpakscript.flathub - INFO - Commit-ing the flatpak to the associated https://hub.flathub.org/api/v1/build/172983
2025-03-27 15:29:40,352 - pushflatpakscript.flathub - INFO - Publishing the flatpak to the associated https://hub.flathub.org/api/v1/build/172983
exit code: 0
Python signal -15 means the worker was shut down. In the case of pushflatpak, we wait 20 minutes before killing the worker, and the first 5 attempts were killed after 20 minutes. This means we don't wait long enough for the task to complete.
The potential explanation
I chatted with bbhtt on #flatpaks and he explained that update-repo was holding a lock, preventing our flatpak publications from being acknowledged. This mechanism is something Mozilla cannot control (i.e., this wasn't caused by concurrent tasks pushing several versions of Firefox at the same time). Therefore, we need to make pushflatpak more resilient.
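One way to make pushflatpak more resilient to a transient update-repo lock is to retry inside the task with backoff, instead of relying on whole-task reruns. Here is a minimal sketch; the function name, attempt count, and delays are illustrative assumptions, not anything pushflatpakscript currently implements:

```python
import random
import time


def retry_with_backoff(operation, attempts=5, base_delay=30):
    """Retry an operation that can fail transiently (e.g. while flathub's
    update-repo holds its lock), sleeping with jittered exponential backoff
    between attempts. Parameter values are illustrative, not tuned."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

Retrying in-process like this avoids re-uploading the build on every attempt, which is what makes the current task-level retries submit extra jobs to update-repo.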
The missing information
Our flatpak jobs are not outputting enough logs. bbhtt pointed me to this example:
$ flat-manager-client -v publish --wait-update "$(cat publish_build.txt)"
Publishing build https://hub.flathub.org/api/v1/build/172122
Waiting for publish job
/ Job was started
| Running publish hook
| Importing build to repo stable
| generating org.freedesktop.Platform.ClInfo.flatpakref
[...]
| generating org.freedesktop.Platform.GlxInfo.flatpakref
| Queued repository update job 323356 in 120 secs
| Removing build 172122
\ Job completed successfully
Queued repo update job 323356
We are missing these logs in our tasks. From what I can see, they seem to be written to stdout. However, we should already be displaying stdout, so I'm not sure where the logs are being swallowed. In any case, this is something we should fix before tackling the next point.
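If the logs are being swallowed because the subprocess output is captured in one buffered blob (or discarded) rather than streamed, a sketch like the following would forward flat-manager-client's output line by line. The function name and logging setup are hypothetical, not pushflatpakscript's actual code:

```python
import logging
import subprocess

log = logging.getLogger(__name__)


def run_and_stream(command):
    """Run a command and forward each stdout line to our logger as it
    arrives, so progress lines like 'Queued repository update job ...'
    show up in the task log in real time."""
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr so nothing is lost
        text=True,
        bufsize=1,  # line-buffered
    )
    for line in process.stdout:
        log.info(line.rstrip())
    process.wait()
    if process.returncode != 0:
        raise subprocess.CalledProcessError(process.returncode, command)
```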
The short term fix
Having better logs will make the problem easier to diagnose. In the meantime, we need to increase the timeout on the pushflatpak workers: with our current retry mechanism, we are submitting more jobs to update-repo than necessary. Maybe we should do the same as pushmsix and increase it to an hour. For the record, with 6 attempts, it took around 1h45min to get a green job.
The long term fix?
Even when update-repo finally accepted our submission, the latest version of Firefox was still not published: update-repo still has to process it, and that is a long operation. bbhtt told me about --wait-update, which increases the overall duration of the submission but ensures the latest submission was fully processed. I wonder if we should do something similar to our Apple notarization tasks, where one task submits the binaries and another monitors when they are processed. This would give Release Management better clarity on the state of the flatpak.
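The submit-then-monitor split could look roughly like the sketch below: the monitor task polls until the publish reaches a terminal state. The state fetcher is injected, and the state names ("published", "failed") are illustrative assumptions, not the real flat-manager API:

```python
import time


def wait_for_publish(build_url, fetch_state, poll_interval=30, timeout=3600):
    """Poll until the publish job for build_url reaches a terminal state.

    fetch_state is an injected callable (e.g. an HTTP call against
    flat-manager); the state names below are illustrative assumptions.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = fetch_state(build_url)
        if state == "published":
            return state
        if state == "failed":
            raise RuntimeError(f"publish failed for {build_url}")
        time.sleep(poll_interval)
    raise TimeoutError(f"publish of {build_url} not finished after {timeout}s")
```

Running this in a separate monitoring task, as we do for Apple notarization, keeps the submit task short while still surfacing the final publication state.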
Comment 1•28 days ago
I'm not sure that the issue is that we're not waiting long enough tbh. The successful task only took 3 min to run, so I'm not sure waiting even longer would have made a difference.
The takeaways for me are:
- pushflatpakscript should be able to detect when something has gone wrong and bail out appropriately. I think we need more logging out of flat-manager-client to know what needs to be done here.
- The max_task_timeout should be lower than the polling interval in the pre-stop.sh script. I'm a bit fuzzy on how that script works, but task issues shouldn't result in the worker getting shut down.
Comment 2•25 days ago
https://bugzilla.mozilla.org/show_bug.cgi?id=1909593#c11 and following are related.