PingUploadWorker frozen for at least 16 hours due to a regression caused by PR 1240
Categories
(Data Platform and Tools :: Glean: SDK, defect, P1)
Tracking
(Not tracked)
People
(Reporter: florent, Assigned: janerik)
References
Details
Attachments
(2 files)
User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0
Steps to reproduce:
I tried to run the automated tests of moz-phab, following the instructions provided here:
https://github.com/mozilla-conduit/review
Basically, I tried to execute the following command:
venv/bin/pytest -vv tests/test_bmo.py
Actual results:
========================================================================== test session starts ==========================================================================
platform linux -- Python 3.8.5, pytest-5.2.1, py-1.8.0, pluggy-0.13.0 -- /.../venv/bin/python3
cachedir: .pytest_cache
rootdir: /.../review-mozphab, inifile: pytest.ini
collected 242 items
tests/test_bmo.py::test_get PASSED [ 0%]
tests/test_bmo.py::test_whoami
^C
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The second test hangs forever, until I press Ctrl+C.
In fact, it is blocked in mozphab's fixture_reset_glean fixture, which calls the following Glean SDK function:
glean.testing.reset_glean(application_id="mozphab", application_version="0.1.86")
reset_glean is then stuck waiting for all Glean processes to terminate, and it stays stuck because the subprocess running PingUploadWorker never exits.
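To make the failure mode concrete, here is a simplified sketch of the shape of the hang (not Glean's actual code; I'm assuming the uploader runs as a subprocess that reset_glean joins):

import multiprocessing
import time

def ping_upload_worker():
    # Stand-in for PingUploadWorker: with the unit bug from PR 1240,
    # the retry delay is effectively time.sleep(60000), i.e. ~16 hours.
    time.sleep(60000)

if __name__ == "__main__":
    worker = multiprocessing.Process(target=ping_upload_worker)
    worker.start()
    # reset_glean blocks like this, waiting for the worker to exit,
    # so the whole test suite hangs for as long as the worker sleeps.
    worker.join()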
Expected results:
PingUploadWorker should have returned within a few seconds at most.
The subprocess would then have terminated, and reset_glean could have returned instead of being stuck for half a day.
Comment 1 • Reporter • 4 years ago
I think this issue is a regression introduced by the following pull request, which was recently merged:
https://github.com/mozilla/glean/pull/1240
First of all, it looks like the author didn't realize that the sleep function does not take milliseconds in every programming language: Python's sleep expects seconds, and I think the same is true in Swift.
So, in the end, if a ping is not sent immediately after the subprocess starts, we have to wait 60000 seconds (~16 h) before the first retry.
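To illustrate the unit mix-up (the constant name here is made up; the real value lives in the PR):

import time

RETRY_DELAY = 60000  # presumably meant as milliseconds

# Python's time.sleep takes SECONDS, not milliseconds,
# so this sleeps for 60000 s, i.e. roughly 16h40m:
time.sleep(RETRY_DELAY)

# What was presumably intended (60 seconds):
time.sleep(RETRY_DELAY / 1000)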
That being said, even if it were 60 s, it makes no sense to wait so long for a first retry, especially since the previous delay was between 100 ms and 1 s depending on the language.
If a longer backoff was intended, it should at least be progressive: 0.1 s, 1 s, 5 s.
But, in my opinion, a long backoff delay is a really bad idea, because the worker is completely blocked by the sleep for that duration: it waits unconditionally and uselessly for the rest of the timer even when the task is already in the DONE or UPLOAD state.
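As a sketch of what I mean (illustrative only; retry_upload and the schedule are made up), a progressive backoff that can still be cut short:

import threading

done = threading.Event()  # set by whoever marks the task DONE

def retry_with_backoff(retry_upload):
    # Progressive backoff: 0.1 s, 1 s, 5 s between attempts.
    # Event.wait() returns True as soon as `done` is set, so we
    # never sit out the remainder of a delay for nothing.
    for delay in (0.1, 1.0, 5.0):
        if done.wait(timeout=delay):
            return  # already DONE; stop retrying
        retry_upload()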
Comment 2 • 4 years ago
Comment 3 • Assignee • 4 years ago
Wow, I'm not sure how we missed the difference in units (and here's my wish for types with units...).
For the second thing: we should only sleep when the system returns wait, which is an explicit signal that we're throttled. We don't use a sliding window yet; I guess we actually should, to make our lives easier.
Will review the PR today and determine a way forward.
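In loop form, the intended policy reads something like this (an illustrative sketch, not the SDK's actual API; the task objects and field names are made up):

import time

def upload_loop(get_next_task, try_upload):
    # Only sleep when the core explicitly signals "wait" (throttled);
    # otherwise keep draining upload tasks until "done".
    while True:
        task = get_next_task()
        if task.kind == "done":
            break
        if task.kind == "wait":
            time.sleep(task.delay_ms / 1000)  # delay given in ms
        elif task.kind == "upload":
            try_upload(task.ping)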
Comment 4 • Reporter • 4 years ago
For the second thing, I understand that you may decide to wait for a long time, but in that case a "big" sleep is not a solution, because it can't be interrupted.
Something that could be done, for example, if you want to wait for 60 s:

import time

next_try = time.monotonic() + 60
while time.monotonic() < next_try:
    time.sleep(1)  # wake up in short slices
    if done:       # task already DONE? stop waiting early
        break
Alternatively, something like a Python threading.Condition could perhaps be used.
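For instance (a sketch, not a patch; the names are illustrative, and `done` would be set by whatever marks the task finished):

import threading

cond = threading.Condition()
done = False  # flipped to True (under the lock) when the task finishes

def wait_for_retry(timeout=60):
    # Blocks for up to `timeout` seconds, but wakes up immediately
    # when another thread sets `done` and notifies the condition.
    with cond:
        cond.wait_for(lambda: done, timeout=timeout)

def mark_done():
    global done
    with cond:
        done = True
        cond.notify_all()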
Comment 5 • 4 years ago