Closed Bug 1676712 Opened 4 years ago Closed 4 years ago

PingUploadWorker frozen to death for at least 16 hours due to regression caused by PR1240

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: florent, Assigned: janerik)

References

Details

Attachments

(2 files)

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0

Steps to reproduce:

I tried to pass the automatic tests of moz-phab with the instructions provided here:
https://github.com/mozilla-conduit/review
Basically, I tried to execute the following command:
venv/bin/pytest -vv tests/test_bmo.py

Actual results:

========================================================================== test session starts ==========================================================================
platform linux -- Python 3.8.5, pytest-5.2.1, py-1.8.0, pluggy-0.13.0 -- /.../venv/bin/python3
cachedir: .pytest_cache
rootdir: /.../review-mozphab, inifile: pytest.ini
collected 242 items

tests/test_bmo.py::test_get PASSED [ 0%]
tests/test_bmo.py::test_whoami

^C
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

The second test will be stuck forever until I would do a ctrl+c.

In fact, it will be locked mozphab fixture_reset_glean that is trying to call the following Glean_sdk function:
glean.testing.reset_glean(application_id="mozphab", application_version="0.1.86")

And reset_glean will be stuck waiting for all "glean" process to terminate.
And it is stuck, because one process with PingUploadWorker will be stuck

Expected results:

PingUploadWorker should have returned in a few seconds maximum.
And so the subprocess would have been terminated, and reset_glean could return and not be stuck for half a day.

I think that this issue is a regression coming from the following Pull Request that was merged:
https://github.com/mozilla/glean/pull/1240

First of all, it looks like that its author didn't realize that the sleep function does not always take "milliseconds" as argument in all programming languages....
Python sleep function expects "seconds".
And I think that it is the same thing in swift.

So, in the end, if you ping "thingy" is not sent immediately after the sub-process is started, we will have to wait 60000 seconds (~16h) to have the first retry.

That being said, even if it was 60s for the first retry, it does not make any sense to wait so long for a first retry especially as before it was between 100ms to 1s depending on the language.

If it was wanted to have a longer backoff, at least it should be something like a progressive backoff: 0.1s, 1s, 5s.
But, in my opinion, having longer backoff delay is a really bad idea, as you are completely locked by the "sleep" for that duration.
And so, you will wait unconditionally and uselessly for the rest of the timer despite the task tag being now already in DONE or UPLOAD.

Wow, I'm not sure how we missed the different in units (and here's my wish for types with units ...)

For the second thing: we should only sleep when the system returned wait, which is an explicit signal that we're throttled. We didn't use a sliding window yet, guess we actually should to make our lifes easier.
Will review the PR today and determine a way forward.

For the second thing, I understand that it can be decided to wait for a long time, but in that case, a "big" sleep is not a solution as it can't be interrupted.

Something that can be done for example, could be something like this:
if you want to wait for 60s:

next_try = now() + 60s
while now() < next_try:
    sleep(1s)
    if DONE:
        break

or something like python condition (mutex) could maybe be used.

See Also: → 1676853
See Also: → 1647630
Assignee: nobody → jrediger
Priority: -- → P1
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: