PingUploadWorker frozen for at least 16 hours due to a regression caused by PR 1240
Categories
(Data Platform and Tools :: Glean: SDK, defect, P1)
Tracking
(Not tracked)
People
(Reporter: florent, Assigned: janerik)
References
Details
Attachments
(2 files)
User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0
Steps to reproduce:
I tried to run the automated tests of moz-phab, following the instructions provided here:
https://github.com/mozilla-conduit/review
Basically, I tried to execute the following command:
venv/bin/pytest -vv tests/test_bmo.py
Actual results:
========================================================================== test session starts ==========================================================================
platform linux -- Python 3.8.5, pytest-5.2.1, py-1.8.0, pluggy-0.13.0 -- /.../venv/bin/python3
cachedir: .pytest_cache
rootdir: /.../review-mozphab, inifile: pytest.ini
collected 242 items
tests/test_bmo.py::test_get PASSED [ 0%]
tests/test_bmo.py::test_whoami
^C
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The second test hangs forever, until I press Ctrl+C.
In fact, it is blocked in mozphab's fixture_reset_glean fixture, which calls the following Glean SDK function:
glean.testing.reset_glean(application_id="mozphab", application_version="0.1.86")
reset_glean is then stuck waiting for all Glean processes to terminate, and it stays stuck because the subprocess running PingUploadWorker never exits.
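To make the failure mode concrete, here is a simplified sketch of the shape of the hang (not Glean's actual code; I'm assuming the uploader runs as a subprocess that reset_glean joins):

import multiprocessing
import time

def ping_upload_worker():
    # Stand-in for PingUploadWorker: with the unit bug from PR 1240,
    # the retry delay is effectively time.sleep(60000), i.e. ~16 hours.
    time.sleep(60000)

if __name__ == "__main__":
    worker = multiprocessing.Process(target=ping_upload_worker)
    worker.start()
    # reset_glean blocks like this, waiting for the worker to exit,
    # so the whole test suite hangs for as long as the worker sleeps.
    worker.join()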
Expected results:
PingUploadWorker should have returned within a few seconds at most.
The subprocess would then have terminated, and reset_glean could have returned instead of being stuck for half a day.
Comment 1 • Reporter • 4 years ago
I think this issue is a regression introduced by the following pull request, which was recently merged:
https://github.com/mozilla/glean/pull/1240
First of all, it looks like the author didn't realize that the sleep function does not take milliseconds in every programming language: Python's sleep expects seconds, and I think the same is true in Swift.
So, in the end, if a ping is not sent immediately after the subprocess starts, we have to wait 60000 seconds (~16 h) before the first retry.
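To illustrate the unit mix-up (the constant name here is made up; the real value lives in the PR):

import time

RETRY_DELAY = 60000  # presumably meant as milliseconds

# Python's time.sleep takes SECONDS, not milliseconds,
# so this sleeps for 60000 s, i.e. roughly 16h40m:
time.sleep(RETRY_DELAY)

# What was presumably intended (60 seconds):
time.sleep(RETRY_DELAY / 1000)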
That being said, even if it were 60 s, it makes no sense to wait so long for a first retry, especially since the previous delay was between 100 ms and 1 s depending on the language.
If a longer backoff was intended, it should at least be progressive: 0.1 s, 1 s, 5 s.
But, in my opinion, a long backoff delay is a really bad idea, because the worker is completely blocked by the sleep for that duration: it waits unconditionally and uselessly for the rest of the timer even when the task is already in the DONE or UPLOAD state.
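As a sketch of what I mean (illustrative only; retry_upload and the schedule are made up), a progressive backoff that can still be cut short:

import threading

done = threading.Event()  # set by whoever marks the task DONE

def retry_with_backoff(retry_upload):
    # Progressive backoff: 0.1 s, 1 s, 5 s between attempts.
    # Event.wait() returns True as soon as `done` is set, so we
    # never sit out the remainder of a delay for nothing.
    for delay in (0.1, 1.0, 5.0):
        if done.wait(timeout=delay):
            return  # already DONE; stop retrying
        retry_upload()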
Comment 2 • 4 years ago
Comment 3 • Assignee • 4 years ago
Wow, I'm not sure how we missed the difference in units (and here's my wish for types with units...).
For the second thing: we should only sleep when the system returns wait, which is an explicit signal that we're throttled. We don't use a sliding window yet; I guess we actually should, to make our lives easier.
Will review the PR today and determine a way forward.
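In loop form, the intended policy reads something like this (an illustrative sketch, not the SDK's actual API; the task objects and field names are made up):

import time

def upload_loop(get_next_task, try_upload):
    # Only sleep when the core explicitly signals "wait" (throttled);
    # otherwise keep draining upload tasks until "done".
    while True:
        task = get_next_task()
        if task.kind == "done":
            break
        if task.kind == "wait":
            time.sleep(task.delay_ms / 1000)  # delay given in ms
        elif task.kind == "upload":
            try_upload(task.ping)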
Comment 4 • Reporter • 4 years ago
For the second thing, I understand that you may decide to wait for a long time, but in that case a "big" sleep is not a solution, because it can't be interrupted.
Something that could be done, for example, if you want to wait for 60 s:

import time

next_try = time.monotonic() + 60
while time.monotonic() < next_try:
    time.sleep(1)  # wake up in short slices
    if done:       # task already DONE? stop waiting early
        break
Alternatively, something like a Python threading.Condition could perhaps be used.
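For instance (a sketch, not a patch; the names are illustrative, and `done` would be set by whatever marks the task finished):

import threading

cond = threading.Condition()
done = False  # flipped to True (under the lock) when the task finishes

def wait_for_retry(timeout=60):
    # Blocks for up to `timeout` seconds, but wakes up immediately
    # when another thread sets `done` and notifies the condition.
    with cond:
        cond.wait_for(lambda: done, timeout=timeout)

def mark_done():
    global done
    with cond:
        done = True
        cond.notify_all()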
Comment 5 • 4 years ago