Open Bug 1628294 Opened 5 years ago Updated 4 years ago

Race condition in multiprocessing ping uploading

Categories

(Data Platform and Tools :: Glean: SDK, defect, P4)


Tracking

(Not tracked)

People

(Reporter: mdroettboom, Unassigned)

Details

(Whiteboard: [telemetry:glean-rs:backlog])

It's currently possible for multiple processes to handle ping uploading at the same time.

The recent multiprocessing changes make this more likely to occur, but this has always technically been possible if multiple Python processes share the same Glean data directory, since the OS doesn't enforce a single master application process at a time as it does on mobile.

Because of this, I think we'll need to implement the fix using filesystem-level locking, since we can't control, say, multiple instances of the same application running at the same time.

Assignee: nobody → mdroettboom

I am just realising that the new upload mechanism will hold the state of pending pings in memory, which could even lead to duplicate pings being sent. E.g., two processes start up, scan the directory at the same time, and then both go on to send the same pings out.

In general, right now it at least works as expected: no duplicate pings, a bit of logging, but no failure or crash.

After talking to :janerik in person -- given the known brokenness of filesystem locking in general, we should probably avoid that minefield.

:janerik -- can you elaborate on how the new upload mechanism avoids the race condition by loading everything into memory? If one process is deleting the files, while the other is loading into memory, wouldn't we have the same case? (That a file goes missing while we're still enumerating the directory?) Or would we just skip the file rather than throwing an error in that case?

Flags: needinfo?(jrediger)
Assignee: mdroettboom → nobody

That's the thing: the new upload mechanism might make this worse (for some definition of "worse"; it's actually not that bad):

  1. On startup all files are read from disk and stored in memory: doc ID, submission path, and ping payload.
     The files are left on disk.
  2. When an uploader asks for a ping, it gets the data from memory, submits it, and then reports back the status.
  3. If the upload was successful, the file is removed from disk.
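The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual Glean implementation; the directory name, file layout (submission path on the first line, payload after), and function names are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical pending-pings directory; the real Glean layout differs.
PENDING_PINGS_DIR = Path("glean_data") / "pending_pings"

def scan_pending_pings(directory: Path = PENDING_PINGS_DIR):
    """Step 1: read all pending pings from disk into memory.

    The files are left on disk so pings survive a crash before upload.
    """
    pending = []
    for ping_file in directory.glob("*"):
        doc_id = ping_file.name
        # Assumed file layout: first line is the submission path,
        # the rest is the ping payload.
        submission_path, _, payload = ping_file.read_text().partition("\n")
        pending.append((doc_id, submission_path, payload))
    return pending

def on_upload_success(doc_id: str, directory: Path = PENDING_PINGS_DIR) -> None:
    """Step 3: remove the file only after a successful upload."""
    (directory / doc_id).unlink()
```

Step 2 (handing the in-memory data to an uploader) is omitted; the point is that the disk is only touched at startup and after a successful upload.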

Now if two processes run at the same time, they will both upload the same ping and then both try to remove the file. One will succeed; the other won't (the race condition).
Apart from the duplicate send, this shouldn't be a problem: we log an error and continue.
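If we decide the missing file is an expected outcome of the race rather than an error, the delete step could tolerate it explicitly. A minimal sketch (the function name is made up for illustration; on Python 3.8+ `Path.unlink(missing_ok=True)` does the same thing):

```python
from contextlib import suppress
from pathlib import Path

def remove_uploaded_ping(ping_file: Path) -> None:
    """Delete a pending ping file, tolerating the multi-process race.

    If another process already uploaded and deleted the same ping,
    unlink() raises FileNotFoundError; treat that as "already handled"
    instead of logging an error.
    """
    with suppress(FileNotFoundError):
        ping_file.unlink()
```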

To avoid the race condition between multiple processes, we would need to coordinate those processes, not just files.
File locking could do that (the first process to start "locks" the whole directory for its complete runtime, and the second one does no work. But then what happens if process 1 crashes and doesn't unlock? Yup, exactly: we're stuck).
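To make the stale-lock hazard concrete, here is a deliberately naive sketch of such a whole-directory lock using an exclusively created PID file (the lock-file name and function are invented for this example):

```python
import os
from pathlib import Path

def try_acquire_lock(lock_file: Path) -> bool:
    """Naive whole-directory lock via an exclusively created PID file.

    Illustrates the failure mode described above: if the holder crashes
    without deleting the file, every later process sees the lock as held
    and gives up forever.
    """
    try:
        # O_EXCL makes creation atomic: exactly one process wins.
        fd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Someone else holds the lock -- or crashed while holding it.
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True
```

Recovering from a crashed holder would require checking whether the recorded PID is still alive, which has its own races (PID reuse), which is why this whole approach is a minefield.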

Without a proper solution at hand, maybe it's fine to ignore that for now? Maybe we just reduce the noise when we can't delete the file? We're not losing data, we're just getting more (and the pipeline might be able to de-dup it).

Flags: needinfo?(jrediger)

Editing bug description to remove Python: This is probably true for every platform where multiple ping uploaders could run at the same time (maybe iOS as in bug 1625157).

Summary: Python: Race condition in multiprocessing ping uploading → Race condition in multiprocessing ping uploading

(In reply to Jan-Erik Rediger [:janerik] from comment #3)

> Without a proper solution at hand, maybe it's fine to ignore that for now? Maybe we just reduce the noise when we can't delete the file? We're not losing data, we're just getting more (and the pipeline might be able to de-dup it).

We should avoid sending duplicates as much as possible, but for now I see this being a problem mostly in Python.

I can imagine that using lock/PID files may still be somewhat fragile, right?

As :janerik mentioned above, moving to the new uploading mechanism that loads all pings into memory before sending helps with this somewhat (but not 100%).

I'm moving this to the backlog. To hit this, the application would have to have multiple instances running at the same time, and even then the worst case is duplicate pings being sent.

Priority: P1 → P3
Whiteboard: [telemetry:glean-rs:m13][glean-py] → [telemetry:glean-rs:backlog]
Priority: P3 → P4