since 2022-09-09, tecken nodes monotonically use disk and then run out of space
Categories
(Tecken :: Upload, defect, P1)
Tracking
(Not tracked)
People
(Reporter: willkg, Assigned: willkg)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
We had a couple of out-of-disk-space errors today from the same tecken node. Looking into it further, if we zoom out, we see tecken nodes increase their disk usage over time.
Does Tecken clean up files correctly after successful and unsuccessful symbol uploads? If not, do those files ever get cleaned up, or do they sit around accumulating on the node forever?
Assignee
Comment 1•2 years ago
Grabbing this to research. If there's something easy we can do now, we should do that. Otherwise, we should wait for the upload code overhaul in bug #1652835.
Assignee
Comment 2•2 years ago
The upload view is wrapped in a decorator that creates a temporary directory and removes it and all its contents if something fails. That seems fine.
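For context, the general shape of that pattern is something like this minimal sketch (not Tecken's actual decorator; the names are made up). The important property is that cleanup runs inside the request handler, so it only happens if the worker process survives long enough to reach it:

```python
import functools
import shutil
import tempfile


def with_upload_tempdir(view):
    """Create a per-request temp dir and clean it up afterwards.

    Hypothetical sketch of the pattern described above, not Tecken's
    actual code.
    """
    @functools.wraps(view)
    def wrapper(request, *args, **kwargs):
        tmpdir = tempfile.mkdtemp(prefix="symbols-upload-")
        try:
            return view(request, tmpdir, *args, **kwargs)
        finally:
            # This never runs if gunicorn kills the worker process
            # mid-request (e.g. on timeout), leaving tmpdir behind.
            shutil.rmtree(tmpdir, ignore_errors=True)
    return wrapper
```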
The tecken app code runs in gunicorn using the sync worker (the default) with no `max_requests` setting.
The timeout is set to 5 minutes. If the timeout is hit, then gunicorn will kill the process and that'll (probably) leave stuff around on disk. There isn't another process to clean things up. As near as I can tell, the temp directory prefix doesn't include the process id.
I think we shouldn't have a 5 minute timeout. Tecken upload requests are long and gunicorn shouldn't be killing them. The timeout should well exceed the other timeouts for the request (nginx, elb, etc).
I think for a first pass, I want to double the timeout.
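As a sketch, doubling the timeout would be a one-line change in a gunicorn config file (the values here are illustrative; the real settings live in the deploy configuration):

```python
# gunicorn.conf.py -- illustrative values only; the production settings
# are managed in the deploy configuration.
worker_class = "sync"  # the default worker mentioned above

# Raise the request timeout from 5 minutes to 10 so gunicorn stops
# killing long-running symbol uploads before the other timeouts in the
# request path (nginx, ELB, etc.) apply.
timeout = 600
```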
Assignee
Comment 3•2 years ago
We can probably emit a metric when gunicorn kills a process because of the timeout:
https://docs.gunicorn.org/en/stable/settings.html#worker-abort
If we add that first, we can probably use that to prove the hypothesis.
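A sketch of what that hook could look like in the gunicorn config, assuming a markus-style metrics client (the metric name and namespace here are assumptions, not Tecken's actual configuration):

```python
# gunicorn.conf.py -- sketch only. markus would also need to be
# configured with a statsd/Datadog backend for this to emit anything.
import markus

metrics = markus.get_metrics("tecken")


def worker_abort(worker):
    # gunicorn calls this server hook when a worker receives SIGABRT,
    # which is generally what happens when the request timeout fires.
    metrics.incr("gunicorn_worker_abort")
```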
Assignee
Comment 4•2 years ago
I fixed the graphs so the min is 0 and the max is 110, which makes it easier to eyeball disk usage over months. You can see monotonically increasing disk usage over time for all the nodes going back months. However, there aren't instances of "omg--ran out of disk!" until after 2022-09-09 or so. (I'll attach a screenshot.)
The recent deploy was 2022.09.01 (https://bugzilla.mozilla.org/show_bug.cgi?id=1788705). That had a handful of library updates, but nothing that would affect the Tecken app, so I don't currently believe that the update caused problems.
Rather, that deploy pushed out support for inline functions in Eliot and then I did a prod deploy for Socorro which added support for inline functions to the stackwalker. That unblocked the change to dump_syms to include inline information. We're pretty sure that symbol files for Firefox builds after that point are much larger. I don't know how that affects upload handling.
I'm looking at the upload data now to see what I can see in it. Can we see an increase in symbol upload payload sizes? Can we see an increase in failed upload attempts / timeouts? Etc.
Assignee
Comment 5•2 years ago
Assignee
Updated•2 years ago
Assignee
Comment 6•2 years ago
I pulled down the upload records in Tecken and grouped them by `created_at` day. We get a lot of symbol uploads from a variety of things, but the ones I wanted to look at are Firefox uploads and whether we can see data to answer these questions (a rough sketch of the aggregation follows the list):
- Are uploads getting larger over time, especially after we changed dump_syms to output inline function information?
- Is the number of incomplete symbol uploads increasing?
- Is the number of symbol upload retries increasing?
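Roughly, the per-day aggregation behind the table is something like this sketch (my own illustration, not the actual analysis code; the field names, and the assumption that large_mean/large_median are computed over the uploads larger than 1.5gb, are mine):

```python
# Sketch of the per-day aggregation. Assumes each record is a dict
# pulled from the upload_uploads table with "created_at" (datetime),
# "completed_at" (datetime or None), and "size" (bytes) keys -- the
# field names are assumptions.
from collections import defaultdict
from statistics import mean, median

LARGE = 1_500_000_000  # 1.5gb cutoff, in bytes


def summarize_by_day(records):
    by_day = defaultdict(list)
    for rec in records:
        by_day[rec["created_at"].date()].append(rec)

    rows = []
    for day, recs in sorted(by_day.items()):
        sizes = [r["size"] for r in recs]
        large = [s for s in sizes if s > LARGE]
        rows.append({
            "date": day,
            "total": len(recs),
            "num_large": len(large),
            "largest": max(sizes),
            "large_mean": int(mean(large)) if large else 0,
            "large_median": int(median(large)) if large else 0,
            "incomplete": sum(1 for r in recs if r["completed_at"] is None),
        })
    return rows
```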
Statistics on the data for 2022-08-01 through 2022-09-15:
date | total | # > 1.5gb (%) | largest (bytes) | large_mean (bytes) | large_median (bytes) | incomplete | retries |
---|---|---|---|---|---|---|---|
2022-08-01 | 368 | 7 (2%) | 1,576,903,409 | 1,567,346,745 | 1,565,997,640 | 0 | 50 ** |
2022-08-02 | 516 | 22 (4%) | 1,745,710,279 | 1,593,816,537 | 1,561,064,084 | 1 ** | 78 ** |
2022-08-03 | 397 | 6 (2%) | 1,737,134,926 | 1,606,841,206 | 1,570,585,792 | 0 | 53 ** |
2022-08-04 | 519 | 5 (1%) | 1,577,553,351 | 1,556,057,992 | 1,562,683,176 | 1 ** | 76 ** |
2022-08-05 | 242 | 7 (3%) | 1,745,912,052 | 1,582,702,699 | 1,562,650,105 | 0 | 34 |
2022-08-06 | 160 | 4 (2%) | 1,746,446,640 | 1,605,762,392 | 1,564,373,941 | 0 | 21 |
2022-08-07 | 134 | 3 (2%) | 1,727,490,725 | 1,612,724,496 | 1,562,831,666 | 0 | 24 |
2022-08-08 | 370 | 2 (1%) | 1,562,786,047 | 1,555,314,904 | 1,555,314,904 | 0 | 42 |
2022-08-09 | 515 | 10 (2%) | 1,746,567,744 | 1,597,133,634 | 1,566,503,841 | 0 | 68 ** |
2022-08-10 | 361 | 10 (3%) | 1,746,098,211 | 1,601,122,417 | 1,577,491,940 | 0 | 49 |
2022-08-11 | 553 | 23 (4%) | 1,756,919,860 | 1,607,231,930 | 1,577,198,021 | 0 | 79 ** |
2022-08-12 | 314 | 8 (3%) | 1,747,959,178 | 1,605,184,042 | 1,566,210,165 | 0 | 41 |
2022-08-13 | 165 | 7 (4%) | 1,581,368,389 | 1,574,916,091 | 1,581,368,389 | 2 ** | 39 |
2022-08-14 | 154 | 2 (1%) | 1,565,488,569 | 1,558,816,703 | 1,558,816,703 | 0 | 21 |
2022-08-15 | 380 | 8 (2%) | 1,751,433,548 | 1,615,560,198 | 1,575,999,559 | 0 | 54 ** |
2022-08-16 | 376 | 12 (3%) | 1,751,488,788 | 1,597,988,334 | 1,575,625,224 | 0 | 44 |
2022-08-17 | 287 | 13 (5%) | 1,763,678,614 | 1,581,321,682 | 1,566,678,881 | 0 | 41 |
2022-08-18 | 495 | 21 (4%) | 1,758,133,505 | 1,615,566,074 | 1,582,184,323 | 0 | 75 ** |
2022-08-19 | 291 | 5 (2%) | 1,732,428,538 | 1,586,308,615 | 1,551,288,918 | 0 | 40 |
2022-08-20 | 139 | 2 (1%) | 1,564,698,282 | 1,558,229,630 | 1,558,229,630 | 0 | 20 |
2022-08-21 | 91 | 2 (2%) | 1,564,708,633 | 1,558,278,964 | 1,558,278,964 | 0 | 10 |
2022-08-22 | 506 | 11 (2%) | 1,756,457,550 | 1,622,764,433 | 1,580,719,435 | 0 | 67 ** |
2022-08-23 | 514 | 20 (4%) | 1,761,337,101 | 1,596,614,213 | 1,566,674,656 | 1 ** | 80 ** |
2022-08-24 | 452 | 9 (2%) | 1,580,946,776 | 1,562,992,787 | 1,564,757,649 | 0 | 54 ** |
2022-08-25 | 415 | 6 (1%) | 1,580,827,951 | 1,550,335,272 | 1,558,500,759 | 0 | 50 ** |
2022-08-26 | 380 | 9 (2%) | 1,581,424,375 | 1,540,914,378 | 1,554,163,400 | 0 | 51 ** |
2022-08-27 | 138 | 3 (2%) | 1,566,849,056 | 1,541,293,599 | 1,556,111,623 | 0 | 24 |
2022-08-28 | 139 | 5 (4%) | 1,580,838,803 | 1,554,773,989 | 1,566,880,690 | 0 | 19 |
2022-08-29 | 652 | 28 (4%) | 1,757,737,547 | 1,567,978,328 | 1,564,860,868 | 2 ** | 103 ** |
2022-08-30 | 315 | 8 (3%) | 1,566,904,766 | 1,528,253,263 | 1,520,962,221 | 0 | 46 |
2022-08-31 | 307 | 9 (3%) | 1,757,695,313 | 1,563,202,801 | 1,560,336,790 | 0 | 34 |
2022-09-01 | 379 | 16 (4%) | 1,757,712,278 | 1,594,870,256 | 1,574,171,165 | 0 | 59 ** |
2022-09-02 | 351 | 22 (6%) | 1,880,381,770 ** | 1,644,686,469 | 1,584,237,154 | 2 ** | 55 ** |
2022-09-03 | 179 | 4 (2%) | 1,575,003,472 | 1,537,121,459 | 1,533,869,393 | 0 | 29 |
2022-09-04 | 135 | 4 (3%) | 1,575,039,782 | 1,537,046,045 | 1,533,780,616 | 0 | 22 |
2022-09-05 | 201 | 4 (2%) | 1,575,029,388 | 1,537,105,078 | 1,533,869,986 | 0 | 37 |
2022-09-06 | 429 | 21 (5%) | 1,755,485,569 | 1,591,760,028 | 1,569,847,829 | 0 | 64 ** |
2022-09-07 | 284 | 13 (5%) | 1,881,787,977 ** | 1,656,032,726 | 1,576,388,878 | 0 | 46 |
2022-09-08 | 402 | 64 (16%) | 2,093,423,277 ** | 1,651,129,629 | 1,598,509,649 | 5 ** | 71 ** |
2022-09-09 | 294 | 65 (22%) | 1,863,099,416 ** | 1,643,485,523 | 1,597,164,988 | 8 ** | 52 ** |
2022-09-10 | 220 | 56 (25%) | 2,093,424,033 ** | 1,652,239,917 | 1,612,249,382 | 3 ** | 36 |
2022-09-11 | 173 | 32 (18%) | 2,110,969,213 ** | 1,682,474,218 | 1,646,073,222 | 0 | 30 |
2022-09-12 | 363 | 54 (15%) | 1,864,537,012 ** | 1,636,018,866 | 1,608,426,648 | 3 ** | 55 ** |
2022-09-13 | 290 | 85 (29%) | 1,886,950,967 ** | 1,656,497,723 | 1,616,254,826 | 5 ** | 46 |
2022-09-14 | 264 | 65 (25%) | 1,886,903,353 ** | 1,647,921,235 | 1,608,794,966 | 8 ** | 34 |
2022-09-15 | 266 | 37 (14%) | 1,866,327,546 ** | 1,642,404,826 | 1,606,754,392 | 8 ** | 80 ** |
I don't know whether there are symbol upload attempts that fail that aren't represented in this data. If that's important (and it might be), we should capture metrics in the upload_symbols code in m-c.
However, I think the data we have in the uploads table lets us answer the questions I posed and also lets us see whether changes in Tecken help/hinder the situation.
- Are uploads getting larger over time, especially after we changed dump_syms to output inline function information?

  The `size` field of the `upload_uploads` table is the size of the data in the `multipart/form-data` field for `symbols.zip`. An HTTP POST should only contain one zip file.

  Yes. The number of symbol uploads > 1.5gb increased after 2022-09-07, and the size of the largest symbol upload increased and occasionally exceeds 2gb.

- Is the number of incomplete symbol uploads increasing?

  An incomplete is when there's a `created_at` datestamp but no `completed_at` datestamp. I'm pretty sure this happens when the upload times out in a way that kills the gunicorn process. I need to look into this more.

  Yes. Incompletes have happened before, but there's a clear increase in likelihood after 2022-09-07.

- Is the number of symbol upload retries increasing?

  Symbol upload requests have no helpful metadata (product, version, release channel, buildid, etc.), so we have to invent a unique key. I'm using (email address, symbol upload size, date) as the key: if two records with the same email address and symbol upload size occur on the same day, I treat that as a retry attempt. (A short sketch of this heuristic follows the list.)

  Bug #1789414 increased the timeout for the upload_symbols job, giving it more time to do retries. That landed on 2022-09-06.

  I think the data here is less clear. We had retries before and we continue to have retries, but it's not obviously increasing after 2022-09-06.
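A sketch of that retry heuristic (again my own illustration; the field names and the exact counting rule are assumptions):

```python
# Sketch of the retry heuristic: records sharing (uploader email,
# payload size, upload date) are treated as attempts at the same upload.
from collections import Counter


def count_retries(records):
    attempts = Counter(
        (rec["email"], rec["size"], rec["created_at"].date())
        for rec in records
    )
    # Each group of n records with the same key counts as n - 1 retries.
    return sum(n - 1 for n in attempts.values() if n > 1)
```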
Assignee
Comment 7•2 years ago
Assignee
Comment 8•2 years ago
Assignee
Comment 9•2 years ago
The gunicorn timeout change went out yesterday with bug #1791389.
I checked the dashboard and my data today. I don't see disk usage going up abnormally. There were two incompletes on 2022-09-19, but at least one of those happened before the deploy. The other might have happened either just before or during the deploy. There are no incompletes so far for 2022-09-20. Things look good so far....
Assignee
Comment 10•2 years ago
I checked the dashboard and my data today.
Per the dashboard, the disk usage for all nodes is still under 40%. There's no case where disk usage for one node is going up abnormally. That's good.
Per upload data, there were 2 incompletes on 2022-09-19, 0 on 2022-09-20, and 1 on 2022-09-21 (today).
date | total | # > 1.5gb (%) | largest (bytes) | large_mean (bytes) | large_median (bytes) | incomplete | retries | notes |
---|---|---|---|---|---|---|---|---|
2022-09-01 | 379 | 16 (4%) | 1,757,712,278 | 1,594,870,256 | 1,574,171,165 | 0 | 59 ** | |
2022-09-02 | 351 | 22 (6%) | 1,880,381,770 ** | 1,644,686,469 | 1,584,237,154 | 2 ** | 55 ** | |
2022-09-03 | 179 | 4 (2%) | 1,575,003,472 | 1,537,121,459 | 1,533,869,393 | 0 | 29 | |
2022-09-04 | 135 | 4 (3%) | 1,575,039,782 | 1,537,046,045 | 1,533,780,616 | 0 | 22 | |
2022-09-05 | 201 | 4 (2%) | 1,575,029,388 | 1,537,105,078 | 1,533,869,986 | 0 | 37 | |
2022-09-06 | 429 | 21 (5%) | 1,755,485,569 | 1,591,760,028 | 1,569,847,829 | 0 | 64 ** | |
2022-09-07 | 284 | 13 (5%) | 1,881,787,977 ** | 1,656,032,726 | 1,576,388,878 | 0 | 46 | |
2022-09-08 | 402 | 64 (16%) | 2,093,423,277 ** | 1,651,129,629 | 1,598,509,649 | 5 ** | 71 ** | |
2022-09-09 | 294 | 65 (22%) | 1,863,099,416 ** | 1,643,485,523 | 1,597,164,988 | 8 ** | 52 ** | |
2022-09-10 | 220 | 56 (25%) | 2,093,424,033 ** | 1,652,239,917 | 1,612,249,382 | 3 ** | 36 | |
2022-09-11 | 173 | 32 (18%) | 2,110,969,213 ** | 1,682,474,218 | 1,646,073,222 | 0 | 30 | |
2022-09-12 | 363 | 54 (15%) | 1,864,537,012 ** | 1,636,018,866 | 1,608,426,648 | 3 ** | 55 ** | |
2022-09-13 | 290 | 85 (29%) | 1,886,950,967 ** | 1,656,497,723 | 1,616,254,826 | 5 ** | 46 | |
2022-09-14 | 264 | 65 (25%) | 1,886,903,353 ** | 1,647,921,235 | 1,608,794,966 | 8 ** | 34 | |
2022-09-15 | 246 | 33 (13%) | 1,866,327,546 ** | 1,653,076,830 | 1,608,731,289 | 8 ** | 60 ** | |
2022-09-16 | 179 | 60 (34%) | 2,115,394,298 ** | 1,663,924,918 | 1,627,238,448 | 11 ** | 32 | |
2022-09-17 | 125 | 26 (21%) | 1,866,764,012 ** | 1,644,074,572 | 1,615,446,013 | 1 ** | 21 | |
2022-09-18 | 97 | 25 (26%) | 1,866,796,714 ** | 1,640,863,803 | 1,603,346,111 | 3 ** | 16 | |
2022-09-19 | 291 | 41 (14%) | 1,888,487,575 ** | 1,640,004,775 | 1,584,311,996 | 2 ** | 52 ** | timeout fix deployed |
2022-09-20 | 307 | 68 (22%) | 2,092,021,812 ** | 1,658,221,868 | 1,628,013,860 | 0 | 59 ** | |
2022-09-21 | 146 | 37 (25%) | 1,952,903,687 ** | 1,683,191,141 | 1,652,180,949 | 1 ** | 44 | partial day |
Incompletes can happen for reasons other than gunicorn killing the process, and we have seen them in the past (2022-09-02, for example). The fact that disk usage is normal is good. If that continues for the rest of the week, then we can close this out.
Assignee
Comment 11•2 years ago
I continued to watch the graphs and upload data and everything continues to look fine. Marking as FIXED.