Closed Bug 1894925 Opened 15 days ago Closed 8 days ago

waitfor for localstack in bin/test.sh sometimes fails in CI

Categories

(Socorro :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Details

Attachments

(1 file)

On Wednesday, May 1st in pr 6594, I updated localstack from 1.1.0 to 3.4.0. Since then, the socorro tests fail periodically when running in CircleCI in bin/test.sh at the waitfor line for localstack.

Example failure:

https://app.circleci.com/pipelines/github/mozilla-services/socorro/3161/workflows/fb414386-c8d7-4ef3-8079-bf1fa5f7d0d7/jobs/9259

make: 'my.env' is up to date.
[+] Running 105/5
 ✔ pubsub 6 layers [⣿⣿⣿⣿⣿⣿]      0B/0B      Pulled 57.0s  
 ✔ postgresql 13 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿]      0B/0B      Pulled 29.0s  
 ✔ statsd 39 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿]      0B/0B      Pulled 56.6s  
 ✔ elasticsearch 16 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿]      0B/0B      Pulled 34.5s  
 ✔ localstack 26 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿]      0B/0B      Pulled 57.4s  
[+] Running 8/8
 ✔ Network socorro_default            Created0.0s 
 ✔ Container socorro-localstack-1     Created0.6s  
 ✔ Container socorro-fakesentry-1     Created0.6s  
 ✔ Container socorro-postgresql-1     Created0.6s  
 ✔ Container socorro-gcs-emulator-1   Created0.6s  
 ✔ Container socorro-statsd-1         Created0.6s  
 ✔ Container socorro-elasticsearch-1  Created0.6s  
 ✔ Container socorro-pubsub-1         Created0.6s  
[+] Running 7/7
 ✔ Container socorro-elasticsearch-1  Started1.0s  
 ✔ Container socorro-statsd-1         Started1.1s  
 ✔ Container socorro-gcs-emulator-1   Started1.0s  
 ✔ Container socorro-localstack-1     Started1.1s  
 ✔ Container socorro-pubsub-1         Started1.1s  
 ✔ Container socorro-postgresql-1     Started0.8s  
 ✔ Container socorro-fakesentry-1     Started0.9s  
^@^@+ echo '>>> set up environment'
>>> set up environment
+ DATABASE_URL=postgres://postgres:postgres@postgresql:5432/socorro_test
+ SENTRY_DSN=http://public@fakesentry:8090/1
+ ELASTICSEARCH_URL=http://elasticsearch:9200
+ LOCAL_DEV_AWS_ENDPOINT_URL=http://localstack:4566/
+ export PYTHONPATH=/app/:/app
+ PYTHONPATH=/app/:/app
++ which pytest
+ PYTEST=/usr/local/bin/pytest
++ which python
+ PYTHON=/usr/local/bin/python
+ echo '>>> wait for services to be ready'
>>> wait for services to be ready
+ urlwait postgres://postgres:postgres@postgresql:5432/socorro_test
+ urlwait http://elasticsearch:9200
+ urlwait http://pubsub:5010 10
+ urlwait http://gcs-emulator:8001/storage/v1/b 10
+ python ./bin/waitfor.py --verbose --codes=200,404 http://public@fakesentry:8090/1
Testing http://fakesentry:8090/1 for [200, 404] with timeout 15...
+ python ./bin/waitfor.py --timeout 20 --verbose http://localstack:4566/
Testing http://localstack:4566/ for [200] with timeout 20...
Traceback (most recent call last):
  File "/app/./bin/waitfor.py", line 84, in <module>
    sys.exit(main(sys.argv[1:]))
             ^^^^^^^^^^^^^^^^^^
  File "/app/./bin/waitfor.py", line 64, in main
    with urllib.request.urlopen(url) as resp:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/urllib/request.py", line 519, in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/urllib/request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/urllib/request.py", line 1377, in http_open
    return self.do_open(http.client.HTTPConnection, req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/urllib/request.py", line 1352, in do_open
    r = h.getresponse()
        ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/http/client.py", line 1395, in getresponse
    response.begin()
  File "/usr/local/lib/python3.11/http/client.py", line 325, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/http/client.py", line 286, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/socket.py", line 706, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: timed out

Exited with code exit status 1

On Thursday, we increased the waitfor timeout from the default 10s to 20s, but it's still failing in the same spot.

This issue covers figuring out why it's failing and fixing the issue.

One theory is that localstack hasn't finished starting up even after 20 seconds which would be weird, but maybe due to running out of resources or something. We could set it up so that the localstack startup logs are viewable and then look at that.

Another theory is that we're hitting a bug in waitfor where it's hitting a different timeout than the one we're setting to 20s.

Instead of tinkering with all that, we could ditch all the urlwait and waitfor stuff for what we did with Tecken where we moved the waiting/healthcheck bits to docker compose using depends_on.

In the meantime, "restart from start" seems to eventually work though it's a bit of a nuisance.

Grabbing this one because it's slowing down PRs.

Assignee: nobody → willkg
Status: NEW → ASSIGNED

That fix has passed 3 CI runs in a row, so that's promising. I'll keep tabs on it.

I saw a whole bunch of CI runs pass, so I think we're good now.

CI has done fine since I landed the fix. Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 8 days ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: