Closed Bug 1626490 Opened 6 years ago Closed 6 years ago

Probe Scraper DAG failing: AirflowException: Pod Launching failed: Pod returned a failure: failed

Categories

(Data Platform and Tools :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whd, Assigned: whd)

References

Details

Attachments

(1 file)

latest run

The last two days of runs have failed. I initially attributed this to the GCP issues from yesterday, but now I'm less sure. From the logs it's not clear to me whether this is an Airflow issue or a problem with probe-scraper. It appears that the pod launches and then fails (suggesting it's an issue with probe-scraper), but the logging isn't very descriptive and I'm not sure how to debug further.

:hwoo will take a look and assign to :frank if it's not an Airflow infra issue.

Assignee: nobody → hwoo

:hwoo showed me how to find these logs, which aren't showing in the Airflow UI due to a known issue. The actual issue with probe-scraper is understood and :frank is taking care of it in https://bugzilla.mozilla.org/show_bug.cgi?id=1604919.
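
(For reference, when KubernetesPodOperator logs don't surface in the Airflow UI, the pod's logs can usually be pulled straight from the cluster. A minimal sketch using the official kubernetes Python client; the "airflow" namespace and the label selector below are placeholders, not the actual values for this deployment:)

    # Sketch: read a pod's logs directly, bypassing the Airflow UI.
    # Namespace and label selector are hypothetical.
    from kubernetes import client, config

    config.load_kube_config()  # load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()

    # Find candidate pods for the DAG (label selector is an assumption).
    pods = v1.list_namespaced_pod("airflow", label_selector="dag_id=probe_scraper")
    for pod in pods.items:
        print(pod.metadata.name, pod.status.phase)

    # Dump the container logs for the first matching pod.
    print(v1.read_namespaced_pod_log(name=pods.items[0].metadata.name,
                                     namespace="airflow"))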

Assignee: hwoo → whd
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(fbertsch)
Resolution: --- → DUPLICATE

Can confirm, this hasn't been fixed yet. Tracked here: https://jira.mozilla.com/browse/EIS-1958

Flags: needinfo?(fbertsch)

:frank, we're continuing to see issues with probe-scraper, even though that JIRA issue is marked as fixed. The logs indicate some sort of git issue that could be related, but it looks different from before:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/probe_scraper/runner.py", line 449, in <module>
    main(args.cache_dir,
  File "/app/probe_scraper/runner.py", line 377, in main
    load_glean_metrics(cache_dir, out_dir, repositories_file, dry_run, glean_repo)
  File "/app/probe_scraper/runner.py", line 208, in load_glean_metrics
    commit_timestamps, repos_metrics_data, emails = git_scraper.scrape(cache_dir, repositories)
  File "/app/probe_scraper/scrapers/git_scraper.py", line 139, in scrape
    ts, commits = retrieve_files(repo_info, folder)
  File "/app/probe_scraper/scrapers/git_scraper.py", line 77, in retrieve_files
    hashes = get_commits(repo, rel_path)
  File "/app/probe_scraper/scrapers/git_scraper.py", line 36, in get_commits
    change_commits = enumerate(repo.git.log(log_format, filename).split('\n'))
  File "/usr/local/lib/python3.8/site-packages/git/cmd.py", line 542, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/git/cmd.py", line 1005, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/usr/local/lib/python3.8/site-packages/git/cmd.py", line 822, in execute
    raise GitCommandError(command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
cmdline: git log --format="%H:%ct" toolkit/components/telemetry/fog/pings.yaml
stderr: 'fatal: ambiguous argument 'toolkit/components/telemetry/fog/pings.yaml': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]''

Can you take a look?
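
(Side note: the stderr hint above points at a likely one-line mitigation in git_scraper.py's get_commits. A hedged sketch rather than the actual patch; with "--", git treats the argument as a pathspec, so a file that no longer exists in the working tree yields an empty log instead of exit code 128:)

    # Sketch of a defensive get_commits(), based on the traceback above.
    # Passing '--' tells git the final argument is a path, not a revision,
    # so a deleted file produces empty output rather than an error.
    import git

    def get_commits(repo: git.Repo, filename: str):
        log_format = "--format=%H:%ct"
        out = repo.git.log(log_format, "--", filename)
        # Empty output: no commits touched the path (or it never existed).
        return enumerate(out.split("\n")) if out else enumerate([])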

Status: RESOLVED → REOPENED
Flags: needinfo?(fbertsch)
Resolution: DUPLICATE → ---

Thanks :whd. It looks like FOG took out their pings.yaml file. Jan-Erik, Chutten, where did we move this to?

Flags: needinfo?(jrediger)
Flags: needinfo?(fbertsch)
Flags: needinfo?(chutten)

Huh, we indeed did. I just did not realize this would fail the parser (but it's kinda obvious now).
FOG has been removed and therefore there is no pings.yaml anymore. I will remove that line from the config file.

That brings up the question: should probe-scraper hard-fail on these kinds of errors?

Flags: needinfo?(jrediger)
Flags: needinfo?(chutten)

We don't have a good data deletion policy in place, and it sounds like the current solution will result in the removal of the schemas from our schemas repository, which will in turn remove these datasets from BigQuery. We should probably keep the last set of existing schemas in generated-schemas as they currently are (by whatever mechanism) until we have developed a better policy around source dataset deletion.

That said, if all relevant parties (including :mreid, whom I've CC'd) are fine with deleting this data (which sounds like prototype data), our deploy mechanisms will in fact delete (with manual operator approval) all source datasets for this data once the schemas are removed from the generated-schemas branch.

Calling in :chutten for that, but I don't see why we should keep the prototype data around.

Flags: needinfo?(chutten)

Delete it all. It is of no use.

Flags: needinfo?(chutten)

:mreid, is the deletion of prototype data from production an acceptable course of action here?

Flags: needinfo?(mreid)

Yes, in this case deleting the data is acceptable. I don't think this is generally the case when removing an application, schema, or other artifact; let's revisit if/when there's another case.

Flags: needinfo?(mreid)

> That brings up the question: should probe-scraper hard-fail on these kinds of errors?

We do want to hard-fail in the sense that the error is apparent and we deal with it. However, we do not need to block schema deploys for all errors; what should happen is that the schema for the failing repository is simply not updated. How to handle this is a broader and more difficult discussion.
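
(A rough sketch of what "the schema is simply not updated" could look like in the scrape loop; the names and structure here are illustrative, not probe-scraper's actual code:)

    # Illustrative only: isolate per-repository failures so one bad repo
    # (e.g. a deleted pings.yaml) doesn't abort the entire scrape, while
    # still surfacing the errors loudly.
    import logging

    def scrape_all(repositories, scrape_one):
        results, failed = {}, []
        for repo in repositories:
            try:
                results[repo] = scrape_one(repo)
            except Exception:
                # Skip this repo: its previously generated schema stays
                # as-is instead of being updated this run.
                logging.exception("scrape failed for %s; schema not updated", repo)
                failed.append(repo)
        # Caller deploys schemas for healthy repos and alerts on the rest,
        # rather than blocking the whole deploy.
        return results, failed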

This specific issue (actually two issues) has been resolved. The broader and more difficult discussion is a subset of the general discussion of how to deal with data deletion, and can be tracked separately.

Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → FIXED
See Also: → 1633928
Component: Scheduling → General
