Last Comment Bug 1258861 - Store log information in treeherder database, instead of generic per-project artifacts
: Store log information in treeherder database, instead of generic per-project ...
Status: RESOLVED FIXED
:
Product: Tree Management
Classification: Other
Component: Treeherder (show other bugs)
: ---
: Unspecified Unspecified
-- normal
: ---
Assigned To: William Lachance (:wlach) (use needinfo!)
:
:
Mentors:
Depends on: 1302224
Blocks: 1100414 thunder-try 1178641 1301477 1303069
  Show dependency treegraph
 
Reported: 2016-03-22 14:50 PDT by William Lachance (:wlach) (use needinfo!)
Modified: 2016-09-22 14:09 PDT (History)
4 users (show)
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
[treeherder] wlach:1273231 > mozilla:master (47 bytes, text/x-github-pull-request)
2016-05-16 12:32 PDT, Autolander
no flags Details | Review | Splinter Review
[treeherder] wlach:1258861 > mozilla:master (47 bytes, text/x-github-pull-request)
2016-07-15 14:35 PDT, Autolander
james: review+
cdawson: feedback+
emorley: feedback+
Details | Review | Splinter Review
[treeherder] wlach:1258861-pre > mozilla:master (47 bytes, text/x-github-pull-request)
2016-08-30 13:41 PDT, Autolander
no flags Details | Review | Splinter Review
Stop storing text log artifacts + remove migration script (47 bytes, text/x-github-pull-request)
2016-09-14 11:09 PDT, William Lachance (:wlach) (use needinfo!)
emorley: review+
Details | Review | Splinter Review

Description User image William Lachance (:wlach) (use needinfo!) 2016-03-22 14:50:26 PDT

    
Comment 1 User image William Lachance (:wlach) (use needinfo!) 2016-03-22 15:24:48 PDT
We know what the structure of job logs are, we should just store them in the master database using the Django ORM, much as we do for performance artifacts, and retrieve them as needed using a proper API. This will make them easier to deal with, on both the client and server side. Proposed models:

```
class JobLogStep(models.Model):

    repository = models.ForeignKey(Repository)
    job_id = models.PositiveIntegerField(db_index=True)

    SUCCESS = 0
    SKIPPED = 1
    TEST_FAILED = 2

    RESULTS = ((SUCCESS, 'success'),
               (SKIPPED, 'skipped'),
               (TEST_FAILED, 'testfailed'))

    name = models.CharField(max_length=200)
    started = models.DateTimeField()
    started_linenumber = models.PositiveIntegerField()
    finished_linenumber = models.PositiveIntegerField()
    finished = models.DateTimeField()
    error_count = models.PositiveIntegerField()
    result = models.IntegerField(choices=RESULTS)

class JobLogStepError(models.Model):

    job_log_step = models.ForeignKey(JobLogStep)
    line = models.CharField(max_length=1024)
    line_number = models.PositiveIntegerField()
```

I can think of two ways of implementing this:

1. Implement the new model, migrate existing stuff in one shot (downside: log artifacts will be temporarily inaccessible while migration completes)
2. Implement the new model, but only ingest the new data, don't change anything else (and keep on creating artifacts as before). Add a migrated column for existing artifacts (to track what has been migrated, new artifacts are migrated by default), then start the migration. When everything done, switch over to using new data and delete the artifacts.

I guess it depends if we're willing to accept a limited period where some artifacts are inaccessible. Regardless, this should be relatively easy to prototype and implement.
Comment 2 User image Ed Morley [:emorley] 2016-03-23 04:27:58 PDT
Are steps a thing we still want to keep (at all, or at least in its current form)?
The taskcluster jobs only populate some of them properly (hence the dummy steps I had to implement in bug 1188444).
Comment 3 User image William Lachance (:wlach) (use needinfo!) 2016-03-23 08:48:58 PDT
(In reply to Ed Morley (Away 24th-28th March) [:emorley] from comment #2)
> Are steps a thing we still want to keep (at all, or at least in its current
> form)?
> The taskcluster jobs only populate some of them properly (hence the dummy
> steps I had to implement in bug 1188444).

I agree, it's debatable how useful steps are. They're only used by the log viewer, and I suspect the normal use there is to be able to jump to a particular error line. So we just could just consider killing them.
Comment 4 User image William Lachance (:wlach) (use needinfo!) 2016-03-23 10:49:37 PDT
Actually, now that I look at it, we already have a FailureLine model in treeherder which seems similar to my proposal above to JobLogStepError, but they don't seem to be associated with a particular job. James, would it maybe make sense to just extend your model to handle this behaviour?
Comment 5 User image James Graham [:jgraham] 2016-03-23 10:56:11 PDT
FailureLine is only structured lines at the moment. I guess it's possible to imagine a solution where we dump all log lines in there and somehow differentiate the structured and unstructured lines (this is not the way that the autoclassify stuff works right now, but it might be a reasonable refactor).

Things are associated with a job via job_guid. Using (repository, job_id) sounds appealing, but I think that composite primary keys and joining on composite keys is very poorly supported in django.
Comment 6 User image Autolander 2016-05-16 12:32:03 PDT Comment hidden (typo)
Comment 7 User image William Lachance (:wlach) (use needinfo!) 2016-05-16 12:36:10 PDT Comment hidden (typo)
Comment 8 User image William Lachance (:wlach) (use needinfo!) 2016-07-12 11:05:07 PDT
I am working on this now. It's both the next logical step for artifact migration and will be helpful for putting log steps in the database for analysis.
Comment 9 User image Autolander 2016-07-15 14:35:23 PDT
Created attachment 8771557 [details] [review]
[treeherder] wlach:1258861 > mozilla:master
Comment 10 User image William Lachance (:wlach) (use needinfo!) 2016-08-24 09:30:02 PDT
Comment on attachment 8771557 [details] [review]
[treeherder] wlach:1258861 > mozilla:master

Ok, I believe this is pretty much ready to go except:

(1) We need a migration script for migrating old logs
(2) We need to split this up into seperate commits so we can stage a migration in the usual manner (migrate old data into new tables, then switch the ingestion)

So this is actually more limited a refactoring than I'd like, but I kept it to more or less "just" changing the storage mechanism to the text log summary in the interests of:

* Not changing too much at once
* Getting datasource out of the tree sooner than later

In particular, there are still some intermediary data structures (TextLogSummary and TextLogSummaryLine) which should probably be merged with one of the new models I'm adding (TextLogError) at some future point. But that may be best done in the context of work on the autoclassification model, which I don't want to get into right now because of the above reasons. ^^^
Comment 11 User image Cameron Dawson [:camd] 2016-08-26 09:42:42 PDT
I haven't reviewed the code yet, but before I forget:

But wrt steps, it might be good to keep bug 1245760 in mind.   

tl;dr: there was a job that failed, but we got no failure lines (infra error).  The step itself had a failure code, but we didn't display it as a failure in the log viewer because none of the lines in the step matched a failure regex.  

So while I agree that getting rid of the step parser is probably a good call at this point, we should look for step lines that indicate an error of some kind.  Probably that's just adding a regex to catch it in the parser.
Comment 12 User image William Lachance (:wlach) (use needinfo!) 2016-08-26 09:57:43 PDT
(In reply to Cameron Dawson [:camd] from comment #11)
> So while I agree that getting rid of the step parser is probably a good call
> at this point, we should look for step lines that indicate an error of some
> kind.  Probably that's just adding a regex to catch it in the parser.

FWIW this patch is pretty limited in scope. We keep exactly the same data structure / elements that we had before, they're just now stored in the db with a schema instead of standalone job artifacts.

After discussing this with a lot of people, I think we do in fact want to keep the step parser in some form (and keep storing steps), so we have this data available for later analysis.
Comment 13 User image Cameron Dawson [:camd] 2016-08-26 14:47:36 PDT
Comment on attachment 8771557 [details] [review]
[treeherder] wlach:1258861 > mozilla:master

Since this is just a feedback request, I didn't go through EVERY file.  But this looks good to me.  I went through most of them.  I like the consolidation of the creation of the bug suggestions.
Comment 14 User image Autolander 2016-08-30 13:41:27 PDT
Created attachment 8786471 [details] [review]
[treeherder] wlach:1258861-pre > mozilla:master
Comment 15 User image William Lachance (:wlach) (use needinfo!) 2016-08-30 13:44:44 PDT
Based on previous experience, I think it makes sense to land this in two stages:

1. DB schema changes + migration file
2. Ingestion, webapp, mvc changes

Attached a pull request for just (1)
Comment 16 User image William Lachance (:wlach) (use needinfo!) 2016-08-31 13:35:46 PDT
Comment on attachment 8771557 [details] [review]
[treeherder] wlach:1258861 > mozilla:master

Ok, I'm pretty confident this is ready to go (using the staged deployment strategy I mentioned earlier). James, could you take a look at this when you have a moment? Ideally within the next couple of days (I was hoping to actually move production over to this next week, so I can move on to killing bug suggestion artifacts)

I've described some details of the changes (from the point of view of an external consumer) in the treeherder newsgroup:

https://groups.google.com/forum/#!topic/mozilla.tools.treeherder/iYSee3u2SZg
Comment 17 User image James Graham [:jgraham] 2016-09-01 12:16:29 PDT
Comment on attachment 8771557 [details] [review]
[treeherder] wlach:1258861 > mozilla:master

Looks good so far, have made some comments. Thanks for doing this!
Comment 18 User image Ed Morley [:emorley] 2016-09-06 06:50:43 PDT
We're now at the point where we're pretty much ready to switch stage to Heroku, however I think it will be best to land this prior to doing so (particularly prior to the start of the SCl3 prod -> heroku prod DB replication in bug 1283170).

Does landing this in the next 48 hours sound like something that is viable? :-)
Comment 19 User image William Lachance (:wlach) (use needinfo!) 2016-09-06 07:13:13 PDT
(In reply to Ed Morley [:emorley] from comment #18)
> We're now at the point where we're pretty much ready to switch stage to
> Heroku, however I think it will be best to land this prior to doing so
> (particularly prior to the start of the SCl3 prod -> heroku prod DB
> replication in bug 1283170).
> 
> Does landing this in the next 48 hours sound like something that is viable?
> :-)

For stage? I think it should be. I still need to address some of :jgraham's feedback (see the PR) but there don't seem to be any show-stoppers.

I'd probably want at least a couple days of testing on stage before actually deploying this to production.
Comment 20 User image Ed Morley [:emorley] 2016-09-06 07:13:37 PDT
Just seen this new cycle-data exception on stage:

django.db.utils:IntegrityError: (1451, 'Cannot delete or update a parent row: a foreign key constraint fails (`treeherder_stage`.`text_log_error`, CONSTRAINT `text_log_error_job_id_4b21fee19bc31759_fk_job_id` FOREIGN KEY (`job_id`) REFERENCES `job` (`id`))') 

https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/c6b5ca13-73ff-11e6-a90b-b82a72d22a14_0_7128
Comment 21 User image William Lachance (:wlach) (use needinfo!) 2016-09-06 07:24:40 PDT
(In reply to Ed Morley [:emorley] from comment #20)
> Just seen this new cycle-data exception on stage:
> 
> django.db.utils:IntegrityError: (1451, 'Cannot delete or update a parent
> row: a foreign key constraint fails (`treeherder_stage`.`text_log_error`,
> CONSTRAINT `text_log_error_job_id_4b21fee19bc31759_fk_job_id` FOREIGN KEY
> (`job_id`) REFERENCES `job` (`id`))') 
> 
> https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/
> c6b5ca13-73ff-11e6-a90b-b82a72d22a14_0_7128

Hmm, I think the problem here is that the job log error model has two foreign key references (the job log step and the job itself). We could either tweak the on delete behaviour or remove the job foreign key. I'll probably do the latter, since it shouldn't be necessary (the reference is in the job log step as well).
Comment 22 User image William Lachance (:wlach) (use needinfo!) 2016-09-07 14:13:52 PDT
(In reply to William Lachance (:wlach) from comment #21)
> (In reply to Ed Morley [:emorley] from comment #20)
> > Just seen this new cycle-data exception on stage:
> > 
> > django.db.utils:IntegrityError: (1451, 'Cannot delete or update a parent
> > row: a foreign key constraint fails (`treeherder_stage`.`text_log_error`,
> > CONSTRAINT `text_log_error_job_id_4b21fee19bc31759_fk_job_id` FOREIGN KEY
> > (`job_id`) REFERENCES `job` (`id`))') 
> > 
> > https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/
> > c6b5ca13-73ff-11e6-a90b-b82a72d22a14_0_7128
> 
> Hmm, I think the problem here is that the job log error model has two
> foreign key references (the job log step and the job itself). We could
> either tweak the on delete behaviour or remove the job foreign key. I'll
> probably do the latter, since it shouldn't be necessary (the reference is in
> the job log step as well).

Latest patch removes the foreign key, let's see how that works. :P
Comment 23 User image William Lachance (:wlach) (use needinfo!) 2016-09-07 14:14:57 PDT
Comment on attachment 8771557 [details] [review]
[treeherder] wlach:1258861 > mozilla:master

I think this is ready for another review.
Comment 24 User image Treeherder Bugbot 2016-09-08 11:51:46 PDT
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/e5a19c182f46ce0317cdeb3156d8c043a650007e
Bug 1258861 - Preparatory work for storing text log steps / errors in main db (#1822)

* Model updates, new db tables, migration script for text log db
Comment 25 User image James Graham [:jgraham] 2016-09-12 09:09:12 PDT
Comment on attachment 8771557 [details] [review]
[treeherder] wlach:1258861 > mozilla:master

\o/

Let's hope it works now :)
Comment 26 User image Treeherder Bugbot 2016-09-12 09:30:40 PDT
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/f0105d47f5cd72cbb8742025567c2ca17463f73e
Bug 1258861 - Move text log steps and errors into main database (#1696)

This changes ingestion, the API endpoints, and the frontend to match
the new structure. For now we continue to store text_log_summary artifacts,
though they don't do anything anymore.
Comment 27 User image William Lachance (:wlach) (use needinfo!) 2016-09-12 12:09:41 PDT
This is deployed to production, but leaving open for now. I'll mark as fixed when I've removed the migration script and transitional code.
Comment 28 User image William Lachance (:wlach) (use needinfo!) 2016-09-14 11:09:40 PDT
Created attachment 8791334 [details] [review]
Stop storing text log artifacts + remove migration script
Comment 29 User image Treeherder Bugbot 2016-09-14 13:14:52 PDT
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/9ffb8c49b5e7cf3fb81e6f7053883a99f5005154
Bug 1258861 - Stop creating text log summary artifacts

We only kept creating them so we could revert, but things are looking
happy on stage and production so I don't think we'll need to do that.

https://github.com/mozilla/treeherder/commit/a1f115e2dd11cea7d70839122b67faa69c5770df
Bug 1258861 - Remove script for migrating text log artifacts
Comment 30 User image William Lachance (:wlach) (use needinfo!) 2016-09-14 13:35:24 PDT
Going to call this done. Any further cleanup can be done in followups.

Note You need to log in before you can comment on or make changes to this bug.