
Make it possible to skip to latest date for datasets we've already downloaded #31

Open
choldgraf opened this issue Feb 6, 2019 · 3 comments

@choldgraf
Contributor

Something that I often do is the following:

  1. Check the latest date I have data for in a given org/repo
  2. Choose the next day after that date
  3. Download only the remaining data so I don't double up on what I already have

I think it'd be useful if either:

  1. We made it easier to get the answer to step 1, so that people could programmatically update data by choosing a date that doesn't trigger redundant API calls
  2. We added an option (e.g. "date='latest'") that would check for the last datetime in an org/repo and only download new data after that

The goal of both of these would be to make it easier to incrementally update datasets instead of re-downloading them, since you start to hit API limits when you download a lot at once.
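Purely as a sketch of what the second option could look like (comments_.load_comments and _fetch_comments here are made-up placeholders for whatever watchtower's real loading/downloading functions are):

def update_comments(org, repo, since=None):
    # Hypothetical: resolve since='latest' to the newest datetime we already
    # have cached, so we only ask the API for data after that point
    if since == 'latest':
        cached = comments_.load_comments(org, repo)
        if cached is None or len(cached) == 0:
            # Nothing cached yet, so download everything
            since = None
        else:
            since = cached['created_at'].max()
    return _fetch_comments(org, repo, since=since)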

What do you think, @NelleV?

@NelleV
Contributor

NelleV commented Feb 6, 2019

I don't think this will work reliably for most objects.

@NelleV
Contributor

NelleV commented Feb 6, 2019

Reading the documentation, it actually might. I've also never reached the rate limit, though, so I'm not sure it's worth spending much time on this unless we see it becoming a problem.

@choldgraf
Contributor Author

choldgraf commented Feb 6, 2019

From my perspective it's actually less about API limits and more about time. E.g., if I already have the last 2 years of data from a project, it would be useful for watchtower to say something like "hey, you've already got this data, I won't waste time re-downloading it since it's already there".

Isn't the "date" field always either created_at or date? We only have four kinds of objects to care about, so it shouldn't be that hard to handle the date for each one.
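Just to make that concrete, a mapping like the one below is roughly what I have in mind; the exact object kinds and column names are guesses, not something I've checked against what watchtower actually stores:

# Guesses at the object kinds and their date columns; adjust to whatever
# watchtower actually keeps
DATE_FIELDS = {
    'issues': 'created_at',
    'comments': 'created_at',
    'pull_requests': 'created_at',
    'commits': 'date',
}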

e.g., it could be a function that'd exist in each submodule (e.g. comments_) and would follow a pattern like this:

from datetime import timedelta

def update_comments_from_latest(org, repo):
    # Load the comment data we've already downloaded for this org/repo
    current_comments = comments_.load_comments(org, repo)

    # Find the last day we have data for
    last_date = current_comments['created_at'].max()

    # Back up two days so the old and new data overlap slightly
    from_date = last_date - timedelta(days=2)

    # Only fetch comments since that day
    comments_.update_comments(org, repo, since=from_date)
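Then updating would just be something like update_comments_from_latest("some-org", "some-repo") (placeholder names), and it would only hit the API for the last couple of days of comments plus anything newer, instead of re-downloading the whole history.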
