
Make it possible to skip to latest date for datasets we've already downloaded #31

Open
choldgraf opened this issue Feb 6, 2019 · 3 comments

@choldgraf
Contributor

Something that I often do is the following:

  1. Check the latest date I have data for in a given org/repo
  2. Choose the next day after that date
  3. Download only the remaining data so I don't double up on what I already have

I think it'd be useful if either:

  1. We made it easier to get the answer to step 1, so that people could programmatically update data by choosing a date that doesn't trigger redundant API calls
  2. We added an option (e.g. "date='latest'") that would check for the last datetime in an org/repo and only download new data after that

The goal of both of these would be to make it easier to incrementally update datasets instead of re-downloading them, since you start to hit API limits when you download a lot at once.
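Purely as a sketch of what the second option could look like (comments_.load_comments and _fetch_comments here are made-up placeholders for whatever watchtower's real loading/downloading functions are):

def update_comments(org, repo, since=None):
    # Hypothetical: resolve since='latest' to the newest datetime we already
    # have cached, so we only ask the API for data after that point
    if since == 'latest':
        cached = comments_.load_comments(org, repo)
        if cached is None or len(cached) == 0:
            # Nothing cached yet, so download everything
            since = None
        else:
            since = cached['created_at'].max()
    return _fetch_comments(org, repo, since=since)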

What do you think, @NelleV?

@NelleV
Contributor

NelleV commented Feb 6, 2019

I don't think this will work reliably for most objects.

@NelleV
Contributor

NelleV commented Feb 6, 2019

Reading the documentation, it actually might. I've also never reached the rate limit, though, so I'm not sure it's worth spending much time on this unless we see it becoming a problem.

@choldgraf
Contributor Author

choldgraf commented Feb 6, 2019

From my perspective it's actually less about API limits and more about time. E.g., if I already have the last 2 years of data from a project, it would be useful for watchtower to say something like "hey, you've already got this data, I won't waste time re-downloading it since it's already there".

Isn't the "date" field always either created_at or date? We only have four kinds of objects to care about, so it shouldn't be that hard to handle the date for each one.
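Just to make that concrete, a mapping like the one below is roughly what I have in mind; the exact object kinds and column names are guesses, not something I've checked against what watchtower actually stores:

# Guesses at the object kinds and their date columns; adjust to whatever
# watchtower actually keeps
DATE_FIELDS = {
    'issues': 'created_at',
    'comments': 'created_at',
    'pull_requests': 'created_at',
    'commits': 'date',
}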

e.g., it could be a function that'd exist in each submodule (e.g. comments_) and would follow a pattern like this:

from datetime import timedelta

def update_comments_from_latest(org, repo):
    # Load the comment data we've already downloaded for this org/repo
    current_comments = comments_.load_comments(org, repo)

    # Find the last day we have data for
    last_date = current_comments['created_at'].max()

    # Back up two days so the old and new data overlap slightly
    from_date = last_date - timedelta(days=2)

    # Only fetch comments since that day
    comments_.update_comments(org, repo, since=from_date)
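Then updating would just be something like update_comments_from_latest("some-org", "some-repo") (placeholder names), and it would only hit the API for the last couple of days of comments plus anything newer, instead of re-downloading the whole history.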
