streaming #1

Open
laszewsk opened this issue Mar 27, 2020 · 0 comments

Comments

@laszewsk
Contributor

We need an application that we run on Kubernetes (we don't have anyone for that).

I was thinking about

https://www.dataquest.io/blog/streaming-data-python/

e.g. storing Twitter data and analysing it in a Kubernetes cluster

However, we do not want to store it in SQL, but in Parquet: https://parquet.apache.org/

You would start with writing an organized program to just read the 1% twitter stream and work with me on a regular basis

I am interested in a histogram in case the tweets allow this.

Let's assume I store all tweets. Does the 1% stream include deletion events? If so, what is the distribution of the time to live (TTL) of deleted tweets?

e.g. tweet x is posted at t_create and deleted at t_delete. Then TTL = t_delete - t_create

we want a histogram of that
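As a rough illustration of the TTL histogram, here is a minimal sketch. It assumes we have already collected two dictionaries mapping tweet id to a timestamp in milliseconds, one for creations and one for deletions. The function name `ttl_histogram`, the 60-second bin width, and the sample data are all made up for this example.

```python
from collections import Counter


def ttl_histogram(created_ms, deleted_ms, bin_seconds=60):
    """Count deleted tweets per TTL bin.

    created_ms and deleted_ms map tweet id -> timestamp in milliseconds
    (an assumed layout, not something the stream hands us directly).
    Returns {bin_index: count}, where bin i covers
    [i * bin_seconds, (i + 1) * bin_seconds).
    """
    counts = Counter()
    for tweet_id, t_delete in deleted_ms.items():
        t_create = created_ms.get(tweet_id)
        if t_create is None:
            continue  # deletion of a tweet we never saw being created
        ttl_seconds = (t_delete - t_create) / 1000.0
        if ttl_seconds < 0:
            continue  # inconsistent timestamps, skip
        counts[int(ttl_seconds // bin_seconds)] += 1
    return dict(sorted(counts.items()))


# Made-up sample data: tweet 4 is deleted but was never seen.
created = {1: 0, 2: 1_000, 3: 5_000}
deleted = {1: 30_000, 2: 91_000, 4: 10_000}

histogram = ttl_histogram(created, deleted)
for bin_index, count in histogram.items():
    start = bin_index * 60
    print(f"{start:>6}s to {start + 60:>6}s  {'#' * count}")
```

Tweets whose deletion we see without having seen the creation are skipped here; on a 1% sample that case may be common, so it is probably worth counting them separately.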

Then we put this in a Kubernetes cluster. The first task can be done without a cluster, e.g. stream data, verify that delete events are there, ...
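To check whether delete events show up, something like the following sketch could count message kinds in captured stream output. It assumes each line is one JSON object, that a deletion notice has a top-level `delete` key, and that a tweet has `id_str` and `text`. That shape is my understanding of the v1.1 streaming messages and should be verified against real data. The names `classify` and `count_kinds` and the sample lines are made up.

```python
import json


def classify(line):
    """Return 'delete', 'tweet', or 'other' for one line of stream output."""
    line = line.strip()
    if not line:
        return "other"  # blank keep-alive line
    message = json.loads(line)
    if "delete" in message:
        return "delete"  # assumed shape of a deletion notice
    if "id_str" in message and "text" in message:
        return "tweet"
    return "other"  # e.g. limit notices or other control messages


def count_kinds(lines):
    """Tally how many lines fall into each kind."""
    counts = {"tweet": 0, "delete": 0, "other": 0}
    for line in lines:
        counts[classify(line)] += 1
    return counts


# Made-up sample lines in the assumed format.
sample_lines = [
    '{"id_str": "1", "text": "hello", "timestamp_ms": "1585300000000"}',
    '{"delete": {"status": {"id_str": "1", "user_id_str": "7"}, '
    '"timestamp_ms": "1585300060000"}}',
    "",
    '{"limit": {"track": 5}}',
]

sample_counts = count_kinds(sample_lines)
print(sample_counts)
```

If the `delete` count stays at zero over a long capture, the stream likely does not carry deletion events and the TTL question needs another data source.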
So it looks like there are many good examples of using the Twitter API:

https://www.storybench.org/how-to-collect-tweets-from-the-twitter-streaming-api-using-python/

Maybe you can write such a simple program, and make sure to write a README on how to set it up and run it. I have had lots of students who easily did such a project in less than 2 hours, so this should not be an issue.

I suggest we do

mkdir cloudmesh-twitter

cms sys command generate twitter .

cms twitter register REGISTER

somehow register what we get from Twitter

cms twitter stream start [--file=FILE]

starts the stream; for now it just prints, or if a file is given, the output is written to that file

cms twitter stream start [--file=FILE] [--attributes=ATTRIBUTES]

only stores selected attributes of the tweet and not the entire tweet

cms twitter stream start [--file=FILE] [--filter=FILTER]  # filter not important as we want all tweets
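The output side of the `stream start` options above could look roughly like this sketch. It is not the cloudmesh command itself, only the part that prints or writes tweets. I am assuming ATTRIBUTES is a comma-separated list of top-level tweet keys and that output is one JSON object per line. The function names and the sample tweet are hypothetical.

```python
import json
import sys


def select_attributes(tweet, attributes=None):
    """Keep only the named top-level keys; keep everything if none are given.

    attributes is assumed to be a comma-separated string such as
    "id_str,text".
    """
    if not attributes:
        return tweet
    names = [name.strip() for name in attributes.split(",") if name.strip()]
    return {name: tweet[name] for name in names if name in tweet}


def write_stream(tweets, file=None, attributes=None):
    """Print each tweet as one JSON line, or append to file if one is given."""
    out = open(file, "a", encoding="utf-8") if file else sys.stdout
    try:
        for tweet in tweets:
            out.write(json.dumps(select_attributes(tweet, attributes)) + "\n")
    finally:
        if file:
            out.close()


# Made-up tweet standing in for what the stream would deliver.
sample_tweet = {"id_str": "1", "text": "hi", "lang": "en"}
write_stream([sample_tweet], attributes="id_str,text")
```

Writing one JSON object per line keeps the file appendable while the stream runs, and it is easy to convert to Parquet later.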

If you start on things, do not start with Parquet, but with the little program.

I will set this up in a repo in a couple of minutes. Start with learning how to get a Twitter API key and how to store it in ~/.cloudmesh

it does not have to be in cloudmesh.yaml
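One possible way to keep the key under ~/.cloudmesh without touching cloudmesh.yaml is a small JSON file readable only by the owner. This is only a sketch: the file name `twitter.json`, the key names, and both helper functions are my own assumptions, not an existing cloudmesh convention. The demo writes to a temporary directory so it does not touch the real home directory.

```python
import json
import os
import tempfile
from pathlib import Path

DEFAULT_DIR = Path.home() / ".cloudmesh"
FILENAME = "twitter.json"  # hypothetical file name


def save_credentials(credentials, directory=DEFAULT_DIR):
    """Write the credentials as JSON, creating the file with mode 0600."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / FILENAME
    # Requesting 0o600 at creation avoids a window where others can read it.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        json.dump(credentials, handle, indent=2)
    return path


def load_credentials(directory=DEFAULT_DIR):
    """Read the credentials back as a dict."""
    with open(Path(directory) / FILENAME, encoding="utf-8") as handle:
        return json.load(handle)


# Demo with placeholder values in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    save_credentials(
        {"consumer_key": "PLACEHOLDER", "consumer_secret": "PLACEHOLDER"},
        scratch,
    )
    roundtrip = load_credentials(scratch)

print(sorted(roundtrip))
```

Whatever file is used, it should stay out of the git repository.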
