streaming #1

laszewsk · 2020-03-27T12:31:57Z

We need an application that we run on Kubernetes (we don't have anyone for that).

I was thinking about

https://www.dataquest.io/blog/streaming-data-python/

e.g. storing twitter data and analysing it in a kubernetes cluster

However, we do not want to store it in SQL but in Parquet (https://parquet.apache.org/).

You would start by writing an organized program that just reads the 1% twitter stream, and work with me on a regular basis.

I am interested in a histogram in case the tweets allow this.

Let's assume I store all tweets. Does the 1% stream include deletion events? What is the distribution of the time to live (TTL) of deleted tweets?

e.g. tweet x is posted at t_create and deleted at t_delete. Then TTL = t_delete - t_create

we want a histogram of that
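A minimal sketch of the TTL histogram, to make the idea concrete. It assumes the stream delivers tweets as JSON objects with `id_str` and `timestamp_ms`, and delete notices of the form `{"delete": {"status": {"id_str": ...}, "timestamp_ms": ...}}`; whether the 1% stream really contains such notices is exactly what needs to be verified first. The function names and the 60 second bin width are made up for illustration.

```python
from collections import Counter


def ttl_seconds(events):
    """Pair each delete notice with the tweet it refers to; return TTLs in seconds."""
    created = {}  # tweet id -> time the tweet was seen, in milliseconds
    ttls = []
    for event in events:
        if "delete" in event:
            # Assumed shape of a delete notice (not confirmed in this issue).
            tweet_id = event["delete"]["status"]["id_str"]
            t_delete = int(event["delete"]["timestamp_ms"])
            t_create = created.pop(tweet_id, None)
            if t_create is not None:  # skip deletes of tweets we never stored
                ttls.append((t_delete - t_create) / 1000.0)
        elif "id_str" in event and "timestamp_ms" in event:
            created[event["id_str"]] = int(event["timestamp_ms"])
    return ttls


def histogram(ttls, bin_width=60):
    """Count TTLs per bin; each key is the lower edge of its bin in seconds."""
    return Counter(int(ttl // bin_width) * bin_width for ttl in ttls)


# Hand-made sample events, not real twitter data.
events = [
    {"id_str": "1", "timestamp_ms": "1000"},
    {"id_str": "2", "timestamp_ms": "2000"},
    {"delete": {"status": {"id_str": "1"}, "timestamp_ms": "31000"}},
    {"delete": {"status": {"id_str": "2"}, "timestamp_ms": "92000"}},
    {"delete": {"status": {"id_str": "9"}, "timestamp_ms": "99000"}},
]
ttls = ttl_seconds(events)
print(histogram(ttls))
```

Deletes for tweets that were never stored are dropped here, since their t_create is unknown.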

Then we put this in a Kubernetes cluster. The first task can be done without a cluster, e.g. stream data, verify that delete events are there, ...
So it looks like there are many good examples on the twitter API

https://www.storybench.org/how-to-collect-tweets-from-the-twitter-streaming-api-using-python/

Maybe you can do such a simple program and make sure to write a README on how to run your program and set it up. I had lots of students that did such a project before easily in less than 2 hours, so this should not be an issue

I suggest we do

mkdir cloudmesh-twitter

cms sys command generate twitter .

cms twitter register REGISTER

somehow register what we get from twitter

cms twitter stream start [--file=FILE]

starts the stream; for now it just prints the tweets, or, if a file is given, writes them to that file
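The print-or-file behaviour could look roughly like the sketch below. It only covers the output side and takes any iterable of already decoded events, so it does not touch the twitter API. The function name `write_events` and the one-JSON-object-per-line format are assumptions, not something decided in this issue.

```python
import json
import sys


def write_events(events, file=None):
    """Write each event as one JSON line to `file`, or to stdout if no file is given."""
    out = open(file, "a", encoding="utf-8") if file else sys.stdout
    count = 0
    try:
        for event in events:
            out.write(json.dumps(event) + "\n")
            count += 1
    finally:
        if file:  # never close stdout
            out.close()
    return count
```

Appending rather than overwriting means a restarted stream keeps the tweets collected so far.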

cms twitter stream start [--file=FILE] [--attributes=ATTRIBUTES]

only stores selected attributes of the tweet and not the entire tweet
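A sketch of the attribute selection, assuming ATTRIBUTES is a comma-separated list of names and that a dotted name such as `user.screen_name` reaches into a nested object. Both conventions are assumptions for illustration; the issue does not define the format.

```python
def select_attributes(tweet, attributes):
    """Return a dict with only the named attributes; missing ones become None."""
    selected = {}
    for name in attributes.split(","):
        name = name.strip()
        value = tweet
        # Walk dotted names into nested dictionaries.
        for key in name.split("."):
            if not isinstance(value, dict) or key not in value:
                value = None
                break
            value = value[key]
        selected[name] = value
    return selected
```

For example, `select_attributes(tweet, "id_str,user.screen_name")` keeps the tweet id and the author's screen name and drops everything else.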

cms twitter stream start [--file=FILE] [--filter=FILTER]  # filter not important as we want all tweets

If you start on things, start not with parquet but with the little program.

I will set this up in a repo in a couple of minutes. Start with learning how to get a twitter API key and how to store it in ~/.cloudmesh

it does not have to be in cloudmesh.yaml
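One possible way to keep the key outside cloudmesh.yaml is a small JSON file under ~/.cloudmesh. The file name `twitter.json` and the four field names below are assumptions for this sketch (they mirror the consumer key/secret and access token/secret that twitter hands out), not an agreed layout.

```python
import json
from pathlib import Path

# Hypothetical field names for the stored credentials.
REQUIRED = ("consumer_key", "consumer_secret", "access_token", "access_token_secret")

# Hypothetical location; any file under ~/.cloudmesh would do.
DEFAULT_PATH = Path.home() / ".cloudmesh" / "twitter.json"


def load_credentials(path=DEFAULT_PATH):
    """Read the twitter credentials from a JSON file and check that all fields exist."""
    path = Path(path)
    with path.open(encoding="utf-8") as f:
        credentials = json.load(f)
    missing = [key for key in REQUIRED if key not in credentials]
    if missing:
        raise KeyError(f"missing in {path}: {', '.join(missing)}")
    return credentials
```

The file should be readable only by its owner (e.g. `chmod 600`), since it contains secrets.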
