Viral-News-Prediction

A complete project that fetches the latest news from well-known online news websites and uses its outreach on social media, as well as coverage by news agencies, to predict the degree to which a headline will go viral.

The main.py file is the main script. Constraints may need to be adjusted for your use case.

The project is divided into three distinct components:

  1. News Crawler
  2. Detection of similar headlines
  3. Social Media Impact Prediction (Twitter)

[UPDATE]

In September 2023, Twitter revoked Free Tier access to its APIs for everything except posting and deleting tweets, with no search or read options.

However, I have updated the code base to work with Twitter API v2.0. All you need is the BEARER_TOKEN and ACCESS_TOKEN (KEY and SECRET). I encourage any user with access to a higher API tier to test the Twitter module and raise issues if you notice any. Thanks!
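
For reference, connecting to the v2 endpoints with Tweepy looks roughly like the sketch below; the query string, tweet fields, and credential handling are illustrative assumptions rather than the repo's exact code.

```python
import tweepy

# Placeholder credential; the repo expects BEARER_TOKEN (and ACCESS_TOKEN KEY/SECRET).
BEARER_TOKEN = "your-bearer-token"

# Twitter API v2 client; search_recent_tweets needs a tier with read access.
client = tweepy.Client(bearer_token=BEARER_TOKEN, wait_on_rate_limit=True)

# Search recent tweets about a headline, excluding retweets, and request
# public_metrics so retweet counts can be aggregated later.
response = client.search_recent_tweets(
    query='"sample headline text" -is:retweet lang:en',
    tweet_fields=["created_at", "public_metrics"],
    max_results=100,
)

for tweet in response.data or []:
    print(tweet.created_at, tweet.public_metrics["retweet_count"])
```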

Topic Modelling is still functional and can be used without any restrictions [requires a NewsAPI API key]. Alternatively, you may continue using the sample dataset from the repository, but note that it was crawled in 2020.

Project Details

1. News Crawler

The project implements two different methods of scraping news from the internet. The first uses the News API. The second is a custom crawler that currently fetches headlines from only two websites, NDTV and Times of India; however, it covers all subsidiaries of those domains and will be extended to more sites. The number of pages the custom crawler scrapes can be adjusted in main.py. The project uses both crawling scripts, since the custom crawler has limited scope while the News API may be limited by request rates.
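
As a rough illustration of the News API route (not the repo's exact crawler code), fetching top headlines can look like this; the NEWSAPI_KEY variable name, country, and page size are assumptions:

```python
import os
import requests

# Hypothetical environment variable; any valid NewsAPI key works here.
API_KEY = os.environ.get("NEWSAPI_KEY")

# NewsAPI top-headlines endpoint; country and pageSize are illustrative choices.
resp = requests.get(
    "https://newsapi.org/v2/top-headlines",
    params={"country": "in", "pageSize": 50, "apiKey": API_KEY},
    timeout=10,
)
resp.raise_for_status()

headlines = [article["title"] for article in resp.json().get("articles", [])]
print(headlines[:5])
```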

2. Detection of similar headlines

To detect similar headlines, I used a Word2Vec model. Because the crawled data provides only a limited text corpus, the project starts from the pre-trained Google News word embeddings, which cover roughly 3 million words and phrases, and trains them further on the vocabulary of the crawled corpus. Document vectors are then built from these word embeddings for each headline. Custom document vectors are used instead of Doc2Vec, since Word2Vec with this tweak gives results more pertinent to this use case [Ref. 1].
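
As a minimal sketch of the document-vector idea (the tokenisation and helper names here are assumptions, not the repo's actual functions), each headline can be represented by the average of its word vectors:

```python
import numpy as np
from gensim.models import KeyedVectors

# Pre-trained Google News vectors (see the note below for the limit option).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def headline_vector(headline: str) -> np.ndarray:
    """Average the vectors of all in-vocabulary tokens in a headline."""
    tokens = [t for t in headline.lower().split() if t in kv]
    if not tokens:
        return np.zeros(kv.vector_size)
    return np.mean([kv[t] for t in tokens], axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two headline vectors; high values flag similar stories."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```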

IMPORTANT NOTE: The model may take a long time to train the first time, since the GoogleNews vector set is extremely large! There are a few ways to mitigate this:

  1. Add a limit constraint to the build_w2v_model() function to cap the number of pre-trained vectors loaded (see the sketch after this list).
  2. Set the train_further constraint of the build_w2v_model() function to False. This skips training on the new vocabulary from the crawled corpus; the effect on training time is smaller than with option 1, but it does reduce it.
  3. Use an alternative environment such as Google Colab. Loading the vectors in IDEs like JetBrains PyCharm may cause a MemoryError due to the limited memory allocated to the environment, so you may want to run the main.py script from the Terminal/Command Prompt instead.
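
As referenced in option 1, capping the number of pre-trained vectors at load time can be illustrated with gensim; the 500,000 figure is an example, not a value taken from the repository:

```python
from gensim.models import KeyedVectors

# Load only the first 500,000 vectors in the file (for the GoogleNews set these
# are roughly the most frequent words), which cuts load time and memory sharply.
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin",
    binary=True,
    limit=500_000,
)

print(kv.most_similar("election", topn=3))
```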

3. Social Media Impact Prediction

I used Twitter as the platform for measuring impact, as it is known for public discussion of sensitive topics and therefore gives a good view of opinions, sentiment and reach. The Twitter data is treated as a time series of the number of tweets and retweets about each crawled headline, grouped hourly. Since headlines are recent, there isn't much data for an accurate prediction, so the project uses Facebook's Prophet, a robust forecasting library that models trends together with seasonal and holiday effects. The Twitter API also limits how many tweets can be retrieved in a given period, so a TIME_SLEEP delay is added after each iteration to allow all headlines to be processed seamlessly. For quicker results, you can also adjust FETCH_TOP_NEWS to limit the number of headlines searched on Twitter.
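
A minimal sketch of the Prophet forecasting step, assuming the hourly tweet counts have already been collected (the dataframe below uses placeholder values rather than real Twitter data):

```python
import pandas as pd
from prophet import Prophet

# Prophet expects a dataframe with columns "ds" (timestamp) and "y" (value);
# here the hourly counts are placeholders standing in for real Twitter counts.
counts = pd.DataFrame({
    "ds": pd.date_range("2024-01-01", periods=48, freq="H"),
    "y": [abs(24 - h % 24) for h in range(48)],  # fake daily-cycle counts
})

model = Prophet()
model.fit(counts)

# Forecast the next 7 days at hourly resolution (7 * 24 periods).
future = model.make_future_dataframe(periods=7 * 24, freq="H")
forecast = model.predict(future)
print(forecast[["ds", "yhat"]].tail())
```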

Scoring

Since two different approaches and environments are used for the prediction, a scoring metric combines them to calculate virality. The idea of score assignment was inspired by Roja Bandari et al. [Ref. 2].

The score assignment gives 30% weight to the Word2Vec model and 70% to the Twitter analysis over the next 7 days, since the former only captures how many news sites cover the topic, while the latter estimates how many people reacted to it.
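
In sketch form, the weighting is a simple linear combination; scaling both components to a common range is an assumption here, as the repository's exact normalisation may differ:

```python
def virality_score(w2v_score: float, twitter_score: float) -> float:
    """Combine the two component scores with the 30/70 weighting (inputs assumed in [0, 1])."""
    return 0.3 * w2v_score + 0.7 * twitter_score

# Example: moderate news-site coverage, strong predicted Twitter response.
print(virality_score(0.4, 0.9))  # 0.75
```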

A sample output is available in the results directory.

References

  1. https://towardsdatascience.com/using-word2vec-to-analyze-news-headlines-and-predict-article-success-cdeda5f14751
  2. https://www.hpl.hp.com/research/scl/papers/newsprediction/pulse.pdf
