Information Retrieval Course - Yelp Dataset Challenge

The repository contains the code set for the project for course ILS Z534 Information Retrieval offered during Fall 2014. The project had two tasks as explained below:

Task I

Categorize the business based on the review text. Idea was to follow Machine Learning Approach and use Multinomial Naive Bayes Classification algorithm as this is a multi label classification problem. Library used for this task was Weka Library.

Here, we use Bag-of-words model and Naive Bayes Algorithm assumes all the words are independent of each other. The features are the set of words which were chosen using TF-IDF and labels are the set of categories. The evaluation metrics used for this task is Precision and Recall. Root Mean Square Error was used to calculate the amount of deviation in predicting the categories. The training to test data set ratio was 80:20.

Worked on a data set that contained every unique business id with corresponding categories of Restaurants that belong to it and every review associated with that id. Followed the Machine Learning Approach and used Multinomial Naive Bayes Classification algorithm as this is a multi label classification problem. Library used for this task was Weka Library.

The features are the set of words which were chosen using TF-IDF and labels are the set of categories. The evaluation metrics used for this task is Precision and Recall. Root Mean Square Error was used to calculate the amount of deviation in predicting the categories. The training to test data set ratio was k fold cross validation with a factor of 5 across the data set.

Algorithm/Metric	Accuracy	Average Precision	Average Recall
Multinomial Naive Bayes	35.35%	0.33	0.353
Naive Bayes	32.34%	0.34	0.323
SMO (SVM using Poly Kernel)	30.41%	0.294	0.304
Decision Trees	27.41%	0.266	0.274

Task II

Try to predict the review rating from the review text. We try couple of approaches:

Under Supervised Learning, we use Multinomial Bayes, Support Vector Machine and Support Vector Regression algorithms to classify the text based on the reviews. These algorithms are available as part of SciKit python library.
Next approach under Supervised Learning, was to use NaiveBayesAnalyzer which is available as part of TextBlob python library. TextBlob library is built on top of NLTK. The NaiveBayesAnalyzer is trained using 1000 positive movie reviews and 1000 negative reviews. It was surprising to see that accuracy was more when we used the model based on movie reviews.
The last approach is a kind of un-supervised learning approach where we try to use SentiWordNet and Stanford NLP. Stanford NLP is used for POS tagging. Once the text is tagged, SentiWordNet is used to get the sentiment score based on the POS and the word. We then choose the top 20 words based on the sentiment score and come up with the sentiment of the text. Based on this sentiment score, the review is categorized into ratings from 1 to 5.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
Docs		Docs
TaskI		TaskI
TaskII		TaskII
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Docs

Docs

TaskI

TaskI

TaskII

TaskII

README.md

README.md

Repository files navigation

Information Retrieval Course - Yelp Dataset Challenge

Task I

Task II

About

Releases

Packages

Languages

avaradha/InfoRetrievalYelpChallenge

Folders and files

Latest commit

History

Repository files navigation

Information Retrieval Course - Yelp Dataset Challenge

Task I

Task II

About

Resources

Stars

Watchers

Forks

Languages