diamond-price-and-carat-prediction

I used machine learning to predict both the price and the carat of a diamond. I trained several regression models, evaluated their performance with the R-squared (R2) score, and then tuned the hyperparameters of the top models. Finally, I weighed the R2 score against the run time to make a situational decision about which model to select.

Research Question

Regression problems usually take one of two familiar forms: simple regression, where a single independent variable predicts one dependent variable, and multiple regression, where several independent variables predict one dependent variable. But what about problems with several independent variables and more than one dependent variable? These are known as multivariate regression problems, and this project aims to solve one by answering the research question: using the given features, is it possible to predict both the price and the carat of a diamond?
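As a minimal sketch of the multivariate setup, scikit-learn regressors accept a two-column target matrix directly, so one fitted model predicts both outputs at once. The data here is synthetic, standing in for the diamond features and the two targets:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: 4 features, two targets (e.g. price and carat).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Y = X @ rng.normal(size=(4, 2)) + rng.normal(scale=0.1, size=(200, 2))

# Passing Y with shape (n_samples, 2) makes this a multivariate regression.
model = LinearRegression().fit(X, Y)
print(model.predict(X[:1]).shape)  # one row, two predicted targets: (1, 2)
```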

Dataset

A suitable dataset for answering this research question is one with at least two variables that can be predicted. The 'Diamonds' dataset, publicly available on Kaggle, is one such dataset: it has two regression targets, the price and the carat of the diamond. It also has a reasonable shape of (53940, 10), meaning it contains information on 53,940 round-cut diamonds. Each of the 53,940 observations represents a different diamond, described by 10 features.
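To make the schema concrete, here is a tiny stand-in frame with the dataset's 10 columns (the values shown are the first two rows of the Kaggle file; the full dataset has 53,940 rows with this same layout):

```python
import pandas as pd

# Two-row stand-in mirroring the diamonds dataset's 10-column schema.
diamonds = pd.DataFrame({
    "carat":   [0.23, 0.21],
    "cut":     ["Ideal", "Premium"],
    "color":   ["E", "E"],
    "clarity": ["SI2", "SI1"],
    "depth":   [61.5, 59.8],
    "table":   [55.0, 61.0],
    "price":   [326, 326],
    "x":       [3.95, 3.89],   # length in mm
    "y":       [3.98, 3.84],   # width in mm
    "z":       [2.43, 2.31],   # height in mm
})
targets = diamonds[["price", "carat"]]               # the two regression targets
features = diamonds.drop(columns=["price", "carat"])  # the remaining 8 predictors
print(diamonds.shape)  # (2, 10) here; (53940, 10) for the full dataset
```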

Regression

The R-squared score (R2 score) is the performance metric used to evaluate the regression models. The R2 score is not the same as accuracy in a classification task; instead, it indicates what percentage of the variability in the dependent variables (here, the price and the carat of the diamond) the model can explain.
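For two targets, scikit-learn's `r2_score` can average the per-target R2 values into one number (the default `'uniform_average'` behavior), which is a natural way to score price and carat jointly. A small worked example with made-up predictions:

```python
import numpy as np
from sklearn.metrics import r2_score

# Columns: [price, carat]; three made-up test diamonds.
y_true = np.array([[300.0, 0.30], [500.0, 0.50], [700.0, 0.70]])
y_pred = np.array([[310.0, 0.31], [490.0, 0.52], [720.0, 0.69]])

# 'uniform_average' averages each target's R2 into a single score.
score = r2_score(y_true, y_pred, multioutput="uniform_average")
print(score)  # ≈ 0.9925: the model explains ~99.25% of the variability
```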

Initial Regression Models

All five initial regression models are initialized with default parameters. After training and testing, the R2 score on the test set establishes the Linear Regression model as the baseline, while the Lasso Regression model has the lowest R2 score. To improve the latter, its alpha parameter is set to 0.0001, which lifts its R2 score slightly above the baseline. Comparing the final scores, the Random Forest Regression model outperforms all the other models and becomes the best of the initial regression models.
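The comparison above can be sketched as follows, using synthetic two-target data in place of the encoded diamonds features (so the ranking here will differ from the real experiment — on this linear synthetic data the linear models come out on top):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic two-target data standing in for the preprocessed diamonds data.
X, Y = make_regression(n_samples=500, n_features=8, n_targets=2,
                       noise=5.0, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

models = {
    "Linear":       LinearRegression(),
    "Lasso":        Lasso(alpha=0.0001),  # the small alpha used in the project
    "Ridge":        Ridge(),
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(random_state=0),
}
# .score() on a regressor returns the (uniform-average) test-set R2.
scores = {name: m.fit(X_tr, Y_tr).score(X_te, Y_te) for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} R2 = {s:.4f}")
```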

Hyperparameter Tuning

To further improve the R2 score of the Random Forest Regression model, its most important hyperparameters are tuned: n_estimators, the number of trees in the forest; max_features, the maximum number of features considered when splitting a node; and bootstrap, which controls whether data points are sampled with or without replacement.

Tuning uses a grid search with cross-validation. The candidate values are 75, 100, 125, and 150 for n_estimators; 2, 4, 6, 8, and 10 for max_features; and True and False for bootstrap. The cross-validation parameter is set to 5, indicating 5 folds. Supplying this parameter grid to the grid search yields the best values n_estimators: 125, max_features: 10, bootstrap: True, which give a slight improvement in the R2 score.
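The grid search step can be sketched with scikit-learn's `GridSearchCV`. The project's full grid (40 parameter combinations at 5 folds) is slow to run, so this sketch uses a deliberately reduced grid and synthetic data; only the mechanics are the same:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, Y = make_regression(n_samples=200, n_features=10, n_targets=2,
                       noise=5.0, random_state=0)

# The project's grid was n_estimators [75, 100, 125, 150],
# max_features [2, 4, 6, 8, 10], bootstrap [True, False], cv=5.
# A reduced grid is used here so the sketch runs quickly.
param_grid = {
    "n_estimators": [25, 50],
    "max_features": [2, 4],
    "bootstrap": [True],
}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=3, scoring="r2")
search.fit(X, Y)
print(search.best_params_)  # the winning combination from the grid
```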

In the hope of an even better R2 score, a second grid is searched: 125, 200, 250, and 275 for n_estimators and 10, 12, 14, and 16 for max_features, with the bootstrap values and the number of folds unchanged. This search yields the best values n_estimators: 275, max_features: 14, bootstrap: True, giving another slight improvement in the R2 score.

Trade-Off

Finally, the most important features of the improved model (after the second grid search) are reported, and a new Random Forest Regression model is trained on just those features. This causes a slight decrease in the R2 score. Since the performance difference is small, the run time is tracked for both models: the improved model with all 25 features and the reduced model with only the 10 most important features. Choosing the better of the two therefore becomes a trade-off between the R2 score and the run time.
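The select-and-retrain step can be sketched as follows, again on synthetic data: rank features by the fitted forest's `feature_importances_`, keep the top 10 of 25, retrain, and time both fits for the trade-off comparison:

```python
import time
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, Y = make_regression(n_samples=400, n_features=25, n_targets=2,
                       noise=5.0, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

full = RandomForestRegressor(n_estimators=50, random_state=0)
t0 = time.perf_counter()
full.fit(X_tr, Y_tr)
full_time = time.perf_counter() - t0

# Keep the 10 features the full model ranks highest, then retrain on them.
top10 = np.argsort(full.feature_importances_)[::-1][:10]
small = RandomForestRegressor(n_estimators=50, random_state=0)
t0 = time.perf_counter()
small.fit(X_tr[:, top10], Y_tr)
small_time = time.perf_counter() - t0

print(f"full:  R2={full.score(X_te, Y_te):.4f}  fit={full_time:.2f}s")
print(f"top10: R2={small.score(X_te[:, top10], Y_te):.4f}  fit={small_time:.2f}s")
```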

Results

The baseline model is the least performant, explaining only 90.34% (R2 score) of the variability in the price and the carat of the diamond. The Decision Tree and Random Forest Regression models, with test-set R2 scores of 97.90% and 98.91% respectively, outperform the baseline, Ridge Regression, and Lasso Regression models by a significant margin.

As with the base Random Forest Regression model, the R2 score is used to evaluate the improved Random Forest models, which are trained with the best parameters found by the grid search with cross-validation. On the test set, the Random Forest model from the second grid search is the top performer, while the base Random Forest model is the least performant of the three, explaining 98.91% (R2 score) of the variability in the price and the carat of the diamond. The models from the second and first grid searches, with test-set R2 scores of 98.96% and 98.95% respectively, outperform the base Random Forest model by a very small margin.

Among all the features, y (the diamond's width) is the most important for predicting the two target variables, with an importance of 0.52. The remaining dimensions, x (length) and z (height), take second and third place with importances of 0.27 and 0.11, respectively. Clarity SI2 follows with an importance of 0.02, and the clarities VVS2, SI1, and I1 together with the colors H, J, and I trail last with importances of 0.01 each.

On the test set, the Random Forest model from the second grid search remains the top performer, while the Random Forest model with only the most important features is the least performant, explaining 98.09% (R2 score) of the variability in the price and the carat of the diamond.

Although restricting the model to the most important features costs a marginal amount of performance, it also reduces the run time, i.e. the time needed to fit the training data and test the model. How many features to keep is therefore a trade-off between the R2 score and the run time. Comparing the two models, reducing the features from 25 to 10 causes a relative decrease of 0.879% in the R2 score but a relative decrease of 76.76% in the run time. All 25 features give a higher R2 score at a higher run time; the 10 most important features give a slightly lower R2 score at a much lower run time. Hence, the decision for the best and final model is a trade-off between the R2 score and the run time.
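The relative-decrease arithmetic behind those percentages can be checked directly (R2 values are the reported ones; the run times are illustrative, normalized so the full model takes 1 unit):

```python
# Reported test-set R2 scores, as fractions of 1.
r2_full, r2_top10 = 0.9896, 0.9809
# Illustrative run times, normalized to the full model's fit time.
time_full, time_top10 = 1.0, 1.0 - 0.7676

rel_r2_drop = (r2_full - r2_top10) / r2_full
rel_time_drop = (time_full - time_top10) / time_full
print(f"relative R2 decrease:       {rel_r2_drop:.3%}")   # ≈ 0.879%
print(f"relative run-time decrease: {rel_time_drop:.3%}")  # ≈ 76.760%
```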

Final Remarks

The regression models trained and tested are Multivariate Linear Regression, Lasso Regression, Ridge Regression, Decision Tree, and Random Forest. The hyperparameters of the Random Forest Regression model are tuned using grid search with cross-validation, and the performance of every model is evaluated with the R2 score. The highest R2 scores achieved are 98.96% for the model with all 25 features and 98.09% for the model with the 10 most important features. The best model can then be chosen by trading off the R2 score against the run time.
