Open In Colab

MIDAS Lab Task-3 NLP

Contents

Files to Refer

  • The repo works best in Colab.
  • Notebook1 - Cleaning, EDA & Preparation for Modelling
  • Notebook2 - Modelling
  • Drive Folder
    • data.csv - Raw Dataset Provided.
    • processed_data.csv - Processed Dataset generated by Notebook1.
    • below_thresh_index.txt - Indexes of examples whose category was rare in the dataset, generated by Notebook1 (more details in Notebook1). A usage sketch follows this list.
    • Models/Pretrained-bert - Saved pretrained model, generated by Notebook2 when TRAIN = True and used for loading and inference in Notebook2.
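
A minimal sketch of how the index file could be consumed, assuming one integer index per line (the exact format is documented in Notebook1):

```python
# Hypothetical usage of below_thresh_index.txt: drop the rows whose
# category was too rare to keep. One integer index per line is assumed.
import pandas as pd

df = pd.read_csv("data.csv")

with open("below_thresh_index.txt") as f:
    below_thresh = [int(line) for line in f if line.strip()]

df = df.drop(index=below_thresh).reset_index(drop=True)
print(f"dropped {len(below_thresh)} rare-category rows, {len(df)} remain")
```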

Models Used

  • Random Forest Classifier, Weighted F1 Score - 0.9764
  • DistilBert Uncased, Weighted F1 Score - 0.8970
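
Both figures are the weighted F1 from scikit-learn. A minimal sketch with toy labels (not the repo's actual categories):

```python
# Weighted F1 averages the per-class F1 scores weighted by class support,
# so frequent and rare categories both contribute proportionally.
from sklearn.metrics import f1_score

y_true = ["Clothing", "Clothing", "Watches", "Footwear"]  # toy labels
y_pred = ["Clothing", "Watches", "Watches", "Footwear"]

print(f1_score(y_true, y_pred, average="weighted"))  # 0.75
```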

Things tried & Further Improvements

  • In Notebook1's preprocessing, lemmatization of all the text features was tried with spaCy; it was dropped because it made little difference given the vocabulary, and the pipeline was too slow (>30 min to lemmatize ~20k samples). A sketch of the step follows this list.
  • The description feature read more like a specification than a description with semantic sense, so product_specifications was deemed more useful for fine-tuning a pretrained model for sequence classification.
  • Because TF-IDF was used for the first model, ~47k features were generated. SparsePCA was tried to reduce them, but the dataset was too large and Colab crashed. Since the first model was already giving a decent score, IncrementalPCA, which could have overcome the memory issue, wasn't tried. A sketch of this pipeline also follows the list.
  • For the second model, DistilBERT was fine-tuned; it gave a decent score with only ~20% of the examples (a BERT memory constraint), and only the product_specifications feature was used, as it had a semantic order. A fine-tuning sketch follows the list as well.
  • For both the Random Forest and the sequence classifier, the weighted F1 score is computed to ensure the imbalance of the dataset is taken care of (see the metric sketch under Models Used).
  • It is interesting to see both models' predictions on the discarded examples (those which did not have a target). It is amazing what transfer learning can do with just 20 examples for each category.
  • To improve performance, hyperparameter tuning can be done and the pretrained model can be trained with more data.
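
A sketch of the lemmatization that was tried and dropped, assuming spaCy's small English model; disabling the parser and NER is one common way to speed such a pipeline up:

```python
import spacy

# Load only what lemmatization needs; the parser and NER just add latency here.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize(text: str) -> str:
    """Replace each token with its lemma."""
    return " ".join(tok.lemma_ for tok in nlp(text))

print(lemmatize("The watches were running late"))  # e.g. "the watch be run late"
```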
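A sketch of the first model's pipeline under stated assumptions: TF-IDF features feeding a Random Forest, with hypothetical text and category column names standing in for whatever Notebook2 actually uses:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("processed_data.csv")  # column names below are assumptions
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"], stratify=df["category"], random_state=42
)

# TF-IDF over the full corpus is what yields the ~47k sparse features.
model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=42))
model.fit(X_train, y_train)
print(f1_score(y_test, model.predict(X_test), average="weighted"))
```

If dimensionality reduction were revisited, IncrementalPCA fits in mini-batches, which is what could sidestep the memory crash that SparsePCA hit.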
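A hedged sketch of the DistilBERT fine-tuning with Hugging Face Transformers; the toy data, hyperparameters, and Trainer wiring are illustrative assumptions, with only the model family and save path taken from the repo:

```python
import torch
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)

texts = ["Material: Cotton; Sleeve: Half", "Strap: Leather; Dial: Round"]  # toy specs
labels = [0, 1]  # toy category ids

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

class SpecDataset(torch.utils.data.Dataset):
    """Wraps tokenized product_specifications texts for the Trainer."""

    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="Models/Pretrained-bert", num_train_epochs=3),
    train_dataset=SpecDataset(texts, labels),
)
trainer.train()
trainer.save_model("Models/Pretrained-bert")  # reloaded for inference later
```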
