shubamsumbria/Breast-Cancer-Pred


Comparative Analysis of Different Machine Learning Classification Algorithms for Breast Cancer Prediction

This is a Python-based implementation comparing different classification algorithms on the task of breast cancer prediction using machine learning.

Language and Libraries

Python · scikit-learn · NumPy · pandas · Seaborn

About the Dataset:

Breast Cancer Wisconsin (Diagnostic) Data Set from UCI ML Repository

Attribute Information:

  1. ID number
  2. Diagnosis (M = malignant, B = benign)
Ten real-valued features are computed for each cell nucleus (columns 3–32):
  • radius (mean of distances from center to points on the perimeter)
  • texture (standard deviation of gray-scale values)
  • perimeter
  • area
  • smoothness (local variation in radius lengths)
  • compactness (perimeter^2 / area - 1.0)
  • concavity (severity of concave portions of the contour)
  • concave points (number of concave portions of the contour)
  • symmetry
  • fractal dimension ("coastline approximation" - 1)

Creators:

  1. Dr. William H. Wolberg, General Surgery Dept. University of Wisconsin, Clinical Sciences Center Madison, WI 53792 wolberg '@' eagle.surgery.wisc.edu
  2. W. Nick Street, Computer Sciences Dept. University of Wisconsin, 1210 West Dayton St., Madison, WI 53706 street '@' cs.wisc.edu 608-262-6619
  3. Olvi L. Mangasarian, Computer Sciences Dept. University of Wisconsin, 1210 West Dayton St., Madison, WI 53706 olvi '@' cs.wisc.edu

Donor:

Nick Street

Exploratory Data Analysis

Checking Null and Missing Values

Null Values:
 diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

Missing Values: (identical output — every column reports 0 missing values)
  • After checking the null-value counts, missing-value counts, and dataset info, this dataset is clean: it contains no null or missing values.
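The check above can be sketched as follows. This sketch loads the copy of the Wisconsin Diagnostic dataset bundled with scikit-learn; the repository presumably loads it from a CSV instead, so column names may differ slightly.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load WDBC as a DataFrame: 569 rows, 30 features + 'target' column
df = load_breast_cancer(as_frame=True).frame

# Per-column null count (equivalent to the listing above)
null_counts = df.isnull().sum()
print(null_counts)
print("total nulls:", null_counts.sum())  # 0 -> no nulls anywhere
```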

Dataset Information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    int64  
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  569 non-null    float64
 15  smoothness_se            569 non-null    float64
 16  compactness_se           569 non-null    float64
 17  concavity_se             569 non-null    float64
 18  concave points_se        569 non-null    float64
 19  symmetry_se              569 non-null    float64
 20  fractal_dimension_se     569 non-null    float64
 21  radius_worst             569 non-null    float64
 22  texture_worst            569 non-null    float64
 23  perimeter_worst          569 non-null    float64
 24  area_worst               569 non-null    float64
 25  smoothness_worst         569 non-null    float64
 26  compactness_worst        569 non-null    float64
 27  concavity_worst          569 non-null    float64
 28  concave points_worst     569 non-null    float64
 29  symmetry_worst           569 non-null    float64
 30  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), int64(1)
memory usage: 137.9 KB

Statistical Description of Data

Count Based On Diagnosis:

Observation: We have 357 benign cases and 212 malignant cases, so the dataset is slightly imbalanced. Various re-sampling algorithms (under-sampling, over-sampling, SMOTE, etc.) can address this; the algorithm should be chosen to fit the problem.

Correlation with Diagnosis:

Correlation of Mean Features with Diagnosis:

Observations:

  • fractal_dimension_mean is the least correlated with the target variable.
  • All other mean features have a significant correlation with the target variable.

Correlation of Standard Error Features with Diagnosis:

Observations:

  • texture_se, smoothness_se, symmetry_se, and fractal_dimension_se are the least correlated with the target variable.
  • All other standard error features have a significant correlation with the target variable.

Correlation of Worst Features with Diagnosis:

  • Observation: All worst features have a significant correlation with the target variable.
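The correlation rankings above can be reproduced with a few lines of pandas. The sketch below uses scikit-learn's copy of the dataset, where the feature names are spelled slightly differently (e.g. "mean radius" instead of radius_mean).

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame

# Absolute correlation of every feature with the diagnosis,
# sorted from strongest to weakest.
corr = (df.corr()["target"]
          .drop("target")
          .abs()
          .sort_values(ascending=False))
print(corr.head())  # "worst" size/shape features tend to rank highest
print(corr.tail())  # fractal-dimension and several _se features rank lowest
```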

Distribution of Nucleus Features by Diagnosis:

Mean Features vs Diagnosis:

Standard Error Features vs Diagnosis:

Worst Features vs Diagnosis:

Checking Multicollinearity Between Distinct Features:

Mean Features:

Standard Error Features:

Worst Features:

Observations:

  • Almost perfectly linear relationships between the radius, perimeter, and area attributes hint at multicollinearity among these variables.
  • Another set of variables that likely exhibit multicollinearity is concavity, concave points, and compactness.

Correlation Heatmap between Nucleus Features:

Why multicollinearity is a problem: Link

  • Observation: We can verify multicollinearity between some variables. The radius, perimeter, and area columns essentially carry the same information (the physical size of the cell), so only one of the three should be kept for further analysis.
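This claim is easy to verify numerically: radius, perimeter, and area are all functions of cell size, so their pairwise correlations should be near 1. A minimal check, using scikit-learn's copy of the dataset (feature names differ slightly from the repo's CSV):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
size_cols = ["mean radius", "mean perimeter", "mean area"]

# All off-diagonal correlations come out around 0.99, confirming
# that keeping just one of the three columns loses little information.
print(df[size_cols].corr().round(3))
```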

Things to remember while working with this dataset:

  • Slightly imbalanced dataset (357 benign cases and 212 malignant cases). We have to select an adequate re-sampling algorithm for balancing.
  • Multicollinearity between some features.
  • As three columns essentially contain the same information, which is the physical size of the cell, we have to choose an appropriate feature selection method to eliminate unnecessary features.

Classifiers Used:

  1. Logistic Regression
  2. Decision Tree Classifier
  3. Random Forest Classifier
  4. K-Nearest Neighbors
  5. Linear SVM
  6. Kernel SVM
  7. Gaussian Naive Bayes
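A minimal sketch of the comparison: train each listed classifier on a standardized train/test split and report test accuracy. The hyperparameters below are illustrative scikit-learn defaults, not the repository's tuned values, and the split seed is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# The seven classifiers listed above, with default settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Linear SVM": SVC(kernel="linear"),
    "Kernel SVM": SVC(kernel="rbf"),
    "Gaussian Naive Bayes": GaussianNB(),
}

for name, clf in models.items():
    # Scaling inside a pipeline avoids leaking test statistics
    # into the fit, and matters for KNN and the SVMs.
    pipe = make_pipeline(StandardScaler(), clf)
    acc = pipe.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name:22s} {acc:.3f}")
```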
