I will be analyzing a dataset on breast cancer recurrence. The target variable is whether a patient experienced a recurrence event, so we will be performing binary classification. The features are categorical in nature.

Win condition: AUROC score of 0.80 or higher

Here is the data citation:

Title: Breast cancer data (Michalski has used this)

Sources:
– Matjaz Zwitter & Milan Soklic (physicians)
Institute of Oncology
University Medical Center
Ljubljana, Yugoslavia
– Donors: Ming Tan and Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
– Date: 11 July 1988

Import Libraries

import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set_style('darkgrid')
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

from sklearn.exceptions import NotFittedError
from sklearn.metrics import roc_auc_score

Exploratory Analysis and Data Wrangling

From the included .names file:

Number of Instances: 286

Number of Attributes: 9 + the class attribute

Attribute Information:

  1. Class: no-recurrence-events, recurrence-events
  2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
  3. menopause: lt40, ge40, premeno.
  4. tumor_size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
  5. inv_nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
  6. node-caps: yes, no.
  7. deg_malig: 1, 2, 3.
  8. breast: left, right.
  9. breast_quad: left-up, left-low, right-up, right-low, central.
  10. irradiat: yes, no.

Missing Attribute Values (denoted by "?"), by attribute:
node_caps: 8
breast_quad: 1

Class Distribution:

  1. no-recurrence-events: 201 instances
  2. recurrence-events: 85 instances

# Importing data
df = pd.read_csv('breast-cancer.data', header=None, names=['target', 'age', 'menopause', 'tumor_size', 'inv_nodes', 'node_caps', 'deg_malig', 'breast', 'breast_quad', 'irradiat'])
df.head()
target age menopause tumor_size inv_nodes node_caps deg_malig breast breast_quad irradiat
0 no-recurrence-events 30-39 premeno 30-34 0-2 no 3 left left_low no
1 no-recurrence-events 40-49 premeno 20-24 0-2 no 2 right right_up no
2 no-recurrence-events 40-49 premeno 20-24 0-2 no 2 left left_low no
3 no-recurrence-events 60-69 ge40 15-19 0-2 no 2 right left_up no
4 no-recurrence-events 40-49 premeno 0-4 0-2 no 2 right right_low no
df.dtypes
target         object
age            object
menopause      object
tumor_size     object
inv_nodes      object
node_caps      object
deg_malig       int64
breast         object
breast_quad    object
irradiat       object
dtype: object
df.hist()

png (histogram of deg_malig, the only numeric column at this stage)

# flag rows where node_caps is missing ('?'), then relabel them
mask_1 = df['node_caps'] == '?'
df.loc[mask_1, 'node_caps'] = 'missing'
df['node_caps_missing'] = mask_1
df[mask_1]
target age menopause tumor_size inv_nodes node_caps deg_malig breast breast_quad irradiat node_caps_missing
145 no-recurrence-events 40-49 premeno 25-29 0-2 missing 2 left right_low yes True
163 no-recurrence-events 60-69 ge40 25-29 3-5 missing 1 right left_up yes True
164 no-recurrence-events 60-69 ge40 25-29 3-5 missing 1 right left_low yes True
183 no-recurrence-events 50-59 ge40 30-34 9-11 missing 3 left left_up yes True
184 no-recurrence-events 50-59 ge40 30-34 9-11 missing 3 left left_low yes True
233 recurrence-events 70-79 ge40 15-19 9-11 missing 1 left left_low yes True
263 recurrence-events 50-59 lt40 20-24 0-2 missing 1 left left_up no True
264 recurrence-events 50-59 lt40 20-24 0-2 missing 1 left left_low no True
# same treatment for breast_quad
mask_2 = df['breast_quad'] == '?'
df.loc[mask_2, 'breast_quad'] = 'missing'
df['breast_quad_missing'] = mask_2
df[mask_2]
target age menopause tumor_size inv_nodes node_caps deg_malig breast breast_quad irradiat node_caps_missing breast_quad_missing
206 recurrence-events 50-59 ge40 30-34 0-2 no 3 left missing no False True
df['node_caps_missing'] = df['node_caps_missing'].astype(int)
df['breast_quad_missing'] = df['breast_quad_missing'].astype(int)
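
Since we apply the same flag-and-relabel pattern to two columns, a small helper could consolidate it. This is just a sketch; flag_missing is a hypothetical name, not something used elsewhere in this notebook.

def flag_missing(frame, column, token='?'):
    # relabel `token` values as 'missing' and add a 0/1 indicator column
    mask = frame[column] == token
    frame.loc[mask, column] = 'missing'
    frame[column + '_missing'] = mask.astype(int)
    return frame

# equivalent to the two blocks above:
# df = flag_missing(df, 'node_caps')
# df = flag_missing(df, 'breast_quad')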
df.describe(include='object')
target age menopause tumor_size inv_nodes node_caps breast breast_quad irradiat
count 286 286 286 286 286 286 286 286 286
unique 2 6 3 11 7 3 2 6 2
top no-recurrence-events 50-59 premeno 30-34 0-2 no left left_low no
freq 201 96 150 60 213 222 152 110 218
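
Note the class balance in the summary above: 201 of 286 instances (about 70%) are no-recurrence-events. A majority-class baseline would already reach roughly 70% accuracy while being useless, which is one reason the win condition is stated in AUROC, where a no-skill classifier scores 0.5. A quick check:

# class balance; the majority class is ~70% of instances
df['target'].value_counts(normalize=True)
# no-recurrence-events    ~0.703
# recurrence-events       ~0.297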
# count plot for every remaining object (categorical) column
for column in df.dtypes[df.dtypes == 'object'].index:
    sns.countplot(y=df[column])
    plt.show()

png (nine count plots, one per object column: target, age, menopause, tumor_size, inv_nodes, node_caps, breast, breast_quad, irradiat)

There are sparse classes in these features: age, menopause, tumor_size, inv_nodes, and breast_quad.

Because age, tumor_size, and inv_nodes are ordinal features, we can handle their sparse classes by mapping the bins to integer values.

We can handle the menopause feature by collapsing it into a binary feature: premenopausal or not.

For breast_quad, I'm going to combine the sparse right_up, right_low, and central classes into a single 'other' class.

# Change age to an ordinal feature
df.age.unique()
array(['30-39', '40-49', '60-69', '50-59', '70-79', '20-29'], dtype=object)
age_map = {'20-29':1, '30-39':2, '40-49':3, '50-59':4, '60-69':5, '70-79':6}
df['age'] = df.age.replace(age_map)
df.age.head()
0    2
1    3
2    3
3    5
4    3
Name: age, dtype: int64
# Change tumor_size to an ordinal feature
df.tumor_size.unique()
array(['30-34', '20-24', '15-19', '0-4', '25-29', '50-54', '10-14',
       '40-44', '35-39', '5-9', '45-49'], dtype=object)
tumor_size_map = {'0-4':1, '5-9':2, '10-14':3,'15-19':4,'20-24':5,'25-29':6,'30-34':7,'35-39':8,'40-44':9,'45-49':10,'50-54':11}
df['tumor_size'] = df.tumor_size.replace(tumor_size_map)
df.tumor_size.head()
0    7
1    5
2    5
3    4
4    1
Name: tumor_size, dtype: int64
# change inv_nodes to ordinal feature
df.inv_nodes.unique()
array(['0-2', '6-8', '9-11', '3-5', '15-17', '12-14', '24-26'],
      dtype=object)
inv_nodes_map = {'0-2':1, '3-5':2,'6-8':3, '9-11':4, '12-14':5, '15-17':6, '18-20':7, '21-23':8, '24-26':9}
df['inv_nodes'] = df.inv_nodes.replace(inv_nodes_map)
df.inv_nodes.head()
0    1
1    1
2    1
3    1
4    1
Name: inv_nodes, dtype: int64
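
As an aside, the three hand-written maps above could be replaced with ordered pandas Categoricals, which derive integer codes from an explicit bin order. A minimal sketch, not run here since the columns are already converted (and note that .codes is 0-based, unlike the 1-based maps above):

# alternative to age_map, shown for age only
age_bins = ['20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
df['age'] = pd.Categorical(df['age'], categories=age_bins, ordered=True).codes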
df.hist(figsize=(10,10))
plt.show()

png (histograms of the numeric features, including the new ordinal encodings)

# binarize menopause: premenopausal (1) or not (0)
df['premeno'] = (df.menopause == 'premeno').astype(int)
df.premeno.head()
0    1
1    1
2    1
3    0
4    1
Name: premeno, dtype: int64
df.drop(columns=['menopause'], inplace=True)
df.head()
target age tumor_size inv_nodes node_caps deg_malig breast breast_quad irradiat node_caps_missing breast_quad_missing premeno
0 no-recurrence-events 2 7 1 no 3 left left_low no 0 0 1
1 no-recurrence-events 3 5 1 no 2 right right_up no 0 0 1
2 no-recurrence-events 3 5 1 no 2 left left_low no 0 0 1
3 no-recurrence-events 5 4 1 no 2 right left_up no 0 0 0
4 no-recurrence-events 3 1 1 no 2 right right_low no 0 0 1
# change yes and no classes to integer values
df['node_caps'] = df.node_caps.replace({'no':0, 'yes':1})
df['irradiat'] = df.irradiat.replace({'no':0, 'yes':1})
# rename ordinal features
rename_ordinal_map = {
    'age' : 'age_ordinal',
    'tumor_size' : 'tumor_size_ordinal',
    'inv_nodes' : 'inv_nodes_ordinal'
}
df.rename(columns = rename_ordinal_map, inplace=True)
# rename target and encode it as 0/1 (note: the positive class, 1, is *no* recurrence)
df.rename(columns={'target':'no_recurrence'}, inplace=True)
df['no_recurrence'] = df.no_recurrence.replace({'no-recurrence-events':1, 'recurrence-events':0})
# replacing sparse classes in breast_quad with 'other'
df.breast_quad.unique()
array(['left_low', 'right_up', 'left_up', 'right_low', 'central',
       'missing'], dtype=object)
# restrict the replacement to breast_quad so values in other columns can't be touched
df['breast_quad'] = df.breast_quad.replace(['right_up', 'right_low', 'central'], 'other')
df.head()
no_recurrence age_ordinal tumor_size_ordinal inv_nodes_ordinal node_caps deg_malig breast breast_quad irradiat node_caps_missing breast_quad_missing premeno
0 1 2 7 1 0 3 left left_low 0 0 0 1
1 1 3 5 1 0 2 right other 0 0 0 1
2 1 3 5 1 0 2 left left_low 0 0 0 1
3 1 5 4 1 0 2 right left_up 0 0 0 0
4 1 3 1 1 0 2 right other 0 0 0 1
df.dtypes
no_recurrence           int64
age_ordinal             int64
tumor_size_ordinal      int64
inv_nodes_ordinal       int64
node_caps              object
deg_malig               int64
breast                 object
breast_quad            object
irradiat                int64
node_caps_missing       int64
breast_quad_missing     int64
premeno                 int64
dtype: object
# get_dummies will generate node_caps_missing and breast_quad_missing dummies itself, so drop our indicator columns first to avoid duplicates
df.drop(columns = ['node_caps_missing', 'breast_quad_missing'], inplace=True)
df.head()
no_recurrence age_ordinal tumor_size_ordinal inv_nodes_ordinal node_caps deg_malig breast breast_quad irradiat premeno
0 1 2 7 1 0 3 left left_low 0 1
1 1 3 5 1 0 2 right other 0 1
2 1 3 5 1 0 2 left left_low 0 1
3 1 5 4 1 0 2 right left_up 0 0
4 1 3 1 1 0 2 right other 0 1
abt = pd.get_dummies(df, columns=['node_caps', 'breast', 'breast_quad'])
abt.head()
no_recurrence age_ordinal tumor_size_ordinal inv_nodes_ordinal deg_malig irradiat premeno node_caps_0 node_caps_1 node_caps_missing breast_left breast_right breast_quad_left_low breast_quad_left_up breast_quad_missing breast_quad_other
0 1 2 7 1 3 0 1 1 0 0 1 0 1 0 0 0
1 1 3 5 1 2 0 1 1 0 0 0 1 0 0 0 1
2 1 3 5 1 2 0 1 1 0 0 1 0 1 0 0 0
3 1 5 4 1 2 0 0 1 0 0 0 1 0 1 0 0
4 1 3 1 1 2 0 1 1 0 0 0 1 0 0 0 1
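
One caveat about the dummy encoding: with default settings, get_dummies emits one column per level, so pairs like node_caps_0/node_caps_1 and breast_left/breast_right are perfectly collinear. The tree models don't care, and the regularized logistic regressions tolerate it, but dropping one level per feature is a common alternative; a sketch:

# alternative encoding that drops one (redundant) level per categorical
abt = pd.get_dummies(df, columns=['node_caps', 'breast', 'breast_quad'], drop_first=True)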

Training and Evaluating Models

y = abt.no_recurrence
X = abt.drop(['no_recurrence'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=123)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(191, 15) (95, 15) (191,) (95,)
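
With only 286 rows and a roughly 70/30 class split, an unstratified split can leave the test set with a noticeably different class balance than the training set. Stratifying on y is a cheap safeguard; a sketch (the results below come from the unstratified split above):

# alternative split that preserves the class ratio in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=123, stratify=y)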
# creating pipelines for logistic regression using l1 and l2 regularization,
# and for random forest and gradient boosting classifiers
pipelines = {
    # liblinear supports both l1 and l2 penalties (newer sklearn defaults to lbfgs, which cannot fit l1)
    'l1': make_pipeline(StandardScaler(), LogisticRegression(penalty='l1', solver='liblinear', random_state=123)),
    'l2': make_pipeline(StandardScaler(), LogisticRegression(penalty='l2', solver='liblinear', random_state=123)),
    'rf': make_pipeline(StandardScaler(), RandomForestClassifier(random_state=123)),
    'gb': make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=123))
}
# setting hyperparameters for l1 and l2 logistic regression
l1_hyperparameters = {'logisticregression__C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]}
l2_hyperparameters = {'logisticregression__C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]}
rf_hyperparameters = {
    'randomforestclassifier__n_estimators': [100, 200],
    # 'auto' equals 'sqrt' for classifiers (and was removed in newer sklearn), so it is dropped here
    'randomforestclassifier__max_features': ['sqrt', 0.33, 0.5],
    'randomforestclassifier__min_samples_leaf': [1, 3, 5, 10]
}
gb_hyperparameters = {
    'gradientboostingclassifier__n_estimators': [100, 200],
    'gradientboostingclassifier__learning_rate': [0.05, 0.1, 0.2],
    'gradientboostingclassifier__max_depth': [1, 3, 5]
}
# setting up hyperparameters dictionary for pipeline workflow
hyperparameters = {
    'l1':l1_hyperparameters,
    'l2':l2_hyperparameters,
    'rf':rf_hyperparameters,
    'gb':gb_hyperparameters
}
fitted_models = {}

for name, pipeline in pipelines.items():
    # 10-fold cross-validated grid search (GridSearchCV's default scoring is accuracy)
    model = GridSearchCV(pipeline, hyperparameters[name], cv=10, n_jobs=-1)
    model.fit(X_train, y_train)
    fitted_models[name] = model
    print(name, 'has been fitted.')
l1 has been fitted.
l2 has been fitted.
rf has been fitted.
gb has been fitted.
for name, model in fitted_models.items():
    print(name, type(model))
l1 <class 'sklearn.model_selection._search.GridSearchCV'>
l2 <class 'sklearn.model_selection._search.GridSearchCV'>
rf <class 'sklearn.model_selection._search.GridSearchCV'>
gb <class 'sklearn.model_selection._search.GridSearchCV'>
for name, model in fitted_models.items():
    try:
        model.predict(X_test)
        print(name, 'can be predicted.')
    except NotFittedError as e:
        print(repr(e))
l1 can be predicted.
l2 can be predicted.
rf can be predicted.
gb can be predicted.
for name, model in fitted_models.items():
    print(name, model.best_score_)
l1 0.7225130890052356
l2 0.7172774869109948
rf 0.7486910994764397
gb 0.7329842931937173
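
A caveat when reading these numbers: best_score_ is the mean cross-validation score under GridSearchCV's default metric, which is accuracy, so it is not directly comparable to the AUROC win condition. To select hyperparameters on AUROC itself, the searches could be rerun with scoring='roc_auc'; a sketch for one pipeline:

# hypothetical rerun that tunes directly for AUROC instead of accuracy
model = GridSearchCV(pipelines['gb'], hyperparameters['gb'],
                     cv=10, n_jobs=-1, scoring='roc_auc')
model.fit(X_train, y_train)
print(model.best_score_)  # now a cross-validated AUROC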
for name, model in fitted_models.items():
    pred_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    print(f'{name} AUROC Score: {roc_auc_score(y_test, pred_prob):.5f}')
l1 AUROC Score: 0.67884
l2 AUROC Score: 0.70469
rf AUROC Score: 0.68497
gb AUROC Score: 0.71908

Analysis

We fell short of our win condition of an AUROC score of 0.80 or higher.

Our best score was 0.719 for the gradient boosting model; l2 logistic regression was close behind with an AUROC of 0.705.

To improve the score we could keep fine-tuning the hyperparameters (a sketch of a widened grid follows below) or try a deep learning algorithm.
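
As one concrete next step, the gradient boosting grid could be widened; subsample and min_samples_leaf often matter on small datasets. This is a sketch of the direction, not a tuned configuration:

# hypothetical wider grid for the gradient boosting pipeline
gb_hyperparameters_wide = {
    'gradientboostingclassifier__n_estimators': [100, 200, 400],
    'gradientboostingclassifier__learning_rate': [0.01, 0.05, 0.1, 0.2],
    'gradientboostingclassifier__max_depth': [1, 2, 3, 5],
    'gradientboostingclassifier__subsample': [0.6, 0.8, 1.0],
    'gradientboostingclassifier__min_samples_leaf': [1, 5, 10]
}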

fitted_models['gb'].best_estimator_.named_steps['gradientboostingclassifier']
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.05, loss='deviance', max_depth=1,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              n_iter_no_change=None, presort='auto', random_state=123,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)
fitted_models['l2'].best_estimator_.named_steps['logisticregression']
LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=123, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
# looking at feature importances of our Gradient Boosting Classifier
importances = fitted_models['gb'].best_estimator_.named_steps['gradientboostingclassifier'].feature_importances_
import_ind = np.argsort(importances)[::-1]
ind_names = [X.columns[i] for i in import_ind]
plt.figure()
plt.title('Feature Importance')
plt.bar(range(X.shape[1]), importances[import_ind])
plt.xticks(range(X.shape[1]), ind_names, rotation=90)
plt.show()

png (feature importance bar chart)

Degree of malignancy and number of invasive nodes were the two most important features, which matches clinical intuition. The breast_quad and breast (left vs. right) dummies contributed nothing, so it might be worth dropping them as features if we run this again, especially if doing so would let us keep or collect more usable data.
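
A hedged sketch of that rerun idea, using the column names produced by get_dummies above: rebuild the model matrix without the location features and repeat the split and grid search.

# hypothetical rerun without the zero-importance location features
drop_cols = [c for c in abt.columns if c.startswith('breast')]  # breast_* and breast_quad_* dummies
X_slim = abt.drop(columns=['no_recurrence'] + drop_cols)
# ...then repeat train_test_split and GridSearchCV on X_slim

Whether this actually helps AUROC would need to be verified with the same cross-validation setup.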