I will be analyzing a dataset on breast cancer recurrence. The target variable is whether a patient experienced a recurrence event, so we will be performing binary classification. The features are categorical in nature.

Win condition: AUROC score of 0.80 or higher

Here is the data citation:

Title: Breast cancer data (Michalski has used this)

Sources:
– Matjaz Zwitter & Milan Soklic (physicians)
Institute of Oncology
University Medical Center
Ljubljana, Yugoslavia
– Donors: Ming Tan and Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
– Date: 11 July 1988

Import Libraries

import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set_style('darkgrid')
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

from sklearn.exceptions import NotFittedError
from sklearn.metrics import roc_auc_score

Exploratory Analysis and Data Wrangling

From the included .names file:

Number of Instances: 286

Number of Attributes: 9 + the class attribute

Attribute Information:

  1. Class: no-recurrence-events, recurrence-events
  2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
  3. menopause: lt40, ge40, premeno.
  4. tumor_size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
  5. inv_nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
  6. node-caps: yes, no.
  7. deg_malig: 1, 2, 3.
  8. breast: left, right.
  9. breast_quad: left-up, left-low, right-up, right-low, central.
  10. irradiat: yes, no.

Missing Attribute Values (denoted by "?"), by attribute:
node_caps: 8
breast_quad: 1

Class Distribution:

  1. no-recurrence-events: 201 instances
  2. recurrence-events: 85 instances

# Importing data
df = pd.read_csv('breast-cancer.data', header=None, names=['target', 'age', 'menopause', 'tumor_size', 'inv_nodes', 'node_caps', 'deg_malig', 'breast', 'breast_quad', 'irradiat'])
df.head()
target age menopause tumor_size inv_nodes node_caps deg_malig breast breast_quad irradiat
0 no-recurrence-events 30-39 premeno 30-34 0-2 no 3 left left_low no
1 no-recurrence-events 40-49 premeno 20-24 0-2 no 2 right right_up no
2 no-recurrence-events 40-49 premeno 20-24 0-2 no 2 left left_low no
3 no-recurrence-events 60-69 ge40 15-19 0-2 no 2 right left_up no
4 no-recurrence-events 40-49 premeno 0-4 0-2 no 2 right right_low no
df.dtypes
target         object
age            object
menopause      object
tumor_size     object
inv_nodes      object
node_caps      object
deg_malig       int64
breast         object
breast_quad    object
irradiat       object
dtype: object
df.hist()

png (histogram of deg_malig, the only numeric column at this stage)

# flag rows where node_caps is missing ('?'), then relabel them
mask_1 = df['node_caps'] == '?'
df.loc[mask_1, 'node_caps'] = 'missing'
df['node_caps_missing'] = mask_1
df[mask_1]
target age menopause tumor_size inv_nodes node_caps deg_malig breast breast_quad irradiat node_caps_missing
145 no-recurrence-events 40-49 premeno 25-29 0-2 missing 2 left right_low yes True
163 no-recurrence-events 60-69 ge40 25-29 3-5 missing 1 right left_up yes True
164 no-recurrence-events 60-69 ge40 25-29 3-5 missing 1 right left_low yes True
183 no-recurrence-events 50-59 ge40 30-34 9-11 missing 3 left left_up yes True
184 no-recurrence-events 50-59 ge40 30-34 9-11 missing 3 left left_low yes True
233 recurrence-events 70-79 ge40 15-19 9-11 missing 1 left left_low yes True
263 recurrence-events 50-59 lt40 20-24 0-2 missing 1 left left_up no True
264 recurrence-events 50-59 lt40 20-24 0-2 missing 1 left left_low no True
# same treatment for breast_quad
mask_2 = df['breast_quad'] == '?'
df.loc[mask_2, 'breast_quad'] = 'missing'
df['breast_quad_missing'] = mask_2
df[mask_2]
target age menopause tumor_size inv_nodes node_caps deg_malig breast breast_quad irradiat node_caps_missing breast_quad_missing
206 recurrence-events 50-59 ge40 30-34 0-2 no 3 left missing no False True
df['node_caps_missing'] = df['node_caps_missing'].astype(int)
df['breast_quad_missing'] = df['breast_quad_missing'].astype(int)
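
Since we apply the same flag-and-relabel pattern to two columns, a small helper could consolidate it. This is just a sketch; flag_missing is a hypothetical name, not something used elsewhere in this notebook.

def flag_missing(frame, column, token='?'):
    # relabel `token` values as 'missing' and add a 0/1 indicator column
    mask = frame[column] == token
    frame.loc[mask, column] = 'missing'
    frame[column + '_missing'] = mask.astype(int)
    return frame

# equivalent to the two blocks above:
# df = flag_missing(df, 'node_caps')
# df = flag_missing(df, 'breast_quad')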
df.describe(include='object')
target age menopause tumor_size inv_nodes node_caps breast breast_quad irradiat
count 286 286 286 286 286 286 286 286 286
unique 2 6 3 11 7 3 2 6 2
top no-recurrence-events 50-59 premeno 30-34 0-2 no left left_low no
freq 201 96 150 60 213 222 152 110 218
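
Note the class balance in the summary above: 201 of 286 instances (about 70%) are no-recurrence-events. A majority-class baseline would already reach roughly 70% accuracy while being useless, which is one reason the win condition is stated in AUROC, where a no-skill classifier scores 0.5. A quick check:

# class balance; the majority class is ~70% of instances
df['target'].value_counts(normalize=True)
# no-recurrence-events    ~0.703
# recurrence-events       ~0.297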
# count plot for every remaining object (categorical) column
for column in df.dtypes[df.dtypes == 'object'].index:
    sns.countplot(y=df[column])
    plt.show()

png (nine count plots, one per object column: target, age, menopause, tumor_size, inv_nodes, node_caps, breast, breast_quad, irradiat)

There are sparse classes in these features: age, menopause, tumor_size, inv_nodes, and breast_quad.

Because age, tumor_size, and inv_nodes are ordinal features, we can handle their sparse classes by mapping the bins to integer values.

We can handle the menopause feature by collapsing it into a binary feature: premenopausal or not.

For breast_quad, I'm going to combine the sparse right_up, right_low, and central classes into a single 'other' class.

# Change age to an ordinal feature
df.age.unique()
array(['30-39', '40-49', '60-69', '50-59', '70-79', '20-29'], dtype=object)
age_map = {'20-29':1, '30-39':2, '40-49':3, '50-59':4, '60-69':5, '70-79':6}
df['age'] = df.age.replace(age_map)
df.age.head()
0    2
1    3
2    3
3    5
4    3
Name: age, dtype: int64
# Change tumor_size to an ordinal feature
df.tumor_size.unique()
array(['30-34', '20-24', '15-19', '0-4', '25-29', '50-54', '10-14',
       '40-44', '35-39', '5-9', '45-49'], dtype=object)
tumor_size_map = {'0-4':1, '5-9':2, '10-14':3,'15-19':4,'20-24':5,'25-29':6,'30-34':7,'35-39':8,'40-44':9,'45-49':10,'50-54':11}
df['tumor_size'] = df.tumor_size.replace(tumor_size_map)
df.tumor_size.head()
0    7
1    5
2    5
3    4
4    1
Name: tumor_size, dtype: int64
# change inv_nodes to ordinal feature
df.inv_nodes.unique()
array(['0-2', '6-8', '9-11', '3-5', '15-17', '12-14', '24-26'],
      dtype=object)
inv_nodes_map = {'0-2':1, '3-5':2,'6-8':3, '9-11':4, '12-14':5, '15-17':6, '18-20':7, '21-23':8, '24-26':9}
df['inv_nodes'] = df.inv_nodes.replace(inv_nodes_map)
df.inv_nodes.head()
0    1
1    1
2    1
3    1
4    1
Name: inv_nodes, dtype: int64
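
As an aside, the three hand-written maps above could be replaced with ordered pandas Categoricals, which derive integer codes from an explicit bin order. A minimal sketch, not run here since the columns are already converted (and note that .codes is 0-based, unlike the 1-based maps above):

# alternative to age_map, shown for age only
age_bins = ['20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
df['age'] = pd.Categorical(df['age'], categories=age_bins, ordered=True).codes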
df.hist(figsize=(10,10))
plt.show()

png (histograms of the numeric features, including the new ordinal encodings)

# binarize menopause: premenopausal (1) or not (0)
df['premeno'] = (df.menopause == 'premeno').astype(int)
df.premeno.head()
0    1
1    1
2    1
3    0
4    1
Name: premeno, dtype: int64
df.drop(columns=['menopause'], inplace=True)
df.head()
target age tumor_size inv_nodes node_caps deg_malig breast breast_quad irradiat node_caps_missing breast_quad_missing premeno
0 no-recurrence-events 2 7 1 no 3 left left_low no 0 0 1
1 no-recurrence-events 3 5 1 no 2 right right_up no 0 0 1
2 no-recurrence-events 3 5 1 no 2 left left_low no 0 0 1
3 no-recurrence-events 5 4 1 no 2 right left_up no 0 0 0
4 no-recurrence-events 3 1 1 no 2 right right_low no 0 0 1
# change yes and no classes to integer values
df['node_caps'] = df.node_caps.replace({'no':0, 'yes':1})
df['irradiat'] = df.irradiat.replace({'no':0, 'yes':1})
# rename ordinal features
rename_ordinal_map = {
    'age' : 'age_ordinal',
    'tumor_size' : 'tumor_size_ordinal',
    'inv_nodes' : 'inv_nodes_ordinal'
}
df.rename(columns = rename_ordinal_map, inplace=True)
# rename target and encode it as 0/1 (note: the positive class, 1, is *no* recurrence)
df.rename(columns={'target':'no_recurrence'}, inplace=True)
df['no_recurrence'] = df.no_recurrence.replace({'no-recurrence-events':1, 'recurrence-events':0})
# replacing sparse classes in breast_quad with 'other'
df.breast_quad.unique()
array(['left_low', 'right_up', 'left_up', 'right_low', 'central',
       'missing'], dtype=object)
# restrict the replacement to breast_quad so values in other columns can't be touched
df['breast_quad'] = df.breast_quad.replace(['right_up', 'right_low', 'central'], 'other')
df.head()
no_recurrence age_ordinal tumor_size_ordinal inv_nodes_ordinal node_caps deg_malig breast breast_quad irradiat node_caps_missing breast_quad_missing premeno
0 1 2 7 1 0 3 left left_low 0 0 0 1
1 1 3 5 1 0 2 right other 0 0 0 1
2 1 3 5 1 0 2 left left_low 0 0 0 1
3 1 5 4 1 0 2 right left_up 0 0 0 0
4 1 3 1 1 0 2 right other 0 0 0 1
df.dtypes
no_recurrence           int64
age_ordinal             int64
tumor_size_ordinal      int64
inv_nodes_ordinal       int64
node_caps              object
deg_malig               int64
breast                 object
breast_quad            object
irradiat                int64
node_caps_missing       int64
breast_quad_missing     int64
premeno                 int64
dtype: object
# get_dummies will generate node_caps_missing and breast_quad_missing dummies itself, so drop our indicator columns first to avoid duplicates
df.drop(columns = ['node_caps_missing', 'breast_quad_missing'], inplace=True)
df.head()
no_recurrence age_ordinal tumor_size_ordinal inv_nodes_ordinal node_caps deg_malig breast breast_quad irradiat premeno
0 1 2 7 1 0 3 left left_low 0 1
1 1 3 5 1 0 2 right other 0 1
2 1 3 5 1 0 2 left left_low 0 1
3 1 5 4 1 0 2 right left_up 0 0
4 1 3 1 1 0 2 right other 0 1
abt = pd.get_dummies(df, columns=['node_caps', 'breast', 'breast_quad'])
abt.head()
no_recurrence age_ordinal tumor_size_ordinal inv_nodes_ordinal deg_malig irradiat premeno node_caps_0 node_caps_1 node_caps_missing breast_left breast_right breast_quad_left_low breast_quad_left_up breast_quad_missing breast_quad_other
0 1 2 7 1 3 0 1 1 0 0 1 0 1 0 0 0
1 1 3 5 1 2 0 1 1 0 0 0 1 0 0 0 1
2 1 3 5 1 2 0 1 1 0 0 1 0 1 0 0 0
3 1 5 4 1 2 0 0 1 0 0 0 1 0 1 0 0
4 1 3 1 1 2 0 1 1 0 0 0 1 0 0 0 1
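
One caveat about the dummy encoding: with default settings, get_dummies emits one column per level, so pairs like node_caps_0/node_caps_1 and breast_left/breast_right are perfectly collinear. The tree models don't care, and the regularized logistic regressions tolerate it, but dropping one level per feature is a common alternative; a sketch:

# alternative encoding that drops one (redundant) level per categorical
abt = pd.get_dummies(df, columns=['node_caps', 'breast', 'breast_quad'], drop_first=True)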

Training and Evaluating Models

y = abt.no_recurrence
X = abt.drop(['no_recurrence'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=123)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(191, 15) (95, 15) (191,) (95,)
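
With only 286 rows and a roughly 70/30 class split, an unstratified split can leave the test set with a noticeably different class balance than the training set. Stratifying on y is a cheap safeguard; a sketch (the results below come from the unstratified split above):

# alternative split that preserves the class ratio in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=123, stratify=y)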
# creating pipelines for logistic regression using l1 and l2 regularization,
# and for random forest and gradient boosting classifiers
pipelines = {
    # liblinear supports both l1 and l2 penalties (newer sklearn defaults to lbfgs, which cannot fit l1)
    'l1': make_pipeline(StandardScaler(), LogisticRegression(penalty='l1', solver='liblinear', random_state=123)),
    'l2': make_pipeline(StandardScaler(), LogisticRegression(penalty='l2', solver='liblinear', random_state=123)),
    'rf': make_pipeline(StandardScaler(), RandomForestClassifier(random_state=123)),
    'gb': make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=123))
}
# setting hyperparameters for l1 and l2 logistic regression
l1_hyperparameters = {'logisticregression__C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]}
l2_hyperparameters = {'logisticregression__C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]}
rf_hyperparameters = {
    'randomforestclassifier__n_estimators': [100, 200],
    # 'auto' equals 'sqrt' for classifiers (and was removed in newer sklearn), so it is dropped here
    'randomforestclassifier__max_features': ['sqrt', 0.33, 0.5],
    'randomforestclassifier__min_samples_leaf': [1, 3, 5, 10]
}
gb_hyperparameters = {
    'gradientboostingclassifier__n_estimators': [100, 200],
    'gradientboostingclassifier__learning_rate': [0.05, 0.1, 0.2],
    'gradientboostingclassifier__max_depth': [1, 3, 5]
}
# setting up hyperparameters dictionary for pipeline workflow
hyperparameters = {
    'l1':l1_hyperparameters,
    'l2':l2_hyperparameters,
    'rf':rf_hyperparameters,
    'gb':gb_hyperparameters
}
fitted_models = {}

for name, pipeline in pipelines.items():
    # 10-fold cross-validated grid search (GridSearchCV's default scoring is accuracy)
    model = GridSearchCV(pipeline, hyperparameters[name], cv=10, n_jobs=-1)
    model.fit(X_train, y_train)
    fitted_models[name] = model
    print(name, 'has been fitted.')
l1 has been fitted.
l2 has been fitted.
rf has been fitted.
gb has been fitted.
for name, model in fitted_models.items():
    print(name, type(model))
l1 <class 'sklearn.model_selection._search.GridSearchCV'>
l2 <class 'sklearn.model_selection._search.GridSearchCV'>
rf <class 'sklearn.model_selection._search.GridSearchCV'>
gb <class 'sklearn.model_selection._search.GridSearchCV'>
for name, model in fitted_models.items():
    try:
        model.predict(X_test)
        print(name, 'can be predicted.')
    except NotFittedError as e:
        print(repr(e))
l1 can be predicted.
l2 can be predicted.
rf can be predicted.
gb can be predicted.
for name, model in fitted_models.items():
    print(name, model.best_score_)
l1 0.7225130890052356
l2 0.7172774869109948
rf 0.7486910994764397
gb 0.7329842931937173
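
A caveat when reading these numbers: best_score_ is the mean cross-validation score under GridSearchCV's default metric, which is accuracy, so it is not directly comparable to the AUROC win condition. To select hyperparameters on AUROC itself, the searches could be rerun with scoring='roc_auc'; a sketch for one pipeline:

# hypothetical rerun that tunes directly for AUROC instead of accuracy
model = GridSearchCV(pipelines['gb'], hyperparameters['gb'],
                     cv=10, n_jobs=-1, scoring='roc_auc')
model.fit(X_train, y_train)
print(model.best_score_)  # now a cross-validated AUROC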
for name, model in fitted_models.items():
    pred_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    print(f'{name} AUROC Score: {roc_auc_score(y_test, pred_prob):.5f}')
l1 AUROC Score: 0.67884
l2 AUROC Score: 0.70469
rf AUROC Score: 0.68497
gb AUROC Score: 0.71908

Analysis

We fell short of our win condition of an AUROC score of 0.80 or higher.

Our best score was 0.719 for the gradient boosting model; l2 logistic regression was close behind with an AUROC of 0.705.

To improve the score we could keep fine-tuning the hyperparameters (a sketch of a widened grid follows below) or try a deep learning algorithm.
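
As one concrete next step, the gradient boosting grid could be widened; subsample and min_samples_leaf often matter on small datasets. This is a sketch of the direction, not a tuned configuration:

# hypothetical wider grid for the gradient boosting pipeline
gb_hyperparameters_wide = {
    'gradientboostingclassifier__n_estimators': [100, 200, 400],
    'gradientboostingclassifier__learning_rate': [0.01, 0.05, 0.1, 0.2],
    'gradientboostingclassifier__max_depth': [1, 2, 3, 5],
    'gradientboostingclassifier__subsample': [0.6, 0.8, 1.0],
    'gradientboostingclassifier__min_samples_leaf': [1, 5, 10]
}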

fitted_models['gb'].best_estimator_.named_steps['gradientboostingclassifier']
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.05, loss='deviance', max_depth=1,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              n_iter_no_change=None, presort='auto', random_state=123,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)
fitted_models['l2'].best_estimator_.named_steps['logisticregression']
LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=123, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
# looking at feature importances of our Gradient Boosting Classifier
importances = fitted_models['gb'].best_estimator_.named_steps['gradientboostingclassifier'].feature_importances_
import_ind = np.argsort(importances)[::-1]
ind_names = [X.columns[i] for i in import_ind]
plt.figure()
plt.title('Feature Importance')
plt.bar(range(X.shape[1]), importances[import_ind])
plt.xticks(range(X.shape[1]), ind_names, rotation=90)
plt.show()

png (feature importance bar chart)

Degree of malignancy and number of invasive nodes were the two most important features, which matches clinical intuition. The breast_quad and breast (left vs. right) dummies contributed nothing, so it might be worth dropping them as features if we run this again, especially if doing so would let us keep or collect more usable data.
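
A hedged sketch of that rerun idea, using the column names produced by get_dummies above: rebuild the model matrix without the location features and repeat the split and grid search.

# hypothetical rerun without the zero-importance location features
drop_cols = [c for c in abt.columns if c.startswith('breast')]  # breast_* and breast_quad_* dummies
X_slim = abt.drop(columns=['no_recurrence'] + drop_cols)
# ...then repeat train_test_split and GridSearchCV on X_slim

Whether this actually helps AUROC would need to be verified with the same cross-validation setup.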