Skip to content

Following data modelling techniques to predict mortality in heart failure patients. Class: Practical Data Science elective, Assignment 2

License

Notifications You must be signed in to change notification settings

lwhitaker-rmit/PDS-A2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PDS-A2

Following data modelling techniques to predict mortality in heart failure patients. This was written as an assignment for a Practical Data Science elective I undertook in 2022. A brief report on the results is also provided.

Dataset Reference

Davide Chicco, Giuseppe Jurman: "Machine learning can predict survival of patients with heart failure from serumcreatinine and ejection fraction alone". BMC Medical Informatics and Decision Making 20, 16 (2020). [Web Link]

Dataset: Heart failure clinical records

Dependencies

This Jupyter Notebook requires Pandas, Matplotlib, NumPy, and scikit-learn. If you would like to run the cells in Assignment2.ipynb on your own, you may install these via the following commandlines, however the results of the code has been provided below, so this is not required.


Retrieving and Preparing the Data

# Initiallistions
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Initiallise raw dataframe to raw_df
raw_df = pd.read_csv('DataSets/Heart failure clinical records Data Set/heart_failure_clinical_records_dataset.csv', sep = ',')

# We will create a copy for this raw dataframe and name it 'df'
# This dataframe will be the primary source of further investigations.
# The 'raw' dataframe is kept for future reference if need be, as some data may be lost in the cleaning process.
df = raw_df.copy()
df.head()
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT
0 75.0 0 582 0 20 1 265000.00 1.9 130 1 0 4 1
1 55.0 0 7861 0 38 0 263358.03 1.1 136 1 0 6 1
2 65.0 0 146 0 20 0 162000.00 1.3 129 1 1 7 1
3 50.0 1 111 0 20 0 210000.00 1.9 137 1 0 7 1
4 65.0 1 160 1 20 0 327000.00 2.7 116 0 0 8 1

Data Cleaning

Initial observations of the dataset show no null/NA values.

df.isnull().sum()
age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64
df.dtypes
age                         float64
anaemia                       int64
creatinine_phosphokinase      int64
diabetes                      int64
ejection_fraction             int64
high_blood_pressure           int64
platelets                   float64
serum_creatinine            float64
serum_sodium                  int64
sex                           int64
smoking                       int64
time                          int64
DEATH_EVENT                   int64
dtype: object

However, the 'age' category is described as a floating point decimal data type, which contradicts the dataset's description of this attribute being an integer. In order so see why this is the case, we will now display all values that are not valid as integers:

df[df['age'] % 1 != 0]['age']
185    60.667
188    60.667
Name: age, dtype: float64

As shown, the values in which are decimal are two patients with ages of '60.667'. It has been decided to round these numbers down, to properly fit within the feature's description.

df['age'] = df['age'].apply(np.floor)
df[df['age'] % 1 != 0]['age']
Series([], Name: age, dtype: float64)

There are no more decimal numbers in 'age'. We may now convert the data type back to integer.

df['age'] = df['age'].astype('int')
df['age'].dtypes
dtype('int32')

The next attribute we will look at is 'creatinine_phosphokinase'. This variable describes the patient's level of creatinine phosphokinase (CPK) in the blood. This is measured in mcg/L (micrograms per Litre) at a range of 23 to 7861 mcg/L.

print(f"""The minimum in this feature is {df['creatinine_phosphokinase'].min()}, 
    the maximum is {df['creatinine_phosphokinase'].max()}, 
    and its data type is \"{df['creatinine_phosphokinase'].dtypes}\".""")
The minimum in this feature is 23, 
    the maximum is 7861, 
    and its data type is "int64".

In order to optimise the memory usage of this dataframe, we will assume that the amount of CPK in a patient's blood will not exceed that of the maximum number describable by the int16 datatype, which is 32767, and as such convert this feature to this less memory-rich datatype.

While this won't make a significant difference in this dataset of 299 patients, this practice is healthy in the event that a larger database is provided.

df['creatinine_phosphokinase'] = df['creatinine_phosphokinase'].astype('int16')
print(f"""The minimum in this feature is {df['creatinine_phosphokinase'].min()}, 
    the maximum is {df['creatinine_phosphokinase'].max()}, 
    and its data type is \"{df['creatinine_phosphokinase'].dtypes}\".""")
The minimum in this feature is 23, 
    the maximum is 7861, 
    and its data type is "int16".

We will similarly conduct the same method for the 'age' feature, where the maximum number describable with 'int8' being 127.

df['age'] = df['age'].astype('int8')
print(f"""The minimum in this feature is {df['age'].min()},
    the maximum is {df['age'].max()}, 
    and its data type is \"{df['age'].dtypes}\".""")
The minimum in this feature is 40,
    the maximum is 95, 
    and its data type is "int8".

And continue this method of integer optimisation for the 'ejection_fraction', 'serum_sodium', and 'time' features.

df['ejection_fraction'] = df['ejection_fraction'].astype('int8')
df['serum_sodium'] = df['serum_sodium'].astype('int16')
df['time'] = df['time'].astype('int16')
df[['ejection_fraction', 'serum_sodium', 'time']].dtypes
ejection_fraction     int8
serum_sodium         int16
time                 int16
dtype: object

Whist exploring the other features, it was noticed that the 'platelets' feature in the dataset are described as "kiloplatelets/mL" (Table 1 of the data's source document), however the dataset given is in platelets/mL. This can easily be changed by dividing each platelet count by 1000.

df['platelets'] = df['platelets'].div(1000).round(2)
df['platelets'].head()
0    265.00
1    263.36
2    162.00
3    210.00
4    327.00
Name: platelets, dtype: float64

Another strage pattern that was observed is that there are 25 intances where seperate patients have the exact same platelet count of 263.36 kiloplatelets/mL.

df['platelets'].value_counts()
263.36    25
255.00     4
237.00     4
271.00     4
228.00     4
          ..
155.00     1
257.00     1
309.00     1
461.00     1
336.00     1
Name: platelets, Length: 176, dtype: int64

This is highly obscure and most likely an error with the data set, either with input or otherwise. These patients also display no other correlations, and as such, their observations must be removed from the database in order to ensure the modelling proccess is not given misleading data.

As far as I can tell, this anomaly has been overlooked by the original document.

df[df['platelets'] == 263.36].head()
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT
1 55 0 7861 0 38 0 263.36 1.10 136 1 0 6 1
8 65 0 157 0 65 0 263.36 1.50 138 0 0 10 1
24 75 0 582 1 30 1 263.36 1.83 134 0 0 23 1
30 94 0 582 1 38 1 263.36 1.83 134 1 0 27 1
40 70 0 582 0 20 1 263.36 1.83 134 1 1 31 1
df = df[df['platelets'] != 263.36].reset_index(drop=True)
df[df['platelets'] == 263.36]['platelets'].count()
0
print(f"The dataset now contains {df.shape[0]} patients.")
The dataset now contains 274 patients.

Similarly, the CPK feature contains 34 instances where the patient's reading is 582 mcg/L.

df['creatinine_phosphokinase'].value_counts()
582     34
66       4
129      4
47       3
115      3
        ..
149      1
151      1
154      1
156      1
1021     1
Name: creatinine_phosphokinase, Length: 197, dtype: int64

This is highly unlikely due to the measurement type being a highly variable statistic. It is again assumed to be erroneous and will be removed from the dataset.

df[df['creatinine_phosphokinase'] == 582].head()
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT
0 75 0 582 0 20 1 265.0 1.9 130 1 0 4 1
15 45 0 582 0 14 0 166.0 0.8 127 1 0 14 1
17 48 1 582 1 55 0 87.0 1.9 121 0 0 15 1
31 69 0 582 1 35 0 228.0 3.5 134 1 0 30 1
40 50 0 582 1 38 0 310.0 1.9 135 1 1 35 1
df = df[df['creatinine_phosphokinase'] != 582].reset_index(drop=True)
df[df['creatinine_phosphokinase'] == 582]['creatinine_phosphokinase'].count()
0
print(f"The dataset now contains {df.shape[0]} patients.")
The dataset now contains 240 patients.

Pre-Processing

This section will primarily focus on pre-processing the binary features of the dataset to improve the data exploration section. Before continuing, we will make a copy of the dataframe in its current state such that it can be used for the modelling process later. The features in question are 'anaemia', 'diabetes', 'high_blood_pressure', 'smoking', and 'DEATH_EVENT'.These variables describe a boolean value, where 0 is false and 1 is true. Another binary variable in this dataset describes 'sex' as either a 0 or a 1, however which gender each number signifies is unfortunately not defined in either the UCI Repository nor the data's source document.

These variable values, other than those given in 'sex', will now be given a more descriptive name.

df_before_pp = df.copy()
df[['anaemia', 'diabetes', 'high_blood_pressure', 'smoking', 'DEATH_EVENT']]
anaemia diabetes high_blood_pressure smoking DEATH_EVENT
0 0 0 0 1 1
1 1 0 0 0 1
2 1 1 0 0 1
3 1 0 1 1 1
4 1 0 0 0 1
... ... ... ... ... ...
235 0 1 1 1 0
236 0 0 0 0 0
237 0 1 0 0 0
238 0 0 0 1 0
239 0 0 0 1 0

240 rows Ă— 5 columns

df[['anaemia', 'diabetes', 'high_blood_pressure', 'smoking', 'DEATH_EVENT']] = df[['anaemia', 'diabetes', 'high_blood_pressure', 'smoking', 'DEATH_EVENT']].astype("bool")

# Create categories
cat_anaemia   = ['anaemic', 'normal']
cat_diabetes  = ['diabetic', 'normal']
cat_hbp       = ['hypertensive', 'normal']
cat_smoking   = ['smoker', 'non-smoker']
cat_death     = ['dead', 'alive']

# Apply categories
df['anaemia'] = np.select([df['anaemia'] == True, df['anaemia'] == False], cat_anaemia)
df['diabetes'] = np.select([df['diabetes'] == True, df['diabetes'] == False], cat_diabetes)
df['high_blood_pressure'] = np.select([df['high_blood_pressure'] == True, df['high_blood_pressure'] == False], cat_hbp)
df['smoking'] = np.select([df['smoking'] == True, df['smoking'] == False], cat_smoking)
df['DEATH_EVENT'] = np.select([df['DEATH_EVENT'] == True, df['DEATH_EVENT'] == False], cat_death)

# Additionally apply DEATH_EVENT category to df_before_pp
df_before_pp['DEATH_EVENT'] = np.select([df_before_pp['DEATH_EVENT'] == 1, df_before_pp['DEATH_EVENT'] == 0], cat_death)

df[['anaemia', 'diabetes', 'high_blood_pressure', 'smoking', 'DEATH_EVENT']]
anaemia diabetes high_blood_pressure smoking DEATH_EVENT
0 normal normal normal smoker dead
1 anaemic normal normal non-smoker dead
2 anaemic diabetic normal non-smoker dead
3 anaemic normal hypertensive smoker dead
4 anaemic normal normal non-smoker dead
... ... ... ... ... ...
235 normal diabetic hypertensive smoker alive
236 normal normal normal non-smoker alive
237 normal diabetic normal non-smoker alive
238 normal normal normal smoker alive
239 normal normal normal smoker alive

240 rows Ă— 5 columns

In this section, for simplicity we will also divide the 'age' feature into two classifications as seperated by the median age: 'middleage': 40-59 and 'elderly': 60+.

print(f"The median age for this dataset is {df['age'].median().astype('int')}.")
The median age for this dataset is 60.
conditions = [df['age'] < df['age'].median(), df['age'] >= df['age'].median()]
category = ["middleage", "elderly"]

df['age'] = np.select(conditions, category)
df[['age']].head()
age
0 elderly
1 middleage
2 elderly
3 elderly
4 elderly

Now that the dataset and dataframe is clean and preprocessed, we are ready for investigation.

Data Exploration

# Adjust font size here
font = {'size'   : 22}
plt.rc('font', **font)
fig01, ax01 = plt.subplots()

ax01.bar(df['age'].unique(), df['age'].value_counts())

plt.title("fig.1 Age Classification Distribution of Dataset")
plt.ylabel("Count")
fig01.set_size_inches(17, 10)
plt.show()

png

df['age'].value_counts()
elderly      141
middleage     99
Name: age, dtype: int64

There are significantly more elderly patients in this dataset than middleaged patients.

fig02, ax02 = plt.subplots()

ax02.bar(df['anaemia'].unique(), df['anaemia'].value_counts())

plt.title("fig.2 Anaemic Patient Distribution of Dataset")
plt.ylabel("Count")
fig02.set_size_inches(17, 10)
plt.show()

png

df['anaemia'].value_counts()
normal     120
anaemic    120
Name: anaemia, dtype: int64

Exactly 50% of patients in the dataset suffer from anaemia. This is nearly twice the expected world average, which is said to be anywhere from 25% {ekfdiagnostics.com} to 30% {clevelandclinic.org}.

fig03, ax03 = plt.subplots()

ax03.bar(df['high_blood_pressure'].unique(), df['high_blood_pressure'].value_counts())

plt.title("fig.3 Hypertensive Patient Distribution of Dataset")
plt.ylabel("Count")
fig03.set_size_inches(17, 10)
plt.show()

png

df['high_blood_pressure'].value_counts()
normal          152
hypertensive     88
Name: high_blood_pressure, dtype: int64
fig04, ax04 = plt.subplots()

ax04.bar(df['diabetes'].unique(), df['diabetes'].value_counts())

plt.title("fig.4 Diabetes Patient Distribution of Dataset")
plt.ylabel("Count")
fig04.set_size_inches(17, 10)
plt.show()

png

fig05, ax05 = plt.subplots()

ax05.bar([str(i) for i in df['sex'].unique()], df['sex'].value_counts())

plt.title("fig.5 Patient Gender Distribution of Dataset")
plt.ylabel("Count")
fig05.set_size_inches(17, 10)
plt.show()

png

There is significantly more of one gender in this dataset than the other, yet once again, we cannot know for sure which gender is which as it is not mentioned in the sources.

fig06, ax06 = plt.subplots()

ax06.bar(df['DEATH_EVENT'].unique(), df['DEATH_EVENT'].value_counts())

plt.title("fig.6 Patient Death Distribution of Dataset")
plt.ylabel("Count")
fig06.set_size_inches(17, 10)
plt.show()

png

fig07, ax07 = plt.subplots()

ax07.boxplot(df['ejection_fraction'], vert = False, flierprops = dict(markerfacecolor='b', marker='D'), labels = ['Patients'])
plt.title("fig.7 Ejection Fraction of Patients")
plt.xlabel("Ejection Fraction (%)")
fig07.set_size_inches(17, 10)
plt.show()

png

fig08, ax08 = plt.subplots()

ax08.boxplot(df['platelets'], vert = False, flierprops = dict(markerfacecolor='b', marker='D'), labels = ['Patients'])
plt.title("fig.8 Platelet Count of Patients")
plt.xlabel("kiloplatelets/mL")
fig08.set_size_inches(17, 10)
plt.show()

png

fig09, ax09 = plt.subplots()

ax09.boxplot(df['time'], vert = False, flierprops = dict(markerfacecolor='b', marker='D'), labels = ['Patients'])
plt.title("fig.9 Follow-Up Period of Patients")
plt.xlabel("Days")
fig09.set_size_inches(17, 10)
plt.show()

png

fig10, ax10 = plt.subplots()

ax10.bar(df['smoking'].unique(), df['smoking'].value_counts(sort = False))

plt.title("fig.10 Smoking Patients Distribution of Dataset")
plt.ylabel("Count")
fig10.set_size_inches(17, 10)
plt.show()

png

df['smoking'].value_counts(sort = False)
non-smoker    162
smoker         78
Name: smoking, dtype: int64
print('Percentage of smokers: ', round(df[df['smoking'] == 'smoker'].shape[0] / df.shape[0], 2) * 100,'%.')
Percentage of smokers:  33.0 %.

Another huge cause for concern: there are significantly more heart failure patients who smoke than those who don't.

=====================================

We will proceed with this task beginning by studying the relationship between creatinine phosphokinase (CPK) and serum creatinine.

The normal level of CPK in the blood is 10-120 mcg/L. {mountsinai.org}

fig11a, ax11a = plt.subplots()

ax11a.scatter(df[df['DEATH_EVENT'] == 'alive']['creatinine_phosphokinase'], 
             df[df['DEATH_EVENT'] == 'alive']['serum_creatinine'],
            label = "Alive")
ax11a.scatter(df[df['DEATH_EVENT'] == 'dead']['creatinine_phosphokinase'], 
             df[df['DEATH_EVENT'] == 'dead']['serum_creatinine'],
            label = "Dead")

plt.title("fig.11a Creatinine Presence in Blood of Alive/Deceased Patients")
plt.xlabel("CPK (Creatinine Phosphokinase) in blood (mcg/L)")
plt.ylabel("Serum Creatinine (mg/dL)")
ax11a.legend()
fig11a.set_size_inches(17, 10)
plt.show()

png

fig11b, ax11b = plt.subplots()

ax11b.scatter(df[df['DEATH_EVENT'] == 'alive']['creatinine_phosphokinase'], 
             df[df['DEATH_EVENT'] == 'alive']['serum_creatinine'],
            label = "Alive")
ax11b.scatter(df[df['DEATH_EVENT'] == 'dead']['creatinine_phosphokinase'], 
             df[df['DEATH_EVENT'] == 'dead']['serum_creatinine'],
            label = "Dead")

# Display max normal CPK level dashed line:
plt.axvline(x=120, color='grey', linestyle='--')

plt.title("fig.11b Creatinine Presence in Blood of Alive/Deceased Patients (log(CPK))")
plt.xlabel("CPK (Creatinine Phosphokinase) in blood (mcg/L)")
plt.ylabel("Serum Creatinine (mg/dL)")
ax11b.legend()
fig11b.set_size_inches(17, 10)

ax11b.set_xscale('log')
plt.show()

png

At a glance, it can be seen that the levels of CPK and serum creatinine are inversely related. That is, as the amount of CPK in the blood increases, the amount of Serum Creatine decreases, and vice versa.

normal = df[df['creatinine_phosphokinase'] <= 120]
normal[normal['DEATH_EVENT'] == 'alive'].shape[0]
58
fig11c, ax11c = plt.subplots()

normal = df[df['creatinine_phosphokinase'] <= 120]
abnormal = df[df['creatinine_phosphokinase'] > 120]

labels = ['Normal CPK Levels', 'Abnormal CPK Levels']
alive = [normal[normal['DEATH_EVENT'] == 'alive'].shape[0], abnormal[abnormal['DEATH_EVENT'] == 'alive'].shape[0]]
deceased = [normal[normal['DEATH_EVENT'] == 'dead'].shape[0], abnormal[abnormal['DEATH_EVENT'] == 'dead'].shape[0]]

x = np.arange(len(labels))

width = 0.35

rects1 = ax11c.bar(x - width/2, alive, width, label='Alive')
rects2 = ax11c.bar(x + width/2, deceased, width, label='Deceased')

ax11c.set_ylabel('Count')
ax11c.set_title('fig.11c Number of Patients with Normal and Abnormal CPK Levels')
plt.xticks(x)
ax11c.set_xticklabels(labels)
ax11c.legend()

fig11c.set_size_inches(17, 10)

plt.show()

png

print("Mortality increases from ", 
      round(normal[normal['DEATH_EVENT'] == 'dead'].shape[0] / normal.shape[0], 2) * 100,
      "% to ",
      round(abnormal[abnormal['DEATH_EVENT'] == 'dead'].shape[0] / abnormal.shape[0], 2) * 100,
      "%.")
Mortality increases from  25.0 % to  34.0 %.

Data Modelling

The target value(s) for this dataset is the 'DEATH_EVENT' feature. We will be using the random state of '63'.

# Initiallising modelling and target DataFrames
model_df = df_before_pp.loc[:, df.columns != 'DEATH_EVENT']
target_df = df_before_pp['DEATH_EVENT']
model_df.shape
(240, 12)
target_df.shape
(240,)
# Import ML and analysis modules
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report


# Splitting data 90-10: 90% for training, 10% for testing
# With a small dataset, we require as much training as possible
x_train, x_test, y_train, y_test = train_test_split(model_df, target_df, test_size = 0.1, random_state = 63)
x_train, x_test, y_train, y_test
(     age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
 186   65        0                        56         0                 25   
 155   60        1                       104         1                 30   
 90    45        0                       292         1                 35   
 190   40        1                       129         0                 35   
 235   62        0                        61         1                 38   
 ..   ...      ...                       ...       ...                ...   
 82    65        1                       305         0                 25   
 215   65        1                       258         1                 25   
 139   59        0                        66         1                 20   
 116   46        1                       291         0                 35   
 44    70        1                        75         0                 35   
 
      high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
 186                    0      237.0               5.0           130    0   
 155                    0      389.0               1.5           136    1   
 90                     0      850.0               1.3           142    1   
 190                    0      255.0               0.9           137    1   
 235                    1      155.0               1.1           143    1   
 ..                   ...        ...               ...           ...  ...   
 82                     0      298.0               1.1           141    1   
 215                    0      198.0               1.4           129    1   
 139                    0       70.0               2.4           134    1   
 116                    0      348.0               0.9           140    0   
 44                     0      223.0               2.7           138    1   
 
      smoking  time  
 186        0   207  
 155        0   171  
 90         1    88  
 190        0   209  
 235        1   270  
 ..       ...   ...  
 82         0    87  
 215        0   235  
 139        0   135  
 116        0   109  
 44         1    54  
 
 [216 rows x 12 columns],
      age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
 52    42        1                       250         1                 15   
 154   58        1                       145         0                 25   
 191   53        1                       707         0                 38   
 145   60        1                        95         0                 60   
 68    51        0                        78         0                 50   
 146   69        0                      1419         0                 40   
 230   90        1                       337         0                 38   
 161   73        1                       231         1                 30   
 63    55        0                       336         0                 45   
 107   53        1                       270         1                 35   
 10    50        1                       168         0                 38   
 173   49        0                       972         1                 35   
 233   52        0                       190         1                 38   
 76    58        0                       144         1                 38   
 51    60        0                        68         0                 20   
 72    57        1                       115         0                 25   
 194   70        0                       232         0                 30   
 89    63        0                       936         0                 38   
 93    50        0                       369         1                 25   
 130   50        0                       250         0                 25   
 163   65        0                       167         0                 30   
 138   53        0                       196         0                 60   
 42    60        1                       260         1                 38   
 59    60        1                        47         0                 20   
 
      high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
 52                     0      213.0              1.30           136    0   
 154                    0      219.0              1.20           137    1   
 191                    0      330.0              1.40           137    1   
 145                    0      337.0              1.00           138    1   
 68                     0      406.0              0.70           140    1   
 146                    0      105.0              1.00           135    1   
 230                    0      390.0              0.90           144    0   
 161                    0      160.0              1.18           142    1   
 63                     1      324.0              0.90           140    0   
 107                    0      227.0              3.40           145    1   
 10                     1      276.0              1.10           137    1   
 173                    1      268.0              0.80           130    0   
 233                    0      382.0              1.00           140    1   
 76                     1      327.0              0.70           142    0   
 51                     0      119.0              2.90           127    1   
 72                     1      181.0              1.10           144    1   
 194                    0      173.0              1.20           132    1   
 89                     0      304.0              1.10           133    1   
 93                     0      252.0              1.60           136    1   
 130                    0      262.0              1.00           136    1   
 163                    0      259.0              0.80           138    0   
 138                    0      220.0              0.70           133    1   
 42                     0      255.0              2.20           132    0   
 59                     0      204.0              0.70           139    1   
 
      smoking  time  
 52         0    65  
 154        1   170  
 191        1   209  
 145        1   146  
 68         0    79  
 146        1   147  
 230        0   256  
 161        1   180  
 63         0    74  
 107        0   105  
 10         0    11  
 173        0   187  
 233        1   258  
 76         0    83  
 51         1    64  
 72         0    79  
 194        0   210  
 89         1    88  
 93         0    90  
 130        1   120  
 163        0   186  
 138        1   134  
 42         1    45  
 59         1    73  ,
 186    alive
 155     dead
 90     alive
 190    alive
 235    alive
        ...  
 82     alive
 215     dead
 139     dead
 116    alive
 44     alive
 Name: DEATH_EVENT, Length: 216, dtype: object,
 52      dead
 154     dead
 191    alive
 145    alive
 68     alive
 146    alive
 230    alive
 161    alive
 63     alive
 107    alive
 10      dead
 173    alive
 233    alive
 76     alive
 51      dead
 72     alive
 194    alive
 89     alive
 93     alive
 130    alive
 163    alive
 138    alive
 42      dead
 59      dead
 Name: DEATH_EVENT, dtype: object)
x_train.shape, x_test.shape
((216, 12), (24, 12))
y_train.shape, y_test.shape
((216,), (24,))
# This loop manually finds an acceptable Nearest Neighbor value
for i in range(1,20):
    # Iterate NN & fit
    clf = KNeighborsClassifier(i)
    clf.fit(x_train, y_train)
    
    # pred --> predicted result from test
    pred = clf.predict(x_test)
    pred, y_test
    
    # result --> DataFrame containing the predicted vs actual result
    result = pd.DataFrame(data = {
        'Predicted': pred,
        'Actual': y_test
    })
    # Primary concern: True Negatives (correct death guesses)
    # There are unfortunately no cases where the death of a patient 
    # was predicted successfully more than 3 times.
    if result[(result['Predicted'] == 'dead') & (result['Actual'] == 'dead')].shape[0] > 3:
        print('Nearest Neighbor: ', i, 'Correct Guesses: ', result[result['Predicted'] == result['Actual']].shape[0], ' Correct Death Guesses: ', result[(result['Predicted'] == 'dead') & (result['Actual'] == 'dead')].shape[0])
Nearest Neighbor:  13 Correct Guesses:  21  Correct Death Guesses:  4
Nearest Neighbor:  17 Correct Guesses:  21  Correct Death Guesses:  4

Nearest Neighbor of 13 or 17 is ideal, as it returns an acceptable prediction. We will use 13.

# Set nearest neighbor value
clf = KNeighborsClassifier(13)
# Fit prediction
clf.fit(x_train, y_train)
pred = clf.predict(x_test)
# Confusion Matirx
cm = confusion_matrix(y_test, pred)
cm
array([[17,  1],
       [ 2,  4]], dtype=int64)
# Classification report
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

       alive       0.89      0.94      0.92        18
        dead       0.80      0.67      0.73         6

    accuracy                           0.88        24
   macro avg       0.85      0.81      0.82        24
weighted avg       0.87      0.88      0.87        24

About

Following data modelling techniques to predict mortality in heart failure patients. Class: Practical Data Science elective, Assignment 2

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published