Following data modelling techniques to predict mortality in heart failure patients. This was written as an assignment for a Practical Data Science elective I undertook in 2022. A brief report on the results is also provided.
Davide Chicco, Giuseppe Jurman: "Machine learning can predict survival of patients with heart failure from serumcreatinine and ejection fraction alone". BMC Medical Informatics and Decision Making 20, 16 (2020). [Web Link]
Dataset: Heart failure clinical records
This Jupyter Notebook requires Pandas, Matplotlib, NumPy, and scikit-learn. If you would like to run the cells in Assignment2.ipynb on your own, you may install these via the following commandlines, however the results of the code has been provided below, so this is not required.
# Initiallistions
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Initiallise raw dataframe to raw_df
raw_df = pd.read_csv('DataSets/Heart failure clinical records Data Set/heart_failure_clinical_records_dataset.csv', sep = ',')
# We will create a copy for this raw dataframe and name it 'df'
# This dataframe will be the primary source of further investigations.
# The 'raw' dataframe is kept for future reference if need be, as some data may be lost in the cleaning process.
df = raw_df.copy()
df.head()| age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.00 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.00 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.00 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.00 | 2.7 | 116 | 0 | 0 | 8 | 1 |
Initial observations of the dataset show no null/NA values.
df.isnull().sum()age 0
anaemia 0
creatinine_phosphokinase 0
diabetes 0
ejection_fraction 0
high_blood_pressure 0
platelets 0
serum_creatinine 0
serum_sodium 0
sex 0
smoking 0
time 0
DEATH_EVENT 0
dtype: int64
df.dtypesage float64
anaemia int64
creatinine_phosphokinase int64
diabetes int64
ejection_fraction int64
high_blood_pressure int64
platelets float64
serum_creatinine float64
serum_sodium int64
sex int64
smoking int64
time int64
DEATH_EVENT int64
dtype: object
However, the 'age' category is described as a floating point decimal data type, which contradicts the dataset's description of this attribute being an integer. In order so see why this is the case, we will now display all values that are not valid as integers:
df[df['age'] % 1 != 0]['age']185 60.667
188 60.667
Name: age, dtype: float64
As shown, the values in which are decimal are two patients with ages of '60.667'. It has been decided to round these numbers down, to properly fit within the feature's description.
df['age'] = df['age'].apply(np.floor)
df[df['age'] % 1 != 0]['age']Series([], Name: age, dtype: float64)
There are no more decimal numbers in 'age'. We may now convert the data type back to integer.
df['age'] = df['age'].astype('int')
df['age'].dtypesdtype('int32')
The next attribute we will look at is 'creatinine_phosphokinase'. This variable describes the patient's level of creatinine phosphokinase (CPK) in the blood. This is measured in mcg/L (micrograms per Litre) at a range of 23 to 7861 mcg/L.
print(f"""The minimum in this feature is {df['creatinine_phosphokinase'].min()},
the maximum is {df['creatinine_phosphokinase'].max()},
and its data type is \"{df['creatinine_phosphokinase'].dtypes}\".""")The minimum in this feature is 23,
the maximum is 7861,
and its data type is "int64".
In order to optimise the memory usage of this dataframe, we will assume that the amount of CPK in a patient's blood will not exceed that of the maximum number describable by the int16 datatype, which is 32767, and as such convert this feature to this less memory-rich datatype.
While this won't make a significant difference in this dataset of 299 patients, this practice is healthy in the event that a larger database is provided.
df['creatinine_phosphokinase'] = df['creatinine_phosphokinase'].astype('int16')
print(f"""The minimum in this feature is {df['creatinine_phosphokinase'].min()},
the maximum is {df['creatinine_phosphokinase'].max()},
and its data type is \"{df['creatinine_phosphokinase'].dtypes}\".""")The minimum in this feature is 23,
the maximum is 7861,
and its data type is "int16".
We will similarly conduct the same method for the 'age' feature, where the maximum number describable with 'int8' being 127.
df['age'] = df['age'].astype('int8')
print(f"""The minimum in this feature is {df['age'].min()},
the maximum is {df['age'].max()},
and its data type is \"{df['age'].dtypes}\".""")The minimum in this feature is 40,
the maximum is 95,
and its data type is "int8".
And continue this method of integer optimisation for the 'ejection_fraction', 'serum_sodium', and 'time' features.
df['ejection_fraction'] = df['ejection_fraction'].astype('int8')
df['serum_sodium'] = df['serum_sodium'].astype('int16')
df['time'] = df['time'].astype('int16')
df[['ejection_fraction', 'serum_sodium', 'time']].dtypesejection_fraction int8
serum_sodium int16
time int16
dtype: object
Whist exploring the other features, it was noticed that the 'platelets' feature in the dataset are described as "kiloplatelets/mL" (Table 1 of the data's source document), however the dataset given is in platelets/mL. This can easily be changed by dividing each platelet count by 1000.
df['platelets'] = df['platelets'].div(1000).round(2)
df['platelets'].head()0 265.00
1 263.36
2 162.00
3 210.00
4 327.00
Name: platelets, dtype: float64
Another strage pattern that was observed is that there are 25 intances where seperate patients have the exact same platelet count of 263.36 kiloplatelets/mL.
df['platelets'].value_counts()263.36 25
255.00 4
237.00 4
271.00 4
228.00 4
..
155.00 1
257.00 1
309.00 1
461.00 1
336.00 1
Name: platelets, Length: 176, dtype: int64
This is highly obscure and most likely an error with the data set, either with input or otherwise. These patients also display no other correlations, and as such, their observations must be removed from the database in order to ensure the modelling proccess is not given misleading data.
As far as I can tell, this anomaly has been overlooked by the original document.
df[df['platelets'] == 263.36].head()| age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 55 | 0 | 7861 | 0 | 38 | 0 | 263.36 | 1.10 | 136 | 1 | 0 | 6 | 1 |
| 8 | 65 | 0 | 157 | 0 | 65 | 0 | 263.36 | 1.50 | 138 | 0 | 0 | 10 | 1 |
| 24 | 75 | 0 | 582 | 1 | 30 | 1 | 263.36 | 1.83 | 134 | 0 | 0 | 23 | 1 |
| 30 | 94 | 0 | 582 | 1 | 38 | 1 | 263.36 | 1.83 | 134 | 1 | 0 | 27 | 1 |
| 40 | 70 | 0 | 582 | 0 | 20 | 1 | 263.36 | 1.83 | 134 | 1 | 1 | 31 | 1 |
df = df[df['platelets'] != 263.36].reset_index(drop=True)
df[df['platelets'] == 263.36]['platelets'].count()0
print(f"The dataset now contains {df.shape[0]} patients.")The dataset now contains 274 patients.
Similarly, the CPK feature contains 34 instances where the patient's reading is 582 mcg/L.
df['creatinine_phosphokinase'].value_counts()582 34
66 4
129 4
47 3
115 3
..
149 1
151 1
154 1
156 1
1021 1
Name: creatinine_phosphokinase, Length: 197, dtype: int64
This is highly unlikely due to the measurement type being a highly variable statistic. It is again assumed to be erroneous and will be removed from the dataset.
df[df['creatinine_phosphokinase'] == 582].head()| age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75 | 0 | 582 | 0 | 20 | 1 | 265.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 15 | 45 | 0 | 582 | 0 | 14 | 0 | 166.0 | 0.8 | 127 | 1 | 0 | 14 | 1 |
| 17 | 48 | 1 | 582 | 1 | 55 | 0 | 87.0 | 1.9 | 121 | 0 | 0 | 15 | 1 |
| 31 | 69 | 0 | 582 | 1 | 35 | 0 | 228.0 | 3.5 | 134 | 1 | 0 | 30 | 1 |
| 40 | 50 | 0 | 582 | 1 | 38 | 0 | 310.0 | 1.9 | 135 | 1 | 1 | 35 | 1 |
df = df[df['creatinine_phosphokinase'] != 582].reset_index(drop=True)
df[df['creatinine_phosphokinase'] == 582]['creatinine_phosphokinase'].count()0
print(f"The dataset now contains {df.shape[0]} patients.")The dataset now contains 240 patients.
This section will primarily focus on pre-processing the binary features of the dataset to improve the data exploration section. Before continuing, we will make a copy of the dataframe in its current state such that it can be used for the modelling process later. The features in question are 'anaemia', 'diabetes', 'high_blood_pressure', 'smoking', and 'DEATH_EVENT'.These variables describe a boolean value, where 0 is false and 1 is true. Another binary variable in this dataset describes 'sex' as either a 0 or a 1, however which gender each number signifies is unfortunately not defined in either the UCI Repository nor the data's source document.
These variable values, other than those given in 'sex', will now be given a more descriptive name.
df_before_pp = df.copy()
df[['anaemia', 'diabetes', 'high_blood_pressure', 'smoking', 'DEATH_EVENT']]| anaemia | diabetes | high_blood_pressure | smoking | DEATH_EVENT | |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 1 | 0 | 0 | 0 | 1 |
| 2 | 1 | 1 | 0 | 0 | 1 |
| 3 | 1 | 0 | 1 | 1 | 1 |
| 4 | 1 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... |
| 235 | 0 | 1 | 1 | 1 | 0 |
| 236 | 0 | 0 | 0 | 0 | 0 |
| 237 | 0 | 1 | 0 | 0 | 0 |
| 238 | 0 | 0 | 0 | 1 | 0 |
| 239 | 0 | 0 | 0 | 1 | 0 |
240 rows Ă— 5 columns
df[['anaemia', 'diabetes', 'high_blood_pressure', 'smoking', 'DEATH_EVENT']] = df[['anaemia', 'diabetes', 'high_blood_pressure', 'smoking', 'DEATH_EVENT']].astype("bool")
# Create categories
cat_anaemia = ['anaemic', 'normal']
cat_diabetes = ['diabetic', 'normal']
cat_hbp = ['hypertensive', 'normal']
cat_smoking = ['smoker', 'non-smoker']
cat_death = ['dead', 'alive']
# Apply categories
df['anaemia'] = np.select([df['anaemia'] == True, df['anaemia'] == False], cat_anaemia)
df['diabetes'] = np.select([df['diabetes'] == True, df['diabetes'] == False], cat_diabetes)
df['high_blood_pressure'] = np.select([df['high_blood_pressure'] == True, df['high_blood_pressure'] == False], cat_hbp)
df['smoking'] = np.select([df['smoking'] == True, df['smoking'] == False], cat_smoking)
df['DEATH_EVENT'] = np.select([df['DEATH_EVENT'] == True, df['DEATH_EVENT'] == False], cat_death)
# Additionally apply DEATH_EVENT category to df_before_pp
df_before_pp['DEATH_EVENT'] = np.select([df_before_pp['DEATH_EVENT'] == 1, df_before_pp['DEATH_EVENT'] == 0], cat_death)
df[['anaemia', 'diabetes', 'high_blood_pressure', 'smoking', 'DEATH_EVENT']]| anaemia | diabetes | high_blood_pressure | smoking | DEATH_EVENT | |
|---|---|---|---|---|---|
| 0 | normal | normal | normal | smoker | dead |
| 1 | anaemic | normal | normal | non-smoker | dead |
| 2 | anaemic | diabetic | normal | non-smoker | dead |
| 3 | anaemic | normal | hypertensive | smoker | dead |
| 4 | anaemic | normal | normal | non-smoker | dead |
| ... | ... | ... | ... | ... | ... |
| 235 | normal | diabetic | hypertensive | smoker | alive |
| 236 | normal | normal | normal | non-smoker | alive |
| 237 | normal | diabetic | normal | non-smoker | alive |
| 238 | normal | normal | normal | smoker | alive |
| 239 | normal | normal | normal | smoker | alive |
240 rows Ă— 5 columns
In this section, for simplicity we will also divide the 'age' feature into two classifications as seperated by the median age: 'middleage': 40-59 and 'elderly': 60+.
print(f"The median age for this dataset is {df['age'].median().astype('int')}.")The median age for this dataset is 60.
conditions = [df['age'] < df['age'].median(), df['age'] >= df['age'].median()]
category = ["middleage", "elderly"]
df['age'] = np.select(conditions, category)
df[['age']].head()| age | |
|---|---|
| 0 | elderly |
| 1 | middleage |
| 2 | elderly |
| 3 | elderly |
| 4 | elderly |
Now that the dataset and dataframe is clean and preprocessed, we are ready for investigation.
# Adjust font size here
font = {'size' : 22}
plt.rc('font', **font)fig01, ax01 = plt.subplots()
ax01.bar(df['age'].unique(), df['age'].value_counts())
plt.title("fig.1 Age Classification Distribution of Dataset")
plt.ylabel("Count")
fig01.set_size_inches(17, 10)
plt.show()df['age'].value_counts()elderly 141
middleage 99
Name: age, dtype: int64
There are significantly more elderly patients in this dataset than middleaged patients.
fig02, ax02 = plt.subplots()
ax02.bar(df['anaemia'].unique(), df['anaemia'].value_counts())
plt.title("fig.2 Anaemic Patient Distribution of Dataset")
plt.ylabel("Count")
fig02.set_size_inches(17, 10)
plt.show()df['anaemia'].value_counts()normal 120
anaemic 120
Name: anaemia, dtype: int64
Exactly 50% of patients in the dataset suffer from anaemia. This is nearly twice the expected world average, which is said to be anywhere from 25% {ekfdiagnostics.com} to 30% {clevelandclinic.org}.
fig03, ax03 = plt.subplots()
ax03.bar(df['high_blood_pressure'].unique(), df['high_blood_pressure'].value_counts())
plt.title("fig.3 Hypertensive Patient Distribution of Dataset")
plt.ylabel("Count")
fig03.set_size_inches(17, 10)
plt.show()df['high_blood_pressure'].value_counts()normal 152
hypertensive 88
Name: high_blood_pressure, dtype: int64
fig04, ax04 = plt.subplots()
ax04.bar(df['diabetes'].unique(), df['diabetes'].value_counts())
plt.title("fig.4 Diabetes Patient Distribution of Dataset")
plt.ylabel("Count")
fig04.set_size_inches(17, 10)
plt.show()fig05, ax05 = plt.subplots()
ax05.bar([str(i) for i in df['sex'].unique()], df['sex'].value_counts())
plt.title("fig.5 Patient Gender Distribution of Dataset")
plt.ylabel("Count")
fig05.set_size_inches(17, 10)
plt.show()There is significantly more of one gender in this dataset than the other, yet once again, we cannot know for sure which gender is which as it is not mentioned in the sources.
fig06, ax06 = plt.subplots()
ax06.bar(df['DEATH_EVENT'].unique(), df['DEATH_EVENT'].value_counts())
plt.title("fig.6 Patient Death Distribution of Dataset")
plt.ylabel("Count")
fig06.set_size_inches(17, 10)
plt.show()fig07, ax07 = plt.subplots()
ax07.boxplot(df['ejection_fraction'], vert = False, flierprops = dict(markerfacecolor='b', marker='D'), labels = ['Patients'])
plt.title("fig.7 Ejection Fraction of Patients")
plt.xlabel("Ejection Fraction (%)")
fig07.set_size_inches(17, 10)
plt.show()fig08, ax08 = plt.subplots()
ax08.boxplot(df['platelets'], vert = False, flierprops = dict(markerfacecolor='b', marker='D'), labels = ['Patients'])
plt.title("fig.8 Platelet Count of Patients")
plt.xlabel("kiloplatelets/mL")
fig08.set_size_inches(17, 10)
plt.show()fig09, ax09 = plt.subplots()
ax09.boxplot(df['time'], vert = False, flierprops = dict(markerfacecolor='b', marker='D'), labels = ['Patients'])
plt.title("fig.9 Follow-Up Period of Patients")
plt.xlabel("Days")
fig09.set_size_inches(17, 10)
plt.show()fig10, ax10 = plt.subplots()
ax10.bar(df['smoking'].unique(), df['smoking'].value_counts(sort = False))
plt.title("fig.10 Smoking Patients Distribution of Dataset")
plt.ylabel("Count")
fig10.set_size_inches(17, 10)
plt.show()df['smoking'].value_counts(sort = False)non-smoker 162
smoker 78
Name: smoking, dtype: int64
print('Percentage of smokers: ', round(df[df['smoking'] == 'smoker'].shape[0] / df.shape[0], 2) * 100,'%.')Percentage of smokers: 33.0 %.
Another huge cause for concern: there are significantly more heart failure patients who smoke than those who don't.
We will proceed with this task beginning by studying the relationship between creatinine phosphokinase (CPK) and serum creatinine.
The normal level of CPK in the blood is 10-120 mcg/L. {mountsinai.org}
fig11a, ax11a = plt.subplots()
ax11a.scatter(df[df['DEATH_EVENT'] == 'alive']['creatinine_phosphokinase'],
df[df['DEATH_EVENT'] == 'alive']['serum_creatinine'],
label = "Alive")
ax11a.scatter(df[df['DEATH_EVENT'] == 'dead']['creatinine_phosphokinase'],
df[df['DEATH_EVENT'] == 'dead']['serum_creatinine'],
label = "Dead")
plt.title("fig.11a Creatinine Presence in Blood of Alive/Deceased Patients")
plt.xlabel("CPK (Creatinine Phosphokinase) in blood (mcg/L)")
plt.ylabel("Serum Creatinine (mg/dL)")
ax11a.legend()
fig11a.set_size_inches(17, 10)
plt.show()fig11b, ax11b = plt.subplots()
ax11b.scatter(df[df['DEATH_EVENT'] == 'alive']['creatinine_phosphokinase'],
df[df['DEATH_EVENT'] == 'alive']['serum_creatinine'],
label = "Alive")
ax11b.scatter(df[df['DEATH_EVENT'] == 'dead']['creatinine_phosphokinase'],
df[df['DEATH_EVENT'] == 'dead']['serum_creatinine'],
label = "Dead")
# Display max normal CPK level dashed line:
plt.axvline(x=120, color='grey', linestyle='--')
plt.title("fig.11b Creatinine Presence in Blood of Alive/Deceased Patients (log(CPK))")
plt.xlabel("CPK (Creatinine Phosphokinase) in blood (mcg/L)")
plt.ylabel("Serum Creatinine (mg/dL)")
ax11b.legend()
fig11b.set_size_inches(17, 10)
ax11b.set_xscale('log')
plt.show()At a glance, it can be seen that the levels of CPK and serum creatinine are inversely related. That is, as the amount of CPK in the blood increases, the amount of Serum Creatine decreases, and vice versa.
normal = df[df['creatinine_phosphokinase'] <= 120]
normal[normal['DEATH_EVENT'] == 'alive'].shape[0]58
fig11c, ax11c = plt.subplots()
normal = df[df['creatinine_phosphokinase'] <= 120]
abnormal = df[df['creatinine_phosphokinase'] > 120]
labels = ['Normal CPK Levels', 'Abnormal CPK Levels']
alive = [normal[normal['DEATH_EVENT'] == 'alive'].shape[0], abnormal[abnormal['DEATH_EVENT'] == 'alive'].shape[0]]
deceased = [normal[normal['DEATH_EVENT'] == 'dead'].shape[0], abnormal[abnormal['DEATH_EVENT'] == 'dead'].shape[0]]
x = np.arange(len(labels))
width = 0.35
rects1 = ax11c.bar(x - width/2, alive, width, label='Alive')
rects2 = ax11c.bar(x + width/2, deceased, width, label='Deceased')
ax11c.set_ylabel('Count')
ax11c.set_title('fig.11c Number of Patients with Normal and Abnormal CPK Levels')
plt.xticks(x)
ax11c.set_xticklabels(labels)
ax11c.legend()
fig11c.set_size_inches(17, 10)
plt.show()print("Mortality increases from ",
round(normal[normal['DEATH_EVENT'] == 'dead'].shape[0] / normal.shape[0], 2) * 100,
"% to ",
round(abnormal[abnormal['DEATH_EVENT'] == 'dead'].shape[0] / abnormal.shape[0], 2) * 100,
"%.")Mortality increases from 25.0 % to 34.0 %.
The target value(s) for this dataset is the 'DEATH_EVENT' feature.
We will be using the random state of '63'.
# Initiallising modelling and target DataFrames
model_df = df_before_pp.loc[:, df.columns != 'DEATH_EVENT']
target_df = df_before_pp['DEATH_EVENT']model_df.shape(240, 12)
target_df.shape(240,)
# Import ML and analysis modules
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report
# Splitting data 90-10: 90% for training, 10% for testing
# With a small dataset, we require as much training as possible
x_train, x_test, y_train, y_test = train_test_split(model_df, target_df, test_size = 0.1, random_state = 63)
x_train, x_test, y_train, y_test( age anaemia creatinine_phosphokinase diabetes ejection_fraction \
186 65 0 56 0 25
155 60 1 104 1 30
90 45 0 292 1 35
190 40 1 129 0 35
235 62 0 61 1 38
.. ... ... ... ... ...
82 65 1 305 0 25
215 65 1 258 1 25
139 59 0 66 1 20
116 46 1 291 0 35
44 70 1 75 0 35
high_blood_pressure platelets serum_creatinine serum_sodium sex \
186 0 237.0 5.0 130 0
155 0 389.0 1.5 136 1
90 0 850.0 1.3 142 1
190 0 255.0 0.9 137 1
235 1 155.0 1.1 143 1
.. ... ... ... ... ...
82 0 298.0 1.1 141 1
215 0 198.0 1.4 129 1
139 0 70.0 2.4 134 1
116 0 348.0 0.9 140 0
44 0 223.0 2.7 138 1
smoking time
186 0 207
155 0 171
90 1 88
190 0 209
235 1 270
.. ... ...
82 0 87
215 0 235
139 0 135
116 0 109
44 1 54
[216 rows x 12 columns],
age anaemia creatinine_phosphokinase diabetes ejection_fraction \
52 42 1 250 1 15
154 58 1 145 0 25
191 53 1 707 0 38
145 60 1 95 0 60
68 51 0 78 0 50
146 69 0 1419 0 40
230 90 1 337 0 38
161 73 1 231 1 30
63 55 0 336 0 45
107 53 1 270 1 35
10 50 1 168 0 38
173 49 0 972 1 35
233 52 0 190 1 38
76 58 0 144 1 38
51 60 0 68 0 20
72 57 1 115 0 25
194 70 0 232 0 30
89 63 0 936 0 38
93 50 0 369 1 25
130 50 0 250 0 25
163 65 0 167 0 30
138 53 0 196 0 60
42 60 1 260 1 38
59 60 1 47 0 20
high_blood_pressure platelets serum_creatinine serum_sodium sex \
52 0 213.0 1.30 136 0
154 0 219.0 1.20 137 1
191 0 330.0 1.40 137 1
145 0 337.0 1.00 138 1
68 0 406.0 0.70 140 1
146 0 105.0 1.00 135 1
230 0 390.0 0.90 144 0
161 0 160.0 1.18 142 1
63 1 324.0 0.90 140 0
107 0 227.0 3.40 145 1
10 1 276.0 1.10 137 1
173 1 268.0 0.80 130 0
233 0 382.0 1.00 140 1
76 1 327.0 0.70 142 0
51 0 119.0 2.90 127 1
72 1 181.0 1.10 144 1
194 0 173.0 1.20 132 1
89 0 304.0 1.10 133 1
93 0 252.0 1.60 136 1
130 0 262.0 1.00 136 1
163 0 259.0 0.80 138 0
138 0 220.0 0.70 133 1
42 0 255.0 2.20 132 0
59 0 204.0 0.70 139 1
smoking time
52 0 65
154 1 170
191 1 209
145 1 146
68 0 79
146 1 147
230 0 256
161 1 180
63 0 74
107 0 105
10 0 11
173 0 187
233 1 258
76 0 83
51 1 64
72 0 79
194 0 210
89 1 88
93 0 90
130 1 120
163 0 186
138 1 134
42 1 45
59 1 73 ,
186 alive
155 dead
90 alive
190 alive
235 alive
...
82 alive
215 dead
139 dead
116 alive
44 alive
Name: DEATH_EVENT, Length: 216, dtype: object,
52 dead
154 dead
191 alive
145 alive
68 alive
146 alive
230 alive
161 alive
63 alive
107 alive
10 dead
173 alive
233 alive
76 alive
51 dead
72 alive
194 alive
89 alive
93 alive
130 alive
163 alive
138 alive
42 dead
59 dead
Name: DEATH_EVENT, dtype: object)
x_train.shape, x_test.shape((216, 12), (24, 12))
y_train.shape, y_test.shape((216,), (24,))
# This loop manually finds an acceptable Nearest Neighbor value
for i in range(1,20):
# Iterate NN & fit
clf = KNeighborsClassifier(i)
clf.fit(x_train, y_train)
# pred --> predicted result from test
pred = clf.predict(x_test)
pred, y_test
# result --> DataFrame containing the predicted vs actual result
result = pd.DataFrame(data = {
'Predicted': pred,
'Actual': y_test
})
# Primary concern: True Negatives (correct death guesses)
# There are unfortunately no cases where the death of a patient
# was predicted successfully more than 3 times.
if result[(result['Predicted'] == 'dead') & (result['Actual'] == 'dead')].shape[0] > 3:
print('Nearest Neighbor: ', i, 'Correct Guesses: ', result[result['Predicted'] == result['Actual']].shape[0], ' Correct Death Guesses: ', result[(result['Predicted'] == 'dead') & (result['Actual'] == 'dead')].shape[0])Nearest Neighbor: 13 Correct Guesses: 21 Correct Death Guesses: 4
Nearest Neighbor: 17 Correct Guesses: 21 Correct Death Guesses: 4
Nearest Neighbor of 13 or 17 is ideal, as it returns an acceptable prediction. We will use 13.
# Set nearest neighbor value
clf = KNeighborsClassifier(13)
# Fit prediction
clf.fit(x_train, y_train)
pred = clf.predict(x_test)
# Confusion Matirx
cm = confusion_matrix(y_test, pred)
cmarray([[17, 1],
[ 2, 4]], dtype=int64)
# Classification report
print(classification_report(y_test, pred)) precision recall f1-score support
alive 0.89 0.94 0.92 18
dead 0.80 0.67 0.73 6
accuracy 0.88 24
macro avg 0.85 0.81 0.82 24
weighted avg 0.87 0.88 0.87 24












