feature_selection.py
import streamlit as st
import pandas as pd
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler # for standardizing the Data
from sklearn.decomposition import PCA # for PCA calculation
# from statsmodels.tsa.stattools import grangercausalitytests
def return_feature_selection():
    """Render the correlation and PCA feature-selection page in Streamlit."""
    st.title('Create Correlation plots')
    st.markdown("""
    Correlation is a statistical measure of how closely two variables follow a linear relationship.
    Strongly correlated variables carry redundant information, since measuring one already tells you a lot about the other.
    In other words, if two variables have a high correlation, we can often drop one of the two!
    """)
    # Load the iris measurements and present them as a minute-by-minute time series.
    iris_correlation = pd.read_csv('data/iris.csv')
    iris_correlation.insert(0, 'Time', pd.date_range(start='1/1/2018', periods=iris_correlation.shape[0], freq='min'))
    iris_correlation.columns = ['Time', 'temperature1', 'temperature2', 'temperature3', 'temperature4']
    st.table(iris_correlation.head(5).style.format(precision=2)
             .set_table_styles([
                 {"selector": "caption",
                  "props": [("text-align", "center")],
                  }
             ], overwrite=False)
             .set_caption('Table 1.'))
    st.markdown("""
    Here you can see a correlation table, where values close to 1 or -1 mean two variables are strongly correlated and values close to 0 mean they are not.
    If you want to test this on your own data, try out the [Data Analytics tool](https://github.com/cslab-hub/DataAnalytics_Diplast)!
    """)
    # Correlate only the measurements; the Time column is a timestamp, not a numeric feature.
    corr = iris_correlation.drop(columns='Time').corr().round(2)
    st.table(corr.style.background_gradient(cmap='coolwarm')
             .format(precision=2)
             .set_table_styles([
                 {"selector": "caption",
                  "props": [("text-align", "center")],
                  }
             ], overwrite=False)
             .set_caption('Table 2.'))
    st.title('PCA Analysis')
    st.markdown('''
    Another technique to reduce the dimensionality of your dataset is Principal Component Analysis (PCA).
    PCA combines a large set of variables into a smaller set of uncorrelated components that retain as much of the original variance as possible.
    PCA dates back to the early 1900s and is still one of the most widely used techniques in Data Science for variable reduction.
    ''')
    df = pd.read_csv('data/Turbine_Data.csv', parse_dates=["Unnamed: 0"])
    # The timestamp column is unnamed in the CSV; copy it to a descriptive name.
    df['DateTime'] = df['Unnamed: 0']
    df.drop('Unnamed: 0', axis=1, inplace=True)
    df['DateTime'] = pd.to_datetime(df['DateTime'],
                                    format='%Y-%m-%dT%H:%M:%SZ',
                                    errors='coerce')
    # Keep only the rows from position 33065 onward.
    df = df.iloc[33065:]
    # Drop the columns that are not used in this analysis.
    df = df.drop(columns=['Blade1PitchAngle', 'Blade2PitchAngle', 'Blade3PitchAngle',
                          'WTG', 'ControlBoxTemperature'])
    # Fill missing values with the mean of each numeric column.
    df = df.fillna(df.mean(numeric_only=True))
    all_values = df.copy()
    features = ['WindSpeed', 'RotorRPM', 'ReactivePower', 'GeneratorWinding1Temperature',
                'GeneratorWinding2Temperature', 'GeneratorRPM', 'GearboxBearingTemperature', 'GearboxOilTemperature']
    # Separate data into Y and X (y is the target and is not used by the PCA below)
    y = all_values['ActivePower']
    X = all_values[features]
    st.dataframe(X.head(40))
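    sc = StandardScaler()  # creating a StandardScaler object
    X_std = sc.fit_transform(X)  # standardizing the data

    def pcaplotter(fitted_pca):
        """Plot the cumulative explained variance against the number of components."""
        cumulative = np.cumsum(fitted_pca.explained_variance_ratio_)
        component_counts = np.arange(1, len(cumulative) + 1)  # 1-based, to match the axis label
        fig, ax = plt.subplots(figsize=(8, 3))
        ax.plot(component_counts, cumulative)
        ax.set_xticks(component_counts)
        ax.set_xlabel('number of components')
        ax.set_ylabel('cumulative explained variance')
        return fig

    aim_target = 0.98  # fraction of the variance that the kept components must explain
    pca = PCA(n_components=aim_target)
    X_pca = pca.fit_transform(X_std)  # this will fit and reduce dimensions
    st.pyplot(pcaplotter(pca))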
    n_pcs = pca.n_components_  # number of components that were kept
    # get the index of the feature with the largest absolute loading on EACH component
    most_important = [np.abs(pca.components_[i]).argmax() for i in range(n_pcs)]
    initial_feature_names = X.columns
    # get the most important feature names
    most_important_names = [initial_feature_names[idx] for idx in most_important]
    # remove duplicates while preserving the component order
    most_important_names = list(dict.fromkeys(most_important_names))
    for rank, name in enumerate(most_important_names, start=1):
        st.write(f"Most important variable {rank}: {name}")
    st.markdown("#### Explanation")
    st.markdown("""
    We can interpret the most important variables in the following way:
    The first principal component, which is dominated by the GeneratorWinding1Temperature variable, already explains around 87% of the variance in the dataset.
    Adding the second component, which is dominated by ReactivePower, raises this to around 94%.
    Again, this analysis can be performed with your own dataset in the Data Analytics tool!
    """)