Commit edf08d6
Merge pull request #24 from sintel-dev/revised-api
New Zephyr API
- One `Zephyr` class encapsulates the entire predictive engineering workflow and stores user state.
- Each workflow step is wrapped by `GuideHandler.guide_step`, which manages the user's flow of steps and helps ensure that the actual Zephyr state remains consistent.
- Provides helpful logging so that users can understand which steps to perform to make progress.
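The step-gating behavior described in the commit message can be sketched as a small decorator that records completed steps, refuses to run a step before its prerequisite, and logs progress. This is a toy, stdlib-only reconstruction of the idea; the names `guide_step` and `ToyZephyr` below are illustrative and do not reproduce the actual `GuideHandler` implementation:

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("guide")


def guide_step(requires=None):
    """Toy guided-step wrapper: each step checks that its prerequisite
    step has already run, then records itself as completed."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            if requires and requires not in self._done:
                raise RuntimeError(
                    f"Run {requires}() before {fn.__name__}()")
            result = fn(self, *args, **kwargs)
            self._done.add(fn.__name__)
            logger.info("finished step %s", fn.__name__)
            return result
        return wrapper
    return decorator


class ToyZephyr:
    """Minimal stand-in for the workflow object: stores state between steps."""

    def __init__(self):
        self._done = set()
        self.entityset = None
        self.label_times = None

    @guide_step()
    def generate_entityset(self, data):
        self.entityset = data
        return self.entityset

    @guide_step(requires="generate_entityset")
    def generate_label_times(self):
        self.label_times = ["t0", "t1"]
        return self.label_times
```

Calling `generate_label_times()` on a fresh object raises immediately with a message telling the user which step to run first, which is the kind of guidance the commit describes.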
2 parents 40e6c7f + 154e280 commit edf08d6

36 files changed: +3844 −1078 lines

README.md

Lines changed: 88 additions & 94 deletions
@@ -13,26 +13,26 @@
 
 A machine learning library for assisting in the generation of machine learning problems for wind farms operations data by analyzing past occurrences of events.
 
-| Important Links                     |                                                                      |
-| ----------------------------------- | -------------------------------------------------------------------- |
-| :computer: **[Website]**            | Check out the Sintel Website for more information about the project. |
-| :book: **[Documentation]**          | Quickstarts, User and Development Guides, and API Reference.         |
-| :star: **[Tutorials]**              | Checkout our notebooks                                               |
-| :octocat: **[Repository]**          | The link to the Github Repository of this library.                   |
-| :scroll: **[License]**              | The repository is published under the MIT License.                   |
-| :keyboard: **[Development Status]** | This software is in its Pre-Alpha stage.                             |
-| ![][Slack Logo] **[Community]**     | Join our Slack Workspace for announcements and discussions.          |
-
-[Website]: https://sintel.dev/
-[Documentation]: https://dtail.gitbook.io/zephyr/
-[Repository]: https://github.com/sintel-dev/Zephyr
-[Tutorials]: https://github.com/sintel-dev/Zephyr/blob/master/notebooks
-[License]: https://github.com/sintel-dev/Zephyr/blob/master/LICENSE
-[Development Status]: https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha
-[Community]: https://join.slack.com/t/sintel-space/shared_invite/zt-q147oimb-4HcphcxPfDAM0O9_4PaUtw
-[Slack Logo]: https://github.com/sintel-dev/Orion/blob/master/docs/images/slack.png
-
-- Homepage: https://github.com/signals-dev/zephyr
+| Important Links                     |                                                                      |
+| ----------------------------------- | -------------------------------------------------------------------- |
+| :computer: **[Website]**            | Check out the Sintel Website for more information about the project. |
+| :book: **[Documentation]**          | Quickstarts, User and Development Guides, and API Reference.         |
+| :star: **[Tutorials]**              | Checkout our notebooks                                               |
+| :octocat: **[Repository]**          | The link to the Github Repository of this library.                   |
+| :scroll: **[License]**              | The repository is published under the MIT License.                   |
+| :keyboard: **[Development Status]** | This software is in its Pre-Alpha stage.                             |
+| ![][Slack Logo] **[Community]**     | Join our Slack Workspace for announcements and discussions.          |
+
+[Website]: https://sintel.dev/
+[Documentation]: https://dtail.gitbook.io/zephyr/
+[Repository]: https://github.com/sintel-dev/Zephyr
+[Tutorials]: https://github.com/sintel-dev/Zephyr/blob/master/notebooks
+[License]: https://github.com/sintel-dev/Zephyr/blob/master/LICENSE
+[Development Status]: https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha
+[Community]: https://join.slack.com/t/sintel-space/shared_invite/zt-q147oimb-4HcphcxPfDAM0O9_4PaUtw
+[Slack Logo]: https://github.com/sintel-dev/Orion/blob/master/docs/images/slack.png
+
+- Homepage: https://github.com/signals-dev/zephyr
 
 # Overview
 
@@ -42,26 +42,25 @@ occurrences of events.
 
 The main features of **Zephyr** are:
 
-* **EntitySet creation**: tools designed to represent wind farm data and the relationship
-  between different tables. We have functions to create EntitySets for datasets with PI data
-  and datasets using SCADA data.
-* **Labeling Functions**: a collection of functions, as well as tools to create custom versions
-  of them, ready to be used to analyze past operations data in the search for occurrences of
-  specific types of events in the past.
-* **Prediction Engineering**: a flexible framework designed to apply labeling functions on
-  wind turbine operations data in a number of different ways to create labels for custom
-  Machine Learning problems.
-* **Feature Engineering**: a guide to using Featuretools to apply automated feature engineerinig
-  to wind farm data.
+- **EntitySet creation**: tools designed to represent wind farm data and the relationship
+  between different tables. We have functions to create EntitySets for datasets with PI data
+  and datasets using SCADA data.
+- **Labeling Functions**: a collection of functions, as well as tools to create custom versions
+  of them, ready to be used to analyze past operations data in the search for occurrences of
+  specific types of events in the past.
+- **Prediction Engineering**: a flexible framework designed to apply labeling functions on
+  wind turbine operations data in a number of different ways to create labels for custom
+  Machine Learning problems.
+- **Feature Engineering**: a guide to using Featuretools to apply automated feature engineering
+  to wind farm data.
 
 # Install
 
 ## Requirements
 
 **Zephyr** has been developed and runs on Python 3.8, 3.9, 3.10, 3.11 and 3.12.
 
-Also, although it is not strictly required, the usage of a [virtualenv](
-https://virtualenv.pypa.io/en/latest/) is highly recommended in order to avoid interfering
+Also, although it is not strictly required, the usage of a [virtualenv](https://virtualenv.pypa.io/en/latest/) is highly recommended in order to avoid interfering
 with other software installed in the system where you are trying to run **Zephyr**.
 
 ## Download and Install
@@ -79,35 +78,38 @@ If you want to install from source or contribute to the project please read the
 
 # Quickstart
 
 In this short tutorial we will guide you through a series of steps that will help you
-getting started with **Zephyr**.
+get started with **Zephyr**. For more detailed examples, please refer to the tutorial notebooks in the `notebooks` directory:
+
+- `feature_engineering.ipynb`: Learn how to create EntitySets and perform feature engineering
+- `modeling.ipynb`: Learn how to train and evaluate models
+- `visualization.ipynb`: Learn how to visualize your data and results
 
 ## 1. Loading the data
 
-The first step will be to use preprocessed data to create an EntitySet. Depending on the
-type of data, we will use either the `zephyr_ml.create_pidata_entityset` or `zephyr_ml.create_scada_entityset`
-functions.
+The first step will be to use preprocessed data to create an EntitySet. Depending on the
+type of data, we will call the `generate_entityset` function with `es_type="pidata"`, `es_type="scada"`, or `es_type="vibrations"`.
 
 **NOTE**: if you cloned the **Zephyr** repository, you will find some demo data inside the
-`notebooks/data` folder which has been preprocessed to fit the `create_entityset` data
-requirements.
+`notebooks/data` folder which has been preprocessed to fit the data requirements.
 
-```python3
+```python
 import os
 import pandas as pd
-from zephyr_ml import create_scada_entityset
+from zephyr_ml import Zephyr
 
 data_path = 'notebooks/data'
 
 data = {
-    'turbines': pd.read_csv(os.path.join(data_path, 'turbines.csv')),
-    'alarms': pd.read_csv(os.path.join(data_path, 'alarms.csv')),
-    'work_orders': pd.read_csv(os.path.join(data_path, 'work_orders.csv')),
-    'stoppages': pd.read_csv(os.path.join(data_path, 'stoppages.csv')),
-    'notifications': pd.read_csv(os.path.join(data_path, 'notifications.csv')),
-    'scada': pd.read_csv(os.path.join(data_path, 'scada.csv'))
+    'turbines': pd.read_csv(os.path.join(data_path, 'turbines.csv')),
+    'alarms': pd.read_csv(os.path.join(data_path, 'alarms.csv')),
+    'work_orders': pd.read_csv(os.path.join(data_path, 'work_orders.csv')),
+    'stoppages': pd.read_csv(os.path.join(data_path, 'stoppages.csv')),
+    'notifications': pd.read_csv(os.path.join(data_path, 'notifications.csv')),
+    'scada': pd.read_csv(os.path.join(data_path, 'scada.csv'))
 }
 
-scada_es = create_scada_entityset(data)
+zephyr = Zephyr()
+scada_es = zephyr.generate_entityset(data, es_type="scada")
 ```
 
 This will load the turbine, alarms, stoppages, work order, notifications, and SCADA data, and return it
@@ -132,15 +134,10 @@ Entityset: SCADA data
 
 ## 2. Selecting a Labeling Function
 
-The second step will be to choose an adequate **Labeling Function**.
-
-We can see the list of available labeling functions using the `zephyr_ml.labeling.get_labeling_functions`
-function.
-
-```python3
-from zephyr_ml import labeling
+The second step will be to choose an adequate **Labeling Function**. We can see the list of available labeling functions using the `GET_LABELING_FUNCTIONS` method.
 
-labeling.get_labeling_functions()
+```python
+labeling_functions = zephyr.GET_LABELING_FUNCTIONS()
 ```
 
 This will return us a dictionary with the name and a short description of each available
@@ -158,14 +155,14 @@ amount of power lost over a slice of time.
 ## 3. Generate Target Times
 
 Once we have loaded the data and the Labeling Function, we are ready to start using
-the `zephyr_ml.generate_labels` function to generate a Target Times table.
+the `generate_label_times` function to generate a Target Times table.
 
-
-```python3
-from zephyr_ml import DataLabeler
-
-data_labeler = DataLabeler(labeling.labeling_functions.total_power_loss)
-target_times, metadata = data_labeler.generate_label_times(scada_es)
+```python
+target_times, metadata = zephyr.generate_label_times(
+    labeling_fn="total_power_loss",  # or any other labeling function name
+    num_samples=10,
+    gap="20d"
+)
 ```
 
 This will return us a `compose.LabelTimes` containing the three columns required to start
@@ -177,66 +174,63 @@ working on a Machine Learning problem: the turbine ID (COD_ELEMENT), the cutoff
 ```
 
 ## 4. Feature Engineering
-Using EntitySets and LabelTimes allows us to easily use Featuretools for automatic feature generation.
 
-```python3
-import featuretools as ft
+Using EntitySets and LabelTimes allows us to easily use Featuretools for automatic feature generation.
 
-feature_matrix, features = ft.dfs(
-    entityset=scada_es,
-    target_dataframe_name='turbines',
+```python
+feature_matrix, features, _ = zephyr.generate_feature_matrix(
+    target_dataframe_name="turbines",
     cutoff_time_in_index=True,
-    cutoff_time=target_times,
-    max_features=20
+    agg_primitives=["count", "sum", "max"],
+    max_features=20,
+    verbose=True
 )
 ```
 
 Then we get a list of features and the computed `feature_matrix`.
 
 ```
 TURBINE_PI_ID TURBINE_LOCAL_ID TURBINE_SAP_COD DES_CORE_ELEMENT SITE DES_CORE_PLANT ... MODE(alarms.COD_STATUS) MODE(alarms.DES_NAME) MODE(alarms.DES_TITLE) NUM_UNIQUE(alarms.COD_ALARM) NUM_UNIQUE(alarms.COD_ALARM_INT) label
 COD_ELEMENT time ...
 0 2022-01-01 TA00 A0 LOC000 T00 LOCATION LOC ... Alarm1 Alarm1 Description of alarm 1 1 1 45801.0
 
 [1 rows x 21 columns]
 ```
 
 ## 5. Modeling
 
-Once we have the feature matrix, we can train a model using the Zephyr interface where you can train, infer, and evaluate a pipeline.
-First, we need to prepare our dataset for training by creating ``X`` and ``y`` variables and one-hot encoding features.
+Once we have the feature matrix, we can train a model using the Zephyr interface. First, we need to prepare our dataset for training by creating a train-test split.
 
-```python3
-y = list(feature_matrix.pop('label'))
-X = pd.get_dummies(feature_matrix).values
+```python
+X_train, X_test, y_train, y_test = zephyr.generate_train_test_split(
+    test_size=0.2,
+    random_state=42
+)
 ```
 
-In this example, we will use an 'xgb' regression pipeline to predict total power loss.
-
-```python3
-from zephyr_ml import Zephyr
-
-pipeline_name = 'xgb_regressor'
-
-zephyr = Zephyr(pipeline_name)
+In this example, we will use an 'xgb' regression pipeline to predict total power loss. To train the pipeline, we simply call the `fit_pipeline` method.
+
+```python
+zephyr.fit_pipeline(
+    pipeline="xgb_regressor",
+    pipeline_hyperparameters=None
+)
 ```
 
-To train the pipeline, we simply use the `fit` function.
-```python3
-zephyr.fit(X, y)
+After it finishes training, we can make predictions using `predict`:
+
+```python
+y_pred = zephyr.predict(X_test)
 ```
 
-After it finished training, we can make prediciton using `predict`
+We can also use `evaluate` to obtain the performance of the pipeline.
 
-```python3
-y_pred = zephyr.predict(X)
+```python
+results = zephyr.evaluate()
 ```
 
-We can also use ``zephyr.evaluate`` to obtain the performance of the pipeline.
-
 # What's Next?
 
 If you want to continue learning about **Zephyr** and all its
-features please have a look at the tutorials found inside the [notebooks folder](
-https://github.com/signals-dev/zephyr/tree/main/notebooks).
+features, please have a look at the tutorials found inside the [notebooks folder](https://github.com/signals-dev/zephyr/tree/main/notebooks).
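The labeling workflow in the quickstart (pick a labeling function, then generate target times over fixed windows) can be illustrated with a toy, stdlib-only sketch. The real `generate_label_times` is backed by `composeml` and the zephyr_ml labeling functions, so the names and logic below are a simplified stand-in, not the library's implementation:

```python
from datetime import datetime, timedelta


def total_power_loss(records, start, end):
    """Toy labeling function: sum the power lost in [start, end)."""
    return sum(r["power_loss"] for r in records
               if start <= r["timestamp"] < end)


def toy_label_times(records, labeling_fn, window=timedelta(days=20),
                    num_samples=2, start=None):
    """Slide fixed windows over the data and compute one label per window,
    mimicking the (cutoff_time, label) rows of a target-times table."""
    start = start or min(r["timestamp"] for r in records)
    rows = []
    for i in range(num_samples):
        lo = start + i * window
        hi = lo + window
        rows.append({"cutoff_time": lo, "label": labeling_fn(records, lo, hi)})
    return rows


records = [
    {"timestamp": datetime(2022, 1, 1), "power_loss": 10.0},
    {"timestamp": datetime(2022, 1, 5), "power_loss": 5.0},
    {"timestamp": datetime(2022, 1, 25), "power_loss": 7.0},
]
target_times = toy_label_times(records, total_power_loss)
```

Each row pairs a cutoff time with the label computed over the following window, which is the shape of the table the quickstart's `generate_label_times` step produces.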

demo.py

Lines changed: 17 additions & 0 deletions

@@ -0,0 +1,17 @@
+from os import path
+import pandas as pd
+from zephyr_ml import create_scada_entityset
+
+data_path = "notebooks/data"
+
+data = {
+    "turbines": pd.read_csv(path.join(data_path, "turbines.csv")),
+    "alarms": pd.read_csv(path.join(data_path, "alarms.csv")),
+    "work_orders": pd.read_csv(path.join(data_path, "work_orders.csv")),
+    "stoppages": pd.read_csv(path.join(data_path, "stoppages.csv")),
+    "notifications": pd.read_csv(path.join(data_path, "notifications.csv")),
+    "scada": pd.read_csv(path.join(data_path, "scada.csv")),
+}
+scada_es = create_scada_entityset(data)
+
+print(scada_es)
