* catalyst EDW instructions

Taylor Miller · Taylor Miller · commit 84b0bcba68ca · 2017-06-02T14:29:40.000-06:00
* improved getting started, prediction types
* clarification in data pipeline
diff --git a/docs/catalyst_edw_instructions.md b/docs/catalyst_edw_instructions.md
@@ -0,0 +1,43 @@
+# Health Catalyst EDW Instructions
+
+Many of our users work within the Health Catalyst ecosystem, which is heavily based on MSSQL. This document outlines ways to use healthcare.ai in these settings beyond what is covered in the [getting started](getting_started.md) docs.
+
+## Preparing Your SAM
+
+- If you plan on deploying a model to an MSSQL server (i.e., pushing predictions to SQL Server), you will need to set up your tables to receive predictions.
+
+```sql
+CREATE TABLE [SAM].[dbo].[HCAIPredictionClassificationBASE] (
+ [BindingID] [int] ,
+ [BindingNM] [varchar] (255),
+ [LastLoadDTS] [datetime2] (7),
+ [PatientEncounterID] [decimal] (38, 0), --< change to your grain col
+ [PredictedProbNBR] [decimal] (38, 2),
+ [Factor1TXT] [varchar] (255),
+ [Factor2TXT] [varchar] (255),
+ [Factor3TXT] [varchar] (255))
+
+CREATE TABLE [SAM].[dbo].[HCAIPredictionRegressionBASE] (
+ [BindingID] [int],
+ [BindingNM] [varchar] (255),
+ [LastLoadDTS] [datetime2] (7),
+ [PatientEncounterID] [decimal] (38, 0), --< change to your grain col
+ [PredictedValueNBR] [decimal] (38, 2),
+ [Factor1TXT] [varchar] (255),
+ [Factor2TXT] [varchar] (255),
+ [Factor3TXT] [varchar] (255))
+```
+
+## Writing New Predictions to the SAM
+
+When you pass the `.predict_to_catalyst_sam()` method a raw prediction dataframe and your database info, the TrainedSupervisedModel generates predictions with binding IDs, the grain column, and factors, and writes them to your database.
+
+```python
+# This output is a Health Catalyst EDW specific dataframe that includes the grain column, the prediction and factors
+server = 'localhost'
+database = 'SAM'
+table = 'HCAIPredictionRegressionBASE'
+schema = 'dbo'
+
+trained_model.predict_to_catalyst_sam(prediction_dataframe, server, database, table, schema)
+```
diff --git a/docs/getting_started.md b/docs/getting_started.md
@@ -1,6 +1,6 @@
-# Getting started with healthcare.ai
+# Getting Started With Healthcare.ai
 
-## What can you do with these tools?
+## What Can You Do With These Tools?
 
 - Fill in missing data via imputation
 - Train and compare models based on your data
@@ -67,38 +67,17 @@ To verify that *healthcareai* installed correctly:
 
 If you did get an error, or run into other installation issues, please [let us know](http://healthcare.ai/contact.html) or better yet post on [Stack Overflow](http://stackoverflow.com/questions/tagged/healthcare-ai) (with the healthcare-ai tag) so we can help others along this process.
 
-## Getting started
-
-- Read through the docs on this site
-- Starting with 
-- Modify the queries and parameters to match your data
-- If you plan on deploying a model to a MSSQL server (ie, pushing predictions to SQL Server), run this in SSMS beforehand:
-
-```sql
-CREATE TABLE [SAM].[dbo].[HCAIPredictionClassificationBASE] (
- [BindingID] [int] ,
- [BindingNM] [varchar] (255),
- [LastLoadDTS] [datetime2] (7),
- [PatientEncounterID] [decimal] (38, 0), --< change to your grain col
- [PredictedProbNBR] [decimal] (38, 2),
- [Factor1TXT] [varchar] (255),
- [Factor2TXT] [varchar] (255),
- [Factor3TXT] [varchar] (255))
-
-CREATE TABLE [SAM].[dbo].[HCAIPredictionRegressionBASE] (
- [BindingID] [int],
- [BindingNM] [varchar] (255),
- [LastLoadDTS] [datetime2] (7),
- [PatientEncounterID] [decimal] (38, 0), --< change to your grain col
- [PredictedValueNBR] [decimal] (38, 2),
- [Factor1TXT] [varchar] (255),
- [Factor2TXT] [varchar] (255),
- [Factor3TXT] [varchar] (255))
-```
-
-- Note that there are examples that write to other databases (MySQL, SQLite)
-
-## For Issues
+## Getting Started
+
+1. Read through the docs on this site.
+2. Start with either `example_regression_1.py` or `example_classification_1.py`.
+3. Modify the queries and parameters to match your data.
+4. Decide on what kind of prediction output you want.
+5. Set up your database tables to match the output schema. See the [prediction types](prediction_types.md) document for details.
+    - If you are working in a Health Catalyst EDW ecosystem (primarily MSSQL), please see the [Catalyst EDW Instructions](catalyst_edw_instructions.md) for SAM setup.
+    - Please see the [databases docs](databases.md) for details about writing to different databases (MSSQL, MySQL, SQLite, CSV).
+
+## Where to Get Help
 
 - Double check that the code follows the examples in these documents.
 - If you're still seeing an error, file an issue on [Stack Overflow](http://stackoverflow.com/) using the healthcare-ai tag. Please provide
diff --git a/docs/prediction_types.md b/docs/prediction_types.md
@@ -4,9 +4,17 @@ Healthcareai provides a few options when you want to get predictions from a trai
 
 Please note that you will likely only need one of these prediction output types.
 
+## Database Setup
+
+Each prediction type has a different set of columns and types. You will need to set up your database tables to receive these with appropriate data types.
+
+An easy way to understand each of the prediction types is to inspect the `.dtypes` property of each returned dataframe. For example: `print(predictions.dtypes)`.
+
+## Prediction Types
+
 Each prediction output format is detailed below.
 
-## Predictions Only
+### Predictions Only
 
 By passing the `.make_predictions(prediction_dataframe)` method a raw prediction dataframe you'll get back a dataframe containing the grain id and predicted values.
 
@@ -16,7 +24,7 @@ predictions = trained_model.make_predictions(prediction_dataframe)
 print(predictions.head())
 ```
 
-## Important Factors
+### Important Factors
 
 By passing the `.make_factors(prediction_dataframe)` method a raw prediction dataframe you'll get back a dataframe containing the grain id and top predictive factors.
 
@@ -26,7 +34,7 @@ factors = trained_model.make_factors(prediction_dataframe)
 print(factors.head())
 ```
 
-## Predictions + Factors
+### Predictions + Factors
 
 By passing the `.make_predictions_with_k_factors(prediction_dataframe)` method a raw prediction dataframe you'll get back a dataframe containing the grain id and predicted values, and top factors.
 
@@ -36,7 +44,7 @@ predictions_with_factors_df = trained_model.make_predictions_with_k_factors(pred
 print(predictions_with_factors_df.head())
 ```
 
-## Original Dataframe + Predictions + Factors
+### Original Dataframe + Predictions + Factors
 
 By passing the `.make_original_with_predictions_and_factors(prediction_dataframe)` method a raw prediction dataframe you'll get back a dataframe containing all the original data, the predicted values, and top factors.
 
@@ -47,9 +55,16 @@ original_plus_predictions_and_factors = trained_model.make_original_with_predict
 print(original_plus_predictions_and_factors.head())
 ```
 
+### Health Catalyst EDW Format
 
+Many of our users work within the Health Catalyst ecosystem, and most have standardized on a table format that others may find useful. Please note that if you do intend to use this specific format, there is an easier and more robust way to save it to your database, outlined in the [Health Catalyst EDW Instructions](catalyst_edw_instructions.md).
 
+By passing the `.create_catalyst_dataframe(prediction_dataframe)` method a raw prediction dataframe you'll get back a dataframe containing all the original data, the predicted values, and top factors.
 
-
-
+```python
+## Health Catalyst EDW specific instructions. Uncomment to use.
+# This output is a Health Catalyst EDW specific dataframe that includes the grain column, the prediction and factors
+catalyst_dataframe = trained_model.create_catalyst_dataframe(prediction_dataframe)
+print(catalyst_dataframe.head())
+```
 
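The `.dtypes` inspection recommended in the new Database Setup section of `prediction_types.md` might look like the sketch below. The dataframe is a hand-built stand-in for a returned prediction dataframe, and both the column names and the dtype-to-SQL mapping are illustrative assumptions rather than anything healthcareai provides.

```python
import pandas as pd

# Hypothetical stand-in for a dataframe returned by one of the prediction
# methods; the column names here are assumptions for illustration only.
predictions = pd.DataFrame({
    "PatientEncounterID": [1001, 1002, 1003],
    "PredictedValueNBR": [0.12, 0.87, 0.45],
    "Factor1TXT": ["age", "bmi", "a1c"],
})

# Inspect the column types before designing the destination table
print(predictions.dtypes)

# Illustrative (assumed) mapping from numpy dtype kinds to SQL Server types;
# anything that is not an integer or float falls back to varchar(255).
SQL_TYPE_BY_KIND = {"i": "decimal(38, 0)", "f": "decimal(38, 2)"}

column_types = {
    name: SQL_TYPE_BY_KIND.get(dtype.kind, "varchar(255)")
    for name, dtype in predictions.dtypes.items()
}

for name, sql_type in column_types.items():
    print(f"[{name}] {sql_type}")
```

The resulting mapping is only a starting point for writing a `CREATE TABLE` statement; precision, scale, and string lengths should be checked against your own data.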
diff --git a/healthcareai/pipelines/data_preparation.py b/healthcareai/pipelines/data_preparation.py
@@ -24,7 +24,7 @@ def full_pipeline(model_type, predicted_column, grain_column, impute=True):
         ('null_row_filter', hcai_filters.DataframeNullValueFilter(excluded_columns=None)),
         ('convert_target_to_binary', hcai_transformers.DataFrameConvertTargetToBinary(model_type, predicted_column)),
         ('prediction_to_numeric', hcai_transformers.DataFrameConvertColumnToNumeric(predicted_column)),
-        ('create_dummy_variables', hcai_transformers.DataFrameCreateDummyVariables([predicted_column])),
+        ('create_dummy_variables', hcai_transformers.DataFrameCreateDummyVariables(excluded_columns=[predicted_column])),
     ])
     return pipeline
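
The final hunk replaces a positional list with the `excluded_columns=` keyword. A minimal sketch of why that reads more clearly, using a hypothetical `DummyVariableStep` class (an assumption for illustration, not the healthcareai transformer, whose signature is not shown in this diff):

```python
class DummyVariableStep:
    """Hypothetical stand-in for a pipeline step; not the healthcareai class."""

    def __init__(self, excluded_columns=None):
        # Columns that should be left alone when creating dummy variables
        self.excluded_columns = list(excluded_columns or [])

    def columns_to_encode(self, columns):
        # Keep every column that is not explicitly excluded
        return [c for c in columns if c not in self.excluded_columns]


columns = ["gender", "readmitted", "region"]

# Positional call: the reader has to look up what the list means
positional = DummyVariableStep(["readmitted"])

# Keyword call: the intent (skip these columns) is visible at the call site
keyword = DummyVariableStep(excluded_columns=["readmitted"])

# Both calls configure the step identically
assert positional.columns_to_encode(columns) == keyword.columns_to_encode(columns)
print(keyword.columns_to_encode(columns))  # ['gender', 'region']
```

In this sketch the behavior is unchanged; the keyword form simply documents the argument's role and stays correct if parameters are later reordered.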