3 changes: 3 additions & 0 deletions .idea/.gitignore


8 changes: 8 additions & 0 deletions .idea/Project_2.iml


6 changes: 6 additions & 0 deletions .idea/inspectionProfiles/profiles_settings.xml


4 changes: 4 additions & 0 deletions .idea/misc.xml


8 changes: 8 additions & 0 deletions .idea/modules.xml


282 changes: 282 additions & 0 deletions Project2.ipynb


127 changes: 109 additions & 18 deletions README.md
@@ -1,29 +1,120 @@
# Project 2

Select one of the following two options:

## Boosting Trees
## TEAM NAME
ILYA

Implement the gradient-boosting tree algorithm (with the usual fit-predict interface) as described in Sections 10.9-10.10 of Elements of Statistical Learning (2nd Edition). Answer the questions below as you did for Project 1.
## TEAM
- Rakesh Reddy - A20525389 (worked on the model outputs and test file, set up the file structure, handled the README format and content, and managed project delivery)
- Geeta Hade - A20580824 (did the initial math and research on how to build the models; contributed README content)
- Nishant Khandhar - A20581012 (built the code for the k-fold method, resolved an issue we hit with it, and created the visualizations)
- Amogh Vastrad - A20588808 (oversaw and researched model performance and wrote the code for the notebook file)

Put your README below. Answer the following questions.
## Project Overview

* What does the model you have implemented do and when should it be used?
* How did you test your model to determine if it is working reasonably correctly?
* What parameters have you exposed to users of your implementation in order to tune performance? (Also perhaps provide some basic usage examples.)
* Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?
In this project, we have implemented generic k-fold cross-validation and bootstrapping methods for model selection, which estimate how well a model is likely to perform on new data. We have also included the Akaike Information Criterion (AIC) for model comparison. The framework evaluates the performance of regression models and compares these resampling estimates against the AIC.

## Model Selection

Implement generic k-fold cross-validation and bootstrapping model selection methods.
#### Prerequisites
- Install Python and the required libraries: `numpy`, `matplotlib`, and `scikit-learn`.
- Open your terminal or command prompt.
- Navigate to your project folder where the `requirements.txt` file is located.
- Run the following command to install all the libraries listed in `requirements.txt`:

In your README, answer the following questions:

* Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
* In what cases might the methods you've written fail or give incorrect or undesirable results?
* What could you implement given more time to mitigate these cases or help users of your methods?
* What parameters have you exposed to your users in order to use your model selectors?
```bash
pip install -r requirements.txt
```
- Requires Python >= 3.9.

See sections 7.10-7.11 of Elements of Statistical Learning and the lecture notes. Pay particular attention to Section 7.10.2.

As usual, above-and-beyond efforts will be considered for bonus points.

## How to run the code

### Using a Notebook File


#### Steps to Execute
1. Open the terminal at the project folder.
2. Run the command: `jupyter notebook Project2.ipynb`
3. Run all cells sequentially.
4. View the model evaluation outputs (e.g., MSE scores, AIC values) and plots for insights into model performance and residuals.
5. The notebook also produces the visualizations shown below.

![alt text](image-1.png)

![alt text](image-2.png)

### Using the Python Class



#### Steps to Execute
1. Go to the `src` folder.
2. Execute the command: `python .\pythonModelTest.py`

![alt text](image.png)

## Outputs

1. **Cross-Validation:**
- MSE Scores: [0.2473, 0.2475, 0.2448, 0.2401, 0.2326]
- Average MSE: 0.2425

2. **Bootstrapping:**
- MSE Scores (first 5): [0.2408, 0.2316, 0.2475, 0.2507, 0.2412]
- Average MSE: 0.2397

3. **AIC:**
- Value: -6247.93 (a computation sketch follows this list)

4. **Visualizations:**
- Residual distribution plot.
- Error comparison across methods.
- Bootstrapping confidence intervals.
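
For context, here is a minimal sketch of how an AIC value like the one above can be computed for a least-squares fit under the usual Gaussian-error assumption. This is the standard textbook formula (with additive constants dropped), not necessarily the exact code used in the project:

```python
import numpy as np

def aic_gaussian(y_true, y_pred, num_params):
    """AIC for a least-squares fit with Gaussian errors: n * ln(RSS / n) + 2k."""
    n = len(y_true)
    rss = np.sum((y_true - y_pred) ** 2)  # residual sum of squares
    return n * np.log(rss / n) + 2 * num_params
```

A negative AIC such as -6247.93 is not unusual: it simply means the average squared residual is below 1, so the log term is negative. Only differences in AIC between candidate models matter, not the sign.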

## Difference Between Bootstrapping and K-Fold Cross-Validation

### Bootstrapping
Bootstrapping is a powerful resampling method where we repeatedly draw samples from our dataset with replacement. Each bootstrap sample is used to train the model, and its performance is evaluated on the out-of-bag rows (the observations not drawn into that sample). By repeating this process many times (e.g., 100 iterations, the project default), we average the results to get a reliable estimate of the model's performance. A minimal sketch of this idea follows.
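
The sketch below illustrates the out-of-bag evaluation loop. It assumes `model` is any estimator with the usual `fit`/`predict` interface and is not necessarily the project's exact implementation:

```python
import numpy as np

def bootstrap_mse(model, X, y, num_iterations=100, seed=42):
    """Estimate test MSE by resampling rows with replacement and scoring on the out-of-bag rows."""
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = []
    for _ in range(num_iterations):
        idx = rng.integers(0, n, size=n)       # draw n rows with replacement
        oob = np.setdiff1d(np.arange(n), idx)  # out-of-bag rows left out of this sample
        if len(oob) == 0:                      # extremely unlikely, but guard anyway
            continue
        model.fit(X[idx], y[idx])
        scores.append(np.mean((y[oob] - model.predict(X[oob])) ** 2))
    return np.array(scores)
```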

### K-Fold Cross-Validation
K-Fold Cross-Validation is a technique where we split the dataset into `k` equal-sized folds. The model is trained on `k-1` folds and tested on the remaining fold. This process is repeated `k` times, so that each fold is used exactly once as the test set. Averaging the `k` results gives a more reliable performance estimate than a single train/test split. A minimal sketch follows.
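
As with bootstrapping, here is a minimal generic sketch of the procedure (any `fit`/`predict` estimator works; the project's own code may differ in details):

```python
import numpy as np

def kfold_mse(model, X, y, k=5, seed=42):
    """Estimate test MSE by averaging over k held-out folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))  # shuffle row indices once, then split
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train], y[train])
        scores.append(np.mean((y[test] - model.predict(X[test])) ** 2))
    return np.array(scores)
```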






## Key Questions and Answers

1. **Do Cross-Validation and Bootstrapping agree with AIC?**
- Yes, in our project, we found that all three methods generally agree. Both cross-validation and bootstrapping yield similarly low average MSE scores (0.2425 and 0.2397, respectively) for the chosen model, and the AIC value of -6247.93 supports this by indicating a good model fit.

2. **In what cases might these methods fail?**
- **Cross-Validation:** May not perform well with very small datasets where the training data in each fold is insufficient to build a reliable model.
- **Bootstrapping:** Assumes that the data points are independently and identically distributed (i.i.d.); it may not be suitable for datasets with dependencies, such as time series data.
- **AIC:** Assumes that the residuals of the model are normally distributed; it may provide misleading results for models with non-linear relationships or non-normal residuals.

3. **How can we address these challenges?**
- For imbalanced datasets, consider using stratified k-fold cross-validation to ensure each fold has a representative distribution of the target variable.
- For datasets with dependencies, such as time series data, use block bootstrapping to maintain the inherent structure and relationships within the data.
- To improve the reliability of AIC, especially for small sample sizes, use corrected versions like AICc (Akaike Information Criterion corrected).

4. **What parameters can you customize in our project?**
- **Cross-Validation:** You can set the number of folds (`k`), with a default value of 5.
- **Bootstrapping:** You can specify the number of iterations (`num_iterations`), which defaults to 100, and set a seed for reproducibility.
- **Regression Models:** You can adjust the regularization parameter (`alpha`) for Ridge regression, with a default value of 1.0. A usage sketch follows this list.
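
To illustrate how these parameters fit together, here is a hypothetical usage sketch. It reuses the `kfold_mse` and `bootstrap_mse` helpers sketched earlier in this README; the actual class and function names in `src/pythonModelTest.py` may differ:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Small synthetic dataset for demonstration purposes only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)

model = Ridge(alpha=1.0)  # regularization strength (default 1.0)
print("CV MSE:", kfold_mse(model, X, y, k=5).mean())  # k folds (default 5)
print("Bootstrap MSE:", bootstrap_mse(model, X, y, num_iterations=100, seed=42).mean())  # iterations + seed
```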

## Additional Features
- Synthetic data generator for flexible testing (a sketch of the idea follows this list).
- Residual analysis and visualization tools for insights.
- Extendable framework for adding new model selectors.
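
As an illustration of the kind of synthetic regression data such a generator can produce, here is a minimal sketch (the project's actual generator may use different parameters):

```python
import numpy as np

def make_synthetic_data(n_samples=1000, n_features=5, noise=0.5, seed=42):
    """Generate a simple linear-regression dataset: y = X @ w + Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))
    w = rng.normal(size=n_features)  # true coefficients
    y = X @ w + rng.normal(scale=noise, size=n_samples)
    return X, y
```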

## Compatibility
- Tested on Windows
- Requires `numpy`, `matplotlib`, `jupyter`, and `scikit-learn`

## More data
- See the `data` folder for additional plots and results.
101 changes: 101 additions & 0 deletions data/bootstrap_results.csv
@@ -0,0 +1,101 @@
Bootstrap R^2,Bootstrap MAE,Bootstrap MSE
0.9991926631507461,1.5559609032253738,3.8154968965175544
0.9991941097989808,1.5570881569520012,3.808659995841913
0.9991905617647121,1.5620862597054892,3.8254281066418216
0.9991897289030771,1.5595151918472738,3.8293642344011176
0.9991911439556526,1.558498454036697,3.822676655709695
0.9991900395206639,1.5649364589673702,3.8278962468573927
0.9991928211178499,1.5583275209068779,3.8147503394950912
0.999191828935048,1.5571344864238779,3.819439423617633
0.9991923931227429,1.557577928408874,3.816773056535697
0.9991903703422763,1.5617699459168035,3.826332774514347
0.9991907291771531,1.5630784232780783,3.8246369106872407
0.9991866373199884,1.5606432017102552,3.843975143950856
0.9991897975864277,1.5580396509041157,3.829039634934189
0.9991896819944064,1.5592877515213486,3.8295859261123657
0.999190330982975,1.5552007415130549,3.8265187876909086
0.9991869669053676,1.5645980152329266,3.8424175140808763
0.999192998325308,1.55908343351559,3.8139128520116974
0.9991903568924005,1.5587913088233658,3.8263963390093667
0.9991907486771314,1.5636324125597667,3.824544753235753
0.9991916563448066,1.558463378715684,3.8202550900043453
0.9991888504321821,1.5588667408746362,3.8335159128200553
0.9991914803732785,1.556453177736431,3.8210867364469943
0.9991930735084311,1.555852363322278,3.813557534434316
0.9991938505176341,1.558985306781061,3.8098853668561463
0.9991881103385182,1.5668349469064593,3.837013616511463
0.999183042799408,1.5699507255947016,3.860962950380935
0.9991907192419778,1.561860543072286,3.824683864608038
0.9991882706376851,1.5627295755821506,3.8362560380920048
0.9991894051772129,1.5643990886614607,3.830894172036893
0.9991900977655868,1.5567653577996403,3.8276209796962295
0.9991910414400481,1.5622588347176676,3.8231611473700466
0.9991910496436559,1.562579302847192,3.8231223768862463
0.9991927222256474,1.5602218788672608,3.8152177068550506
0.9991927518210861,1.5563821677026215,3.8150778379700934
0.9991907016369463,1.5619242089952567,3.8247670664871816
0.9991896383485277,1.5628964865467605,3.8297921977757876
0.9991889018516065,1.5562006275125055,3.8332729031586554
0.999191128092668,1.5583747336401037,3.822751624625809
0.9991889178862602,1.5658306454228643,3.8331971229293518
0.9991911383186185,1.5597377625413358,3.822703296492906
0.9991918362263862,1.557979490569291,3.81940496454599
0.9991862754043846,1.5625369199437626,3.8456855673807286
0.9991926336810646,1.5600408214395807,3.815636171069774
0.9991873386552133,1.565718549446457,3.840660613743814
0.9991813123531637,1.5743638167250789,3.869141088516039
0.9991890105244606,1.5622424999351268,3.832759312161178
0.9991914674485032,1.559646735472396,3.82114781930274
0.9991884770081165,1.5655222925280652,3.8352807255675114
0.9991896991548623,1.5610023960441808,3.82950482530977
0.9991909971013361,1.5625810213290656,3.8233706933835157
0.9991890770723222,1.5613958463076585,3.8324448050759274
0.9991919926270098,1.5639074756737172,3.8186658107538087
0.9991898164301484,1.558192006146937,3.8289505789744416
0.9991876193121146,1.5685421313041834,3.839334221251023
0.9991910355225598,1.5588748361598657,3.8231891135877167
0.9991934222389026,1.5587722397104415,3.811909424314283
0.9991922888921104,1.5602429975475147,3.817265653467282
0.9991911407358479,1.5590656022606586,3.8226918725977623
0.9991926618901131,1.559129473851755,3.815502854304773
0.9991851669515363,1.5692264886498593,3.8509241470475972
0.9991905442007555,1.5600511395530308,3.8255111143998355
0.9991925747155832,1.5570849496654027,3.81591484361078
0.9991849730386437,1.5682866287510164,3.8518405848891635
0.9991909217763177,1.5577459597423977,3.823726681561072
0.9991891819648242,1.5604091214240816,3.831949080130287
0.9991909271629988,1.5607459554238463,3.8237012239530426
0.9991922410905385,1.5576199261274624,3.817491565054786
0.9991920965697888,1.5592393739649937,3.8181745742257434
0.9991914655400295,1.5603729612623665,3.821156838804136
0.9991944283531847,1.5571180712211172,3.80715449962018
0.999191261080024,1.561239488407396,3.822123122601127
0.9991926159639264,1.5610969037345404,3.8157199027683335
0.9991913435835446,1.5607628541195295,3.821733208618275
0.9991873777888308,1.5617804942616058,3.840475667155034
0.99919155915556,1.559979354513557,3.8207144091455447
0.9991912630141409,1.5568318427214374,3.822113981910038
0.9991935150667014,1.5576844514875798,3.8114707174988443
0.9991898976297084,1.558070139361835,3.8285668275456084
0.9991922710215164,1.559736166167988,3.8173501104022862
0.9991922702817089,1.5626523569893456,3.817353606754052
0.9991901240278128,1.5588139716839955,3.8274968636696913
0.9991934409112616,1.5604623288522128,3.8118211782149864
0.9991926898639851,1.5595777766960843,3.815370648990614
0.9991844330118544,1.567920307224953,3.8543927668448794
0.9991932144963368,1.557496812820303,3.812891221584708
0.9991917134674377,1.5567336783968853,3.8199851268254514
0.9991890066526996,1.5603319251560341,3.83277761021376
0.9991889793380094,1.5646794684839243,3.832906700216244
0.9991862268950137,1.5642539564071207,3.845914824046804
0.9991875362452204,1.5646380973337448,3.8397267977539746
0.9991933572472644,1.5578480782503734,3.812216576643571
0.9991898258303568,1.5650537823670698,3.828906153322433
0.9991907786739209,1.5596782402302933,3.8244029875556453
0.9991894349410514,1.562821145965995,3.830753507289892
0.9991925384762144,1.5563961407207274,3.8160861118980667
0.9991919304245226,1.5615159053287093,3.818959781476203
0.9991876730223185,1.564511279561738,3.839080385301033
0.9991840232228562,1.5668678070175237,3.8563294412978486
0.9991912877000044,1.5580539602942705,3.821997315816789
0.9991911278113924,1.5577427025924737,3.8227529539417655