3 changes: 3 additions & 0 deletions .idea/.gitignore


8 changes: 8 additions & 0 deletions .idea/Project_2.iml


6 changes: 6 additions & 0 deletions .idea/inspectionProfiles/profiles_settings.xml


4 changes: 4 additions & 0 deletions .idea/misc.xml


8 changes: 8 additions & 0 deletions .idea/modules.xml


282 changes: 282 additions & 0 deletions Project2.ipynb


127 changes: 109 additions & 18 deletions README.md
@@ -1,29 +1,120 @@
# Project 2

Select one of the following two options:

## Boosting Trees
## TEAM NAME
ILYA

Implement the gradient-boosting tree algorithm (with the usual fit-predict interface) as described in Sections 10.9-10.10 of Elements of Statistical Learning (2nd Edition). Answer the questions below as you did for Project 1.
## TEAM
- Rakesh Reddy - A20525389 (worked on the model outputs and test file, set up the file structure, handled the README format and content, and managed project delivery)
- Geeta Hade - A20580824 (did the initial math and research on how to build the models; contributed README content)
- Nishant Khandhar - A20581012 (built the code for the k-fold method, resolved an issue we hit with it, and created the visualizations)
- Amogh Vastrad - A20588808 (oversaw and researched model performance and wrote the code for the notebook file)

Put your README below. Answer the following questions.
## Project Overview

* What does the model you have implemented do and when should it be used?
* How did you test your model to determine if it is working reasonably correctly?
* What parameters have you exposed to users of your implementation in order to tune performance? (Also perhaps provide some basic usage examples.)
* Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?
In this project, we have implemented generic k-fold cross-validation and bootstrapping methods for model selection, which estimate how well a model is likely to perform on new data. We have also included the Akaike Information Criterion (AIC) for model comparison. The framework evaluates the performance of regression models and compares these resampling estimates against the AIC.

## Model Selection

Implement generic k-fold cross-validation and bootstrapping model selection methods.
#### Prerequisites
- Install Python and the required libraries: `numpy`, `matplotlib`, and `scikit-learn`.
- Open your terminal or command prompt.
- Navigate to your project folder where the `requirements.txt` file is located.
- Run the following command to install all the libraries listed in `requirements.txt`:

In your README, answer the following questions:

* Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
* In what cases might the methods you've written fail or give incorrect or undesirable results?
* What could you implement given more time to mitigate these cases or help users of your methods?
* What parameters have you exposed to your users in order to use your model selectors?
```bash
pip install -r requirements.txt
```
- Requires Python >= 3.9.

See sections 7.10-7.11 of Elements of Statistical Learning and the lecture notes. Pay particular attention to Section 7.10.2.

As usual, above-and-beyond efforts will be considered for bonus points.

## How to run the code

### Using a Notebook File


#### Steps to Execute
1. Open the terminal at the project folder.
2. Run the command: `jupyter notebook Project2.ipynb`
3. Run all cells sequentially.
4. View the model evaluation outputs (e.g., MSE scores, AIC values) and plots for insights into model performance and residuals.
5. The notebook also produces the visualizations shown below.

![alt text](image-1.png)

![alt text](image-2.png)

### Using the Python Class



#### Steps to Execute
1. Go to the `src` folder.
2. Execute the command: `python .\pythonModelTest.py`

![alt text](image.png)

## Outputs

1. **Cross-Validation:**
- MSE Scores: [0.2473, 0.2475, 0.2448, 0.2401, 0.2326]
- Average MSE: 0.2425

2. **Bootstrapping:**
- MSE Scores (first 5): [0.2408, 0.2316, 0.2475, 0.2507, 0.2412]
- Average MSE: 0.2397

3. **AIC:**
- Value: -6247.93 (a computation sketch follows this list)

4. **Visualizations:**
- Residual distribution plot.
- Error comparison across methods.
- Bootstrapping confidence intervals.
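
For context, here is a minimal sketch of how an AIC value like the one above can be computed for a least-squares fit under the usual Gaussian-error assumption. This is the standard textbook formula (with additive constants dropped), not necessarily the exact code used in the project:

```python
import numpy as np

def aic_gaussian(y_true, y_pred, num_params):
    """AIC for a least-squares fit with Gaussian errors: n * ln(RSS / n) + 2k."""
    n = len(y_true)
    rss = np.sum((y_true - y_pred) ** 2)  # residual sum of squares
    return n * np.log(rss / n) + 2 * num_params
```

A negative AIC such as -6247.93 is not unusual: it simply means the average squared residual is below 1, so the log term is negative. Only differences in AIC between candidate models matter, not the sign.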

## Difference Between Bootstrapping and K-Fold Cross-Validation

### Bootstrapping
Bootstrapping is a powerful resampling method where we repeatedly draw samples from our dataset with replacement. Each bootstrap sample is used to train the model, and its performance is evaluated on the out-of-bag rows (the observations not drawn into that sample). By repeating this process many times (e.g., 100 iterations, the project default), we average the results to get a reliable estimate of the model's performance. A minimal sketch of this idea follows.
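
The sketch below illustrates the out-of-bag evaluation loop. It assumes `model` is any estimator with the usual `fit`/`predict` interface and is not necessarily the project's exact implementation:

```python
import numpy as np

def bootstrap_mse(model, X, y, num_iterations=100, seed=42):
    """Estimate test MSE by resampling rows with replacement and scoring on the out-of-bag rows."""
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = []
    for _ in range(num_iterations):
        idx = rng.integers(0, n, size=n)       # draw n rows with replacement
        oob = np.setdiff1d(np.arange(n), idx)  # out-of-bag rows left out of this sample
        if len(oob) == 0:                      # extremely unlikely, but guard anyway
            continue
        model.fit(X[idx], y[idx])
        scores.append(np.mean((y[oob] - model.predict(X[oob])) ** 2))
    return np.array(scores)
```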

### K-Fold Cross-Validation
K-Fold Cross-Validation is a technique where we split the dataset into `k` equal-sized folds. The model is trained on `k-1` folds and tested on the remaining fold. This process is repeated `k` times, so that each fold is used exactly once as the test set. Averaging the `k` results gives a more reliable performance estimate than a single train/test split. A minimal sketch follows.
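
As with bootstrapping, here is a minimal generic sketch of the procedure (any `fit`/`predict` estimator works; the project's own code may differ in details):

```python
import numpy as np

def kfold_mse(model, X, y, k=5, seed=42):
    """Estimate test MSE by averaging over k held-out folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))  # shuffle row indices once, then split
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train], y[train])
        scores.append(np.mean((y[test] - model.predict(X[test])) ** 2))
    return np.array(scores)
```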






## Key Questions and Answers

1. **Do Cross-Validation and Bootstrapping agree with AIC?**
- Yes, in our project, we found that all three methods generally agree. Both cross-validation and bootstrapping yield similarly low average MSE scores (0.2425 and 0.2397, respectively) for the chosen model, and the AIC value of -6247.93 supports this by indicating a good model fit.

2. **In what cases might these methods fail?**
- **Cross-Validation:** May not perform well with very small datasets where the training data in each fold is insufficient to build a reliable model.
- **Bootstrapping:** Assumes that the data points are independently and identically distributed (i.i.d.); it may not be suitable for datasets with dependencies, such as time series data.
- **AIC:** Assumes that the residuals of the model are normally distributed; it may provide misleading results for models with non-linear relationships or non-normal residuals.

3. **How can we address these challenges?**
- For imbalanced datasets, consider using stratified k-fold cross-validation to ensure each fold has a representative distribution of the target variable.
- For datasets with dependencies, such as time series data, use block bootstrapping to maintain the inherent structure and relationships within the data.
- To improve the reliability of AIC, especially for small sample sizes, use corrected versions like AICc (Akaike Information Criterion corrected).

4. **What parameters can you customize in our project?**
- **Cross-Validation:** You can set the number of folds (`k`), with a default value of 5.
- **Bootstrapping:** You can specify the number of iterations (`num_iterations`), which defaults to 100, and set a seed for reproducibility.
- **Regression Models:** You can adjust the regularization parameter (`alpha`) for Ridge regression, with a default value of 1.0. A usage sketch follows this list.
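
To illustrate how these parameters fit together, here is a hypothetical usage sketch. It reuses the `kfold_mse` and `bootstrap_mse` helpers sketched earlier in this README; the actual class and function names in `src/pythonModelTest.py` may differ:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Small synthetic dataset for demonstration purposes only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)

model = Ridge(alpha=1.0)  # regularization strength (default 1.0)
print("CV MSE:", kfold_mse(model, X, y, k=5).mean())  # k folds (default 5)
print("Bootstrap MSE:", bootstrap_mse(model, X, y, num_iterations=100, seed=42).mean())  # iterations + seed
```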

## Additional Features
- Synthetic data generator for flexible testing (a sketch of the idea follows this list).
- Residual analysis and visualization tools for insights.
- Extendable framework for adding new model selectors.
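
As an illustration of the kind of synthetic regression data such a generator can produce, here is a minimal sketch (the project's actual generator may use different parameters):

```python
import numpy as np

def make_synthetic_data(n_samples=1000, n_features=5, noise=0.5, seed=42):
    """Generate a simple linear-regression dataset: y = X @ w + Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))
    w = rng.normal(size=n_features)  # true coefficients
    y = X @ w + rng.normal(scale=noise, size=n_samples)
    return X, y
```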

## Compatibility
- Tested on Windows
- Requires `numpy`, `matplotlib`, `jupyter`, and `scikit-learn`

## More data
- See the `data` folder for additional plots and results.
101 changes: 101 additions & 0 deletions data/bootstrap_results.csv
@@ -0,0 +1,101 @@
Bootstrap R^2,Bootstrap MAE,Bootstrap MSE
0.9991926631507461,1.5559609032253738,3.8154968965175544
0.9991941097989808,1.5570881569520012,3.808659995841913
0.9991905617647121,1.5620862597054892,3.8254281066418216
0.9991897289030771,1.5595151918472738,3.8293642344011176
0.9991911439556526,1.558498454036697,3.822676655709695
0.9991900395206639,1.5649364589673702,3.8278962468573927
0.9991928211178499,1.5583275209068779,3.8147503394950912
0.999191828935048,1.5571344864238779,3.819439423617633
0.9991923931227429,1.557577928408874,3.816773056535697
0.9991903703422763,1.5617699459168035,3.826332774514347
0.9991907291771531,1.5630784232780783,3.8246369106872407
0.9991866373199884,1.5606432017102552,3.843975143950856
0.9991897975864277,1.5580396509041157,3.829039634934189
0.9991896819944064,1.5592877515213486,3.8295859261123657
0.999190330982975,1.5552007415130549,3.8265187876909086
0.9991869669053676,1.5645980152329266,3.8424175140808763
0.999192998325308,1.55908343351559,3.8139128520116974
0.9991903568924005,1.5587913088233658,3.8263963390093667
0.9991907486771314,1.5636324125597667,3.824544753235753
0.9991916563448066,1.558463378715684,3.8202550900043453
0.9991888504321821,1.5588667408746362,3.8335159128200553
0.9991914803732785,1.556453177736431,3.8210867364469943
0.9991930735084311,1.555852363322278,3.813557534434316
0.9991938505176341,1.558985306781061,3.8098853668561463
0.9991881103385182,1.5668349469064593,3.837013616511463
0.999183042799408,1.5699507255947016,3.860962950380935
0.9991907192419778,1.561860543072286,3.824683864608038
0.9991882706376851,1.5627295755821506,3.8362560380920048
0.9991894051772129,1.5643990886614607,3.830894172036893
0.9991900977655868,1.5567653577996403,3.8276209796962295
0.9991910414400481,1.5622588347176676,3.8231611473700466
0.9991910496436559,1.562579302847192,3.8231223768862463
0.9991927222256474,1.5602218788672608,3.8152177068550506
0.9991927518210861,1.5563821677026215,3.8150778379700934
0.9991907016369463,1.5619242089952567,3.8247670664871816
0.9991896383485277,1.5628964865467605,3.8297921977757876
0.9991889018516065,1.5562006275125055,3.8332729031586554
0.999191128092668,1.5583747336401037,3.822751624625809
0.9991889178862602,1.5658306454228643,3.8331971229293518
0.9991911383186185,1.5597377625413358,3.822703296492906
0.9991918362263862,1.557979490569291,3.81940496454599
0.9991862754043846,1.5625369199437626,3.8456855673807286
0.9991926336810646,1.5600408214395807,3.815636171069774
0.9991873386552133,1.565718549446457,3.840660613743814
0.9991813123531637,1.5743638167250789,3.869141088516039
0.9991890105244606,1.5622424999351268,3.832759312161178
0.9991914674485032,1.559646735472396,3.82114781930274
0.9991884770081165,1.5655222925280652,3.8352807255675114
0.9991896991548623,1.5610023960441808,3.82950482530977
0.9991909971013361,1.5625810213290656,3.8233706933835157
0.9991890770723222,1.5613958463076585,3.8324448050759274
0.9991919926270098,1.5639074756737172,3.8186658107538087
0.9991898164301484,1.558192006146937,3.8289505789744416
0.9991876193121146,1.5685421313041834,3.839334221251023
0.9991910355225598,1.5588748361598657,3.8231891135877167
0.9991934222389026,1.5587722397104415,3.811909424314283
0.9991922888921104,1.5602429975475147,3.817265653467282
0.9991911407358479,1.5590656022606586,3.8226918725977623
0.9991926618901131,1.559129473851755,3.815502854304773
0.9991851669515363,1.5692264886498593,3.8509241470475972
0.9991905442007555,1.5600511395530308,3.8255111143998355
0.9991925747155832,1.5570849496654027,3.81591484361078
0.9991849730386437,1.5682866287510164,3.8518405848891635
0.9991909217763177,1.5577459597423977,3.823726681561072
0.9991891819648242,1.5604091214240816,3.831949080130287
0.9991909271629988,1.5607459554238463,3.8237012239530426
0.9991922410905385,1.5576199261274624,3.817491565054786
0.9991920965697888,1.5592393739649937,3.8181745742257434
0.9991914655400295,1.5603729612623665,3.821156838804136
0.9991944283531847,1.5571180712211172,3.80715449962018
0.999191261080024,1.561239488407396,3.822123122601127
0.9991926159639264,1.5610969037345404,3.8157199027683335
0.9991913435835446,1.5607628541195295,3.821733208618275
0.9991873777888308,1.5617804942616058,3.840475667155034
0.99919155915556,1.559979354513557,3.8207144091455447
0.9991912630141409,1.5568318427214374,3.822113981910038
0.9991935150667014,1.5576844514875798,3.8114707174988443
0.9991898976297084,1.558070139361835,3.8285668275456084
0.9991922710215164,1.559736166167988,3.8173501104022862
0.9991922702817089,1.5626523569893456,3.817353606754052
0.9991901240278128,1.5588139716839955,3.8274968636696913
0.9991934409112616,1.5604623288522128,3.8118211782149864
0.9991926898639851,1.5595777766960843,3.815370648990614
0.9991844330118544,1.567920307224953,3.8543927668448794
0.9991932144963368,1.557496812820303,3.812891221584708
0.9991917134674377,1.5567336783968853,3.8199851268254514
0.9991890066526996,1.5603319251560341,3.83277761021376
0.9991889793380094,1.5646794684839243,3.832906700216244
0.9991862268950137,1.5642539564071207,3.845914824046804
0.9991875362452204,1.5646380973337448,3.8397267977539746
0.9991933572472644,1.5578480782503734,3.812216576643571
0.9991898258303568,1.5650537823670698,3.828906153322433
0.9991907786739209,1.5596782402302933,3.8244029875556453
0.9991894349410514,1.562821145965995,3.830753507289892
0.9991925384762144,1.5563961407207274,3.8160861118980667
0.9991919304245226,1.5615159053287093,3.818959781476203
0.9991876730223185,1.564511279561738,3.839080385301033
0.9991840232228562,1.5668678070175237,3.8563294412978486
0.9991912877000044,1.5580539602942705,3.821997315816789
0.9991911278113924,1.5577427025924737,3.8227529539417655