
# Glossary — Machine Learning Terms

Terms and definitions used across this project's codebase and notebooks.

| Term | Meaning |
|------|---------|
| Feature | An input variable used by the model to make predictions (the columns in `X`). |
| Target | The variable the model tries to predict (`y`). Also called the label or dependent variable. |
| Feature Engineering | Creating new features from existing data to improve model performance (e.g. `FamilySize`, `IsAlone`, `AgeBin`). |
| Train/Test Split | Dividing a dataset into a training set (to learn from) and a test set (to evaluate on). Typical ratio: 75–80% train, 20–25% test. |
| Overfitting | When a model memorizes the training data instead of learning generalizable patterns — it performs well on training data but poorly on unseen data. |
| MinMaxScaler | Normalization technique that rescales each feature to the [0, 1] range. Important when features have different scales, especially for distance-based models such as KNN. |
| Confusion Matrix | A table comparing predicted vs. actual classifications. Shows true positives, true negatives, false positives, and false negatives. |
| Accuracy Score | Ratio of correct predictions to total predictions. Used for classification tasks. |
| R² Score | Regression metric (coefficient of determination) measuring how well predictions match actual values. Best possible score is 1.0; it can be negative for models worse than predicting the mean. |
| Cross-Validation (CV) | Technique that splits data into multiple folds for training and validation, reducing evaluation bias. `cv=5` means 5-fold cross-validation. |
| Hyperparameter Tuning | Searching for the best model configuration parameters (e.g. `n_neighbors`, `learning_rate`, `max_iter`) rather than learning them from the data. |
| GridSearchCV | Exhaustive search over a specified parameter grid, combined with cross-validation, to find optimal hyperparameters. |
| KNeighborsClassifier (KNN) | Classification algorithm that predicts based on the k closest data points in the feature space. |
| LinearRegression | Model fitting a linear relationship between features and target by minimizing the residual sum of squares. |
| PolynomialFeatures | Transforms features by generating polynomial combinations (e.g. with degree 2, 8 features expand to 45 terms), capturing non-linear relationships. |
| HistGradientBoostingRegressor | Histogram-optimized gradient boosting regression model. Builds trees sequentially, each correcting the previous one's errors. |
| hgbr | Common abbreviation for `HistGradientBoostingRegressor` (see above). Used as a variable name in code for brevity. |
| RandomForestRegressor | Ensemble model that trains many decision trees on random subsets of the data and averages their predictions. |
| joblib | Library used to serialize (save) and deserialize (load) trained models to/from disk (`.pkl`, `.joblib` files). |
| random_state | Seed for reproducibility — ensures the same random splits and results across runs. |
| DataFrame | Pandas tabular data structure (rows and columns) used throughout for data manipulation and preprocessing. |
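To make the *train/test split* and *random_state* entries concrete, here is a minimal pure-Python sketch of the idea (not scikit-learn's `train_test_split` implementation — just an illustration of a seeded shuffle followed by a split):

```python
import random

def train_test_split(rows, test_size=0.25, random_state=42):
    """Shuffle rows reproducibly, then split into train and test portions."""
    rng = random.Random(random_state)  # seeded RNG -> same split every run
    shuffled = rows[:]                 # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))
train, test = train_test_split(data, test_size=0.25, random_state=0)
print(len(train), len(test))  # 75 25
```

Running the split twice with the same `random_state` yields identical sets, which is exactly why the seed matters for reproducible experiments.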
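The *MinMaxScaler* entry boils down to one formula, `(x - min) / (max - min)`. A hedged pure-Python sketch of that rescaling (scikit-learn's `MinMaxScaler` additionally remembers the fitted min/max to transform new data):

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Rescale values to [lo, hi] using (x - min) / (max - min)."""
    v_min, v_max = min(values), max(values)
    span = (v_max - v_min) or 1.0  # guard against constant columns
    return [lo + (hi - lo) * (x - v_min) / span for x in values]

print(min_max_scale([10, 20, 40]))  # smallest -> 0.0, largest -> 1.0
```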
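The *confusion matrix* and *accuracy score* entries can be illustrated together — accuracy is just (TP + TN) over all predictions. A small pure-Python sketch for the binary case (`confusion_counts` is an illustrative helper, not a library function):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary classification task."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(tp, tn, fp, fn)  # 2 2 1 1
```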
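The *R² score* entry has a simple closed form: R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot the total sum of squares around the mean. A minimal sketch (scikit-learn's `sklearn.metrics.r2_score` computes the same quantity):

```python
def r2_score(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot; 1.0 means a perfect fit."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # baseline error
    return 1 - ss_res / ss_tot

score = r2_score([3, 5, 7], [2.9, 5.1, 7.0])
```

A model that always predicts the mean scores 0.0, and a worse model goes negative — which is why R² is a useful baseline check.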
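Finally, the *cross-validation* entry: `cv=5` means the data is cut into 5 folds, and each fold takes one turn as the validation set while the other 4 train the model. A pure-Python sketch of how those index splits are formed (an illustration of the idea behind scikit-learn's `KFold`, not its implementation):

```python
def k_fold_indices(n_samples, k=5):
    """Return k (train_idx, val_idx) pairs; each sample validates exactly once."""
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]  # everything else
        folds.append((train, val))
        start += size
    return folds

folds = k_fold_indices(10, k=5)
print(len(folds))  # 5
```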