updating test.csv file #1

Open · wants to merge 2 commits into base: main
65 changes: 40 additions & 25 deletions src/fixml/data/checklist/checklist.csv/tests.csv
@@ -1,25 +1,40 @@
ID,Topic,Title,Requirement,Explanation,References,Is Evaluator Applicable
1.1,General,Write Descriptive Test Names,"Each test function should have a clear, descriptive name that accurately reflects the test's purpose and the specific functionality or scenario it examines.","If our tests are narrow and sufficiently descriptive, the test name itself may give us enough information to start debugging. This also helps us to identify what is being tested inside the function.","trenk2014, winters2024",0
1.2,General,Keep Tests Focused,"Each test should focus on a single scenario, using only one set of mock data and testing one specific behavior or outcome to ensure clarity and isolate issues.","If we test multiple scenarios in a single test, it is hard to identify exactly what went wrong. Keeping one scenario in a single test helps us to isolate problematic scenarios.",yu2018,0
1.3,General,Prefer Narrow Assertions in Unit Tests,Assertions within tests should be focused and narrow. Ensure you are only testing relevant behaviors of complex objects and not including unrelated assertions.,"If we have overly wide assertions (such as depending on every field of a complex output proto), the test may fail for many unimportant reasons. False positives are the opposite of actionable.",kent2024,0
1.4,General,Keep Cause and Effect Clear,Keep any modifications to objects and the corresponding assertions close together in your tests to maintain readability and clearly show the cause-and-effect relationship.,Refrain from using large global test data structures shared across multiple unit tests. This will allow for clear identification of each test's setup and the cause and effect.,yu2017,0
2.1,Data Presence,Ensure Data File Loads as Expected,"Ensure that data-loading functions correctly fetch datasets from predefined sources or online repositories. Additionally, verify that the functions handle errors or edge cases gracefully.","Reading data is a common scenario encountered in ML projects. This item ensures that the data exists, that it can be loaded in the expected format, and that the code exits gracefully when the data cannot be loaded.",msise2023,1
2.2,Data Presence,Ensure Saving Data/Figures Function Works as Expected,"Verify that functions for saving data and figures perform write operations correctly, checking that the operation succeeds and the content matches the expected format.",Writing operations create artifacts at different stages of the analysis. Making sure the artifacts are created as expected ensures that the artifacts we obtained at the end of the analysis would be consistent and reproducible.,msise2023,0
3.1,Data Quality,Files Contain Data,Ensure all data files are non-empty and contain the necessary data required for further analysis or processing tasks.,This checklist item is crucial as it confirms the presence of usable data within the files. It prevents errors in later stages of the project by ensuring data is available from the start.,msise2023,0
3.2,Data Quality,Data in the Expected Format,"Verify that the data matches the expected format. This involves checking the shape, data types, values, and any other properties.","Ensuring that data and images are in the correct format is essential for compatibility with processing tools and algorithms, which may not handle unexpected formats gracefully.",msise2023,1
3.3,Data Quality,Data Does Not Contain Null Values or Outliers,Check that data files are free from unexpected null values and identify any outliers that could affect the analysis. Tests should explicitly state if null values are part of expected data.,"Null values can lead to errors or inaccurate computations in many data processing applications, while outliers can distort statistical analyses and models. As such, these values should be checked before the data is ingested.",msise2023,0
3.4,Data Quality,Validate Outlier Detection and Handling,Detect outliers in the dataset. Ensure that the outlier detection mechanism is sensitive enough to flag true outliers while ignoring minor anomalies.,The detection method should be precise enough to catch significant anomalies without being misled by minor variations. This is important for maintaining data quality and ensuring the model's reliability in certain projects.,ISO/IEC5259,0
3.5,Data Quality,Check for Duplicate Records in Data,Verify that there are no duplicate records in the loaded data.,"Ensure that the dataset does not contain duplicate entries, as these can skew the results and reduce the model's performance. The test should identify any repeated records so they can be removed or investigated.",ISO/IEC5259,1
4.1,Data Ingestion,Cleaning and Transformation Functions Work as Expected,"Test that a fixed input to a function or model produces the expected output, focusing on one verification per test to ensure predictable behavior.",Fixed input and output during the data cleaning and transformation routines should be tested so that no unexpected transformation is introduced during these steps.,msise2023,0
4.2,Data Ingestion,Verify Data Split Proportion,Check that the data is split into training and testing sets in the expected proportion. Verify the split by checking the actual fraction of data points in the training and test sets.,"Confirm that the data is divided correctly into training and testing sets according to the intended ratio. This is crucial for ensuring that the model is trained and evaluated properly, with representative samples in each set.","openja2023studying, DBLP:conf/recsys/Kula15, singh2020mmf",1
5.1,Model Fitting,Validate Model Input and Output Compatibility,Confirm that the model accepts inputs of the correct shapes and types and produces outputs that meet the expected shapes and types without any errors.,Ensuring that inputs and outputs conform to expected specifications is critical for the correct functioning of the model in a production environment.,msise2023,0
5.2,Model Fitting,Check Model is Learning During Fit,"For parametric models, ensure that the model's weights update correctly per training iteration. For non-parametric models, verify that the data fits correctly into the model.","Making sure the training process is indeed training the model is crucial, as an untrained model is not fitted to any data and its performance will suffer.",msise2023,0
5.3,Model Fitting,Ensure Model Output Shape Aligns with Expectation,"Ensure that the structure of the model's output matches the expected format based on the task, such as checking the dimensions of the output against the number of labels in a classification task.",Correct output alignment confirms that the model is accurately interpreting the input data and making predictions that are sensible given the context.,jordan2020,1
5.4,Model Fitting,Ensure Model Output Aligns with Task Trained,"Verify that the model's output values are appropriate for its task, such as outputting probabilities that sum to 1 for classification tasks.",This ensures that the model's output is interpretable and relevant to the task it was trained for.,jordan2020,0
5.5,Model Fitting,Validate Loss Reduction on Gradient Update,"If using gradient descent for training, verify that a single gradient step on a batch of data results in a decrease in the model's training loss.",A decrease in training loss after a gradient update demonstrates that the training loop is wired correctly and that the model is able to fit the data.,jordan2020,0
5.6,Model Fitting,Check for Data Leakage,"Confirm that there is no leakage of data between training, validation, and testing sets, or across cross-validation folds, to ensure the integrity of the splits.","Data leakage can compromise the model's ability to generalize to unseen data, making it crucial to ensure datasets are properly segregated.",jordan2020,0
6.1,Model Evaluation,Verify Evaluation Metrics Implementation,"Verify that the evaluation metrics are correctly implemented and appropriate for the model's task. Verify the metric computations with expected values to validate correctness.","Confirm that the metrics used to evaluate the model are implemented correctly and are suitable for the specific task at hand. This helps in accurately assessing the model's performance and understanding its strengths and weaknesses.","openja2023studying, DBLP:conf/recsys/Kula15, singh2020mmf",1
6.2,Model Evaluation,Evaluate Model's Performance Against Thresholds,"Compute evaluation metrics for both the training and testing datasets. Verify that these metrics exceed threshold values, indicating acceptable model performance.","This ensures that the model's performance meets or exceeds certain benchmarks. By setting thresholds for metrics like accuracy or precision, you can automatically flag models that underperform or overfit. This is crucial for maintaining a baseline quality of results and for ensuring that the model meets the requirements necessary for deployment.","openja2023studying, DBLP:conf/recsys/Kula15, singh2020mmf",1
7.1,Artifact Testing,Invariance Tests,"There are tests to ensure that perturbations to the input which are expected to have no effect do not change the model's output. For example, in sentiment analysis tasks, changing the name of the subject or the name of a location should not affect the sentence's sentiment.",Models should make predictions in a robust and consistent manner: changes to the input which are known or expected to have no effect on the model's task should not affect the model's prediction.,"jordan2020, ribeiro2020accuracy",0
7.2,Artifact Testing,Directional Expectation Tests,"There are tests to ensure that perturbations to the input which are expected to shift the predictions in a particular direction do change the model's output in that direction. For example, in a regression model, increasing the number of bathrooms (holding all other features constant) should not cause a drop in price.",Models should make predictions in a robust and consistent manner: changes to the input which are known or expected to have a directional effect should be reflected in the model's predictions.,"jordan2020, ribeiro2020accuracy",0
7.3,Artifact Testing,Minimum Functionality Tests,"If there are critical scenarios where prediction errors lead to high consequences, tests should be written to ensure the model's behavior in those scenarios is as expected. For example, in text classification, a test can check that the model performs as expected on sentences of fewer than 5 words if the performance on short pieces of text is critical.","In the real world, some model outcomes are often more important than others. Traditional machine learning systems optimize for overall quality, which might lead to over-optimistic and/or non-representative performance metrics in cases where the importance of certain outcomes far outweighs the rest. For example, falsely identifying valid emails as spam has a more serious consequence than letting a spam email through as valid.","jordan2020, ribeiro2020accuracy",0
ID,Topic,Requirement,References,Is Evaluator Applicable
1.1,General,Test machine learning pipeline can run end-to-end on a small subset of the data and handle failures appropriately,Breck et al (2017),1
1.2,General,Test loading files (e.g. data; models) works as expected and handle failures appropriately,Microsoft Industry Solutions Engineering Team (2024),1
1.3,General,Test saving files (e.g. data; models) works as expected and handle failures appropriately,,1
2.1,Data extraction,Test connection to the data source (e.g. API; URL or file system) is successful and handle failures appropriately,,1
2.2,Data extraction,Test extraction of data from source works as expected and handle failures appropriately,Microsoft Industry Solutions Engineering Team (2024),1
2.3,Data extraction,Test extracted data filetype; structure and/or schema is correct and handle failures appropriately,,1
3.1,Data Quality,Test validation of data format and handle invalid formats appropriately,Microsoft Industry Solutions Engineering Team (2024),1
3.2,Data Quality,Test checks data schema/column names and handle errors or missingness appropriately,Chorev et al (2022),1
3.3,Data Quality,Test checks for data types and handle identified incorrect data types appropriately,Chorev et al (2022);Microsoft Industry Solutions Engineering Team (2024),1
3.4,Data Quality,Test checks for duplicates and handle identified duplicates appropriately,Chorev et al (2022);Microsoft Industry Solutions Engineering Team (2024),1
3.5,Data Quality,Test checks for category levels and handles any single-value columns and string mismatches appropriately,Chorev et al (2022),1
3.6,Data Quality,Test checks for missingness and handle identified missingness appropriately,Chorev et al (2022);Microsoft Industry Solutions Engineering Team (2024),1
3.7,Data Quality,Test checks for outliers or anomalies and handles them appropriately,Chorev et al (2022);Microsoft Industry Solutions Engineering Team (2024);Breck et al (2017),1
3.8,Data Quality,Test checks for anomalous correlations between target and features and between features. Handle anomalous correlations appropriately,Chorev et al (2022),1
3.9,Data Quality,Test checks target distribution and handle deviations from expectations appropriately,Chorev et al (2022),1
4.1,Data transformation,Test cleaning/transforming and/or feature engineering functions work as expected and handle failures appropriately. Common data transformations are: One-hot encoding; Ordinal variable encoding; Binning/discretization; Tokenization and vectorization; Log or power transformation; Feature Polynomial Expansion; Signal processing; Dimensionality reduction,Microsoft Industry Solutions Engineering Team (2024);Breck et al (2017),1
5.1,Data splitting,Test splitting of data to training and test sets is of expected proportion and/or sizes and handle incorrect splits appropriately,Chorev et al (2022),1
5.2,Data splitting,Test splitting of data does not duplicate observations between the training and test sets and handle overlaps appropriately,Chorev et al (2022),1
5.3,Data splitting,Test splitting of data does not split groups of dependent observations between the training and test sets (e.g. time or geospatial) and handle any leakage appropriately,Chorev et al (2022),1
5.5,Data splitting,Test pre-processor is only created from the training set and handle failures appropriately,,1
6.1,Model training,Test model accepts the correct inputs and handles errors appropriately,Microsoft Industry Solutions Engineering Team (2024),1
6.2,Model training,Test model weights update during training and handle errors appropriately,Microsoft Industry Solutions Engineering Team (2024);Breck et al (2017),1
7.1,Model outputs and evaluation,Test model produces the correctly shaped outputs and handle errors appropriately,Microsoft Industry Solutions Engineering Team (2024),1
7.2,Model outputs and evaluation,Test model output ranges align with our expectations and handle deviations appropriately,Jordan (2020);Breck et al (2017),1
7.3,Model outputs and evaluation,Test model performance compared to a very simple or baseline model and handle any performance issues appropriately,Breck et al (2017),1
7.4,Model outputs and evaluation,Test model for systematic errors and handle errors appropriately,Microsoft Industry Solutions Engineering Team (2024),1
7.5,Model outputs and evaluation,Test model for directionality of predictions and handle errors appropriately,Ribeiro et al. (2020),1
7.6,Model outputs and evaluation,Test model predictions for invariance and handle violations appropriately,Ribeiro et al. (2020),1
7.7,Model outputs and evaluation,Test model performance meets minimum expectations and handle subpar performance appropriately,Chorev et al (2022),1
7.8,Model outputs and evaluation,Test model performance across important data slices and handle subpar performance on particular slices appropriately,Breck et al (2017),1
8.1,Model stability,Test for model weights and/or performance stability during training and handle instability appropriately,,1
9.1,Bias/fairness issues,Test checks for bias in data sets (overall; training; test; predictions) and handle any data bias appropriately,,1
9.2,Bias/fairness issues,Test for performance bias for protected groups and handle any performance bias appropriately,Chorev et al (2022);Breck et al (2017),1
10.1,Data drift,Test code checks for drift in prediction data distribution or feature correlations and handle drift appropriately,Chorev et al (2022);Breck et al (2017),1
10.2,Data drift,Test model prediction performance against defined thresholds and handle any performance drift appropriately,,1
11.1,Reproducibility,Test running the entire machine learning project pipeline (start to finish) can be automated and handle any errors appropriately,Breck et al (2017),1
11.2,Reproducibility,Test model weights and/or prediction outputs are not meaningfully different on different runs. Handle any differences appropriately,Breck et al (2017),1
11.3,Reproducibility,Test model weights and/or prediction outputs are not meaningfully different on different operating systems and handle differences appropriately,,1
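
A few sketches of how rows from these checklists might translate into tests follow. All data, names, and thresholds in them are illustrative assumptions, not part of the checklist or the fixml codebase. First, the duplicate-record checks (item 3.5 in the old list, 3.4 in the new one), assuming the data is loaded into a pandas DataFrame:

```python
import pandas as pd

def test_no_duplicate_records():
    # Illustrative stand-in for the project's loaded dataset.
    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    # duplicated() flags rows that exactly repeat an earlier row;
    # a non-zero count means duplicates to remove or investigate.
    assert df.duplicated().sum() == 0
```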
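The split-proportion check (old item 4.2, new item 5.1) can compare the realised fraction against the requested one; the tolerance here is an arbitrary choice:

```python
import pandas as pd
import pytest
from sklearn.model_selection import train_test_split

def test_split_proportion():
    df = pd.DataFrame({"x": range(100), "y": [0, 1] * 50})
    train, test = train_test_split(df, test_size=0.25, random_state=42)
    # The realised test fraction should match the requested proportion,
    # and no rows should be lost or invented by the split.
    assert len(test) / len(df) == pytest.approx(0.25, abs=0.01)
    assert len(train) + len(test) == len(df)
```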
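New item 5.2 (no observations duplicated between the training and test sets) reduces to a disjointness check on row indices, assuming the indices uniquely identify observations:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def test_no_overlap_between_splits():
    df = pd.DataFrame({"x": range(50)})
    train, test = train_test_split(df, test_size=0.2, random_state=0)
    # If the same index appears on both sides, an observation has
    # leaked from the training set into the test set.
    assert set(train.index).isdisjoint(test.index)
```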
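New item 5.5 (the pre-processor is created from the training set only) can be tested by asserting that the fitted statistics come from the training data, sketched here with scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def test_preprocessor_fit_on_train_only():
    train = np.array([[1.0], [2.0], [3.0]])
    test = np.array([[10.0]])
    scaler = StandardScaler().fit(train)  # fit on the training set only
    # The learned mean must come from `train`, never from `test`.
    assert np.isclose(scaler.mean_[0], train.mean())
    # Test data is transformed with training statistics, not re-fitted.
    scaled = scaler.transform(test)
    assert np.isclose(scaled[0, 0], (10.0 - train.mean()) / train.std())
```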
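Old items 5.2 and 5.5 (weights update during fit; loss decreases on a single gradient step) and new item 6.2 share one core assertion, sketched here with a toy PyTorch model; the sizes, learning rate, and seed are arbitrary:

```python
import torch
import torch.nn as nn

def test_single_gradient_step():
    torch.manual_seed(0)
    model = nn.Linear(10, 1)
    loss_fn = nn.MSELoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(32, 10), torch.randn(32, 1)

    weights_before = model.weight.detach().clone()
    loss_before = loss_fn(model(x), y).item()

    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

    # The weights must actually move (old item 5.2 / new item 6.2) ...
    assert not torch.equal(model.weight.detach(), weights_before)
    # ... and the loss on the same batch must drop (old item 5.5).
    assert loss_fn(model(x), y).item() < loss_before
```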
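Old item 5.4 and new item 7.2 constrain output ranges; for a classifier, predicted probabilities should be non-negative and sum to 1 per row. The softmax below is a stand-in for a real model head:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    # Subtracting the row max keeps the exponent numerically stable.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def test_probabilities_are_valid():
    logits = np.random.default_rng(0).normal(size=(8, 3))
    probs = softmax(logits)
    assert np.allclose(probs.sum(axis=1), 1.0)
    assert (probs >= 0).all()
```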
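The directional expectation test (old item 7.2, new item 7.5) from the bathrooms-and-price example might look as follows; the synthetic data and linear model are stand-ins for a project's own artifacts:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def test_more_bathrooms_does_not_lower_price():
    rng = np.random.default_rng(0)
    X = rng.uniform(1, 5, size=(200, 2))  # columns: size, bathrooms
    y = 100 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 1, size=200)
    model = LinearRegression().fit(X, y)

    base = np.array([[3.0, 1.0]])
    more_baths = np.array([[3.0, 2.0]])   # same size, one more bathroom
    # Holding all other features constant, adding a bathroom should not
    # cause a drop in the predicted price.
    assert model.predict(more_baths)[0] >= model.predict(base)[0]
```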
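Finally, new item 11.2 (outputs not meaningfully different across runs) assumes training can be wrapped in a seedable function; two runs with the same seed should then agree up to floating-point tolerance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_and_predict(seed: int) -> np.ndarray:
    rng = np.random.default_rng(0)  # fixed synthetic dataset
    X = rng.normal(size=(100, 3))
    y = (X[:, 0] > 0).astype(int)
    model = LogisticRegression(random_state=seed).fit(X, y)
    return model.predict_proba(X)[:, 1]

def test_runs_are_reproducible():
    assert np.allclose(fit_and_predict(0), fit_and_predict(0))
```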