1 | | -# DfMD |
2 | | -Datasheets for Medical Datasets app. |
3 | | - |
4 | | -Requirements for Data and Data Dictionary files: |
5 | | -- .csv or .xlsx files |
6 | | -- values are not case sensitive |
7 | | -- List of NaN values: |
8 | | - - ["#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None", "n/a", "nan", "null", "na", "-"] |
9 | | - |
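As a rough illustration of the NaN list above, the sketch below reads CSV text with the standard library and replaces any listed token with `None`. The token list is copied from the removed README text. The function name `read_rows` and the exact-match comparison are assumptions made for this example, not the app's confirmed behavior.

```python
import csv
import io

# Tokens listed as NaN values in the removed README text, copied verbatim.
NAN_TOKENS = frozenset([
    "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None",
    "n/a", "nan", "null", "na", "-",
])


def read_rows(csv_text):
    """Parse CSV text into dicts, replacing listed NaN tokens with None.

    Hypothetical helper: exact string matching is an assumption.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            key: (None if isinstance(value, str) and value in NAN_TOKENS else value)
            for key, value in row.items()
        }
        for row in reader
    ]


rows = read_rows("id,age\n1,34\n2,N/A\n3,-\n")
print(rows)
```

Values stay as strings here; converting them to numbers or dates would be a separate step.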
10 | | -Data Dictionary: |
11 | | -- Required Columns: |
12 | | - - "variable type" |
13 | | - - only values permitted: |
14 | | - - "continuous", "categorical", "date and time" |
15 | | - - "role" |
16 | | - - only values permitted: |
17 | | - - "outcome", "feature", "identifier", "other" |
18 | | - - variables labeled as "other" are not evaluated in the app. |
19 | | - - If multiple variables are labeled as "identifier", only the first is checked |
20 | | - |
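The data dictionary rules above could be checked along the following lines. This is a minimal sketch: the function name `validate_dictionary` and the list-of-dicts input are hypothetical, and lower-casing the column names (not only the values) is an assumption. The required columns and permitted values come from the removed README text.

```python
# Required columns and permitted values, as listed in the removed README text.
REQUIRED_COLUMNS = ("variable type", "role")
PERMITTED = {
    "variable type": {"continuous", "categorical", "date and time"},
    "role": {"outcome", "feature", "identifier", "other"},
}


def validate_dictionary(rows):
    """Return a list of problems found in data dictionary rows (dicts).

    Hypothetical helper, not the app's actual implementation.
    """
    problems = []
    for index, row in enumerate(rows):
        # Compare case-insensitively; applying this to column names is an assumption.
        normalized = {
            str(key).strip().lower(): str(value).strip().lower()
            for key, value in row.items()
        }
        for column in REQUIRED_COLUMNS:
            if column not in normalized:
                problems.append(f"row {index}: missing column '{column}'")
            elif normalized[column] not in PERMITTED[column]:
                problems.append(
                    f"row {index}: '{normalized[column]}' not permitted in '{column}'"
                )
    return problems


example = [
    {"Variable Type": "Continuous", "Role": "Feature"},
    {"variable type": "text", "role": "other"},
]
print(validate_dictionary(example))
```

The sketch does not cover the "only the first identifier is checked" rule, since the text does not say what that check consists of.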
21 | | -Data: |
22 | | - - List of allowed characters: |
23 | | - - '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c' |
24 | | - |
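The allowed-character list above is identical to Python's `string.printable` constant (digits, ASCII letters, punctuation, and whitespace), so a check for disallowed characters can be sketched briefly. The helper name `disallowed_characters` is hypothetical, and how the app itself performs this check is not stated in the text.

```python
import string

# string.printable matches the allowed-character list in the removed README text.
ALLOWED = set(string.printable)


def disallowed_characters(text):
    """Return the sorted distinct characters in text outside the allowed set."""
    return sorted(set(text) - ALLOWED)


print(disallowed_characters("age (years)"))
print(disallowed_characters("café µg"))
```

Under this rule, any non-ASCII character such as an accented letter or a unit symbol would be flagged.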
| 1 | +# DAIMS: Datasheets for AI and Medical Datasets |
| 2 | + |
| 3 | + |
| 4 | + |
| 5 | + |
| 6 | + |
| 7 | + |
| 8 | + |
| 9 | +Despite progress in data engineering, inconsistencies in data validation and documentation procedures continue to cause confusion and technical challenges in research involving machine learning (ML). While frameworks like **Datasheets for Datasets** have made strides in addressing these challenges, there is room for improvement to better prepare datasets for ML pipelines. |
| 10 | + |
| 11 | +To bridge this gap, we introduce **DAIMS**: Datasheets for AI and Medical Datasets. DAIMS extends the foundational framework with additional tools and guidance tailored specifically for medical datasets and AI applications. |
| 12 | + |
| 13 | +## Key Features |
| 14 | + |
| 15 | +1. **Comprehensive Checklist** |
| 16 | + A 24-point checklist that covers common data standardization requirements. A subset of these checks is automated by the DAIMS software tool to validate dataset readiness. |
| 17 | + |
| 18 | +2. **Data Documentation Form** |
| 19 | + An extended documentation form designed to capture essential metadata, pose relevant research questions, and ensure datasets are well-prepared for ML analysis. |
| 20 | + |
| 21 | +3. **Data Dictionary Table** |
| 22 | + A tabular format to document variable descriptions, data types, units, and any relevant details about the dataset. |
| 23 | + |
| 24 | +4. **Flowchart for ML Analyses** |
| 25 | + A guided flowchart that maps research questions to suggested ML methods, providing researchers with a clear pathway to address their objectives. |
| 26 | + |
| 27 | +5. **Software Tool** |
| 28 | + A publicly available tool to assist in dataset preparation by automating key aspects of the checklist validation process. |
| 29 | + |
| 30 | +6. **Online App** |
| 31 | + DAIMS is available as an easy-to-use online app hosted at [https://daims-app.streamlit.app/](https://daims-app.streamlit.app/), enabling efficient dataset evaluation and preparation. |
| 32 | + |
| 33 | +## Benefits of DAIMS |
| 34 | + |
| 35 | +- **Standardization**: Promotes consistent practices for preparing datasets in medical research. |
| 36 | +- **Guidance**: Offers actionable insights through the flowchart and checklist, helping researchers align datasets with their ML objectives. |
| 37 | +- **Automation**: Saves time by automating key validation processes. |
| 38 | +- **Documentation**: Enhances transparency and reproducibility through detailed data documentation. |
| 39 | + |
| 40 | +## Getting Started |
| 41 | + |
| 42 | +1. **Access the Repository** |
| 43 | + Clone or download the DAIMS repository from GitHub: |
| 44 | + [https://github.com/PERSIMUNE/DAIMS](https://github.com/PERSIMUNE/DAIMS) |
| 45 | + |
| 46 | +2. **Explore the Online App** |
| 47 | + Use the online app for streamlined dataset evaluation: |
| 48 | + [https://daims-app.streamlit.app/](https://daims-app.streamlit.app/) |
| 49 | + |
| 50 | +3. **Follow the Checklist** |
| 51 | + Refer to the provided checklist to ensure datasets meet the 24 common data standardization requirements. |
| 52 | + |
| 53 | +4. **Document Your Dataset** |
| 54 | + Use the extended form and data dictionary table to comprehensively document your dataset. |
| 55 | + |
| 56 | +5. **Use the Flowchart** |
| 57 | + Map your research questions to suggested ML methods for clearer analytical direction. |
| 58 | + |
| 59 | +## Contributing |
| 60 | + |
| 61 | +We welcome contributions to improve DAIMS! Feel free to open issues, submit pull requests, or provide feedback on the GitHub repository. |
| 62 | + |
| 63 | +## License |
| 64 | + |
| 65 | +This project is licensed under the [MIT License](LICENSE). |
| 66 | + |