A simplified and improved version based on the original RecSysDatasets/conversion_tools
This repository provides a lightweight, improved version of the conversion tools originally included in RecSysDatasets. Its primary goal is to convert public recommendation datasets, such as MovieLens and Amazon, into RecBole-compatible atomic files.
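For orientation, a RecBole atomic file is a tab-separated text file whose header names each column as field:type (for example user_id:token or rating:float). The sketch below is not the code shipped in this repository. It is a minimal illustration, assuming a MovieLens-style ratings.csv with the columns userId, movieId, rating and timestamp; the function name ratings_to_inter is hypothetical.

```python
import csv
import io

# Mapping from MovieLens ratings.csv columns to RecBole "field:type" headers.
# The source column names are an assumption about the raw ratings.csv layout.
INTER_COLUMNS = [
    ("userId", "user_id:token"),
    ("movieId", "item_id:token"),
    ("rating", "rating:float"),
    ("timestamp", "timestamp:float"),
]


def ratings_to_inter(src, dst):
    """Copy rows from a ratings CSV stream into a tab-separated .inter stream.

    Returns the number of data rows written (the header is not counted).
    """
    reader = csv.DictReader(src)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow([header for _, header in INTER_COLUMNS])
    n_rows = 0
    for row in reader:
        writer.writerow([row[col] for col, _ in INTER_COLUMNS])
        n_rows += 1
    return n_rows


if __name__ == "__main__":
    # Tiny in-memory example; real usage would pass open file handles instead.
    raw = io.StringIO("userId,movieId,rating,timestamp\n1,10,4.0,1000\n")
    out = io.StringIO()
    ratings_to_inter(raw, out)
    print(out.getvalue())
```

Because the function works on file-like objects, the same code can be pointed at real files with open("ratings.csv", newline="") and open("ml.inter", "w", newline="").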
The redesign is motivated by issues encountered when using the original conversion_tools package, especially when processing the Amazon 2023 datasets, where numerous parsing errors make direct conversion infeasible. To address these limitations, key modules were rewritten with a focus on clarity and efficiency.
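The text above does not say which parsing errors occur, so the following is only a sketch of one defensive approach, not the method used in light_extended.py. It assumes the raw Amazon 2023 review files are JSON Lines, that each record carries user_id, parent_asin, rating and timestamp fields, and that unusable lines should be counted and skipped rather than abort the run. The function name parse_reviews is hypothetical.

```python
import io
import json

# Field names assumed for Amazon 2023 review records (an assumption, not
# taken from this repository).
FIELDS = ("user_id", "parent_asin", "rating", "timestamp")


def parse_reviews(lines):
    """Parse JSON Lines in a single pass, skipping records that cannot be used.

    Returns (rows, skipped): rows is a list of tuples ordered like FIELDS,
    skipped counts lines that were malformed or missing a required field.
    """
    rows, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are ignored, not counted as errors
        try:
            record = json.loads(line)
            rows.append(tuple(record[name] for name in FIELDS))
        except (json.JSONDecodeError, KeyError, TypeError):
            # Malformed JSON, a missing field, or a non-object record.
            skipped += 1
    return rows, skipped


if __name__ == "__main__":
    sample = io.StringIO('{"user_id": "U1", "parent_asin": "A1", "rating": 5.0, "timestamp": 1}\n')
    print(parse_reviews(sample))
```

Reporting the skipped count makes it easy to check how much of a sub-dataset was dropped before trusting the converted output.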
Only the following files are modified or added (paths refer to the original repository). They can be dropped directly into the corresponding locations of the original conversion_tools package:
src/base_dataset.py
- Added general preprocessing utilities for MovieLens datasets.
src/extended_data.py → src/light_extended.py
- A newly implemented module, light_extended.py, replacing the original extended_data.py;
- Currently ONLY supports Amazon 2023 (multiple sub-datasets with similar structure) and MovieLens (from 100k to 32M);
- Removed nested for loops; all conversions are simplified;
- Combined handling of structurally similar datasets to avoid redundant code;
- Given the structural similarity among Amazon sub-datasets, testing has been conducted on only a few subsets. If you encounter issues, please report them through GitHub; updates will follow promptly.
src/utils.py
- Adjusted to align with the redesigned .py files;
- NOTE: Only MovieLens and Amazon datasets are supported at the moment!
run.py
- Integrates the new lightweight modules;
- Supports richer movie metadata for MovieLens (meta.csv).
meta.csv
- Supplementary metadata obtained using the TMDb API, including movie descriptions, release dates, and runtimes. This improves the quality of MovieLens item features when preparing RecBole datasets.
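Conceptually, the meta.csv enrichment described above is a left join of the MovieLens item list with the TMDb-derived fields. The sketch below illustrates that idea only; it assumes meta.csv is keyed by movieId and uses the column names description, release_date and runtime, which may differ from the actual file. The function name build_item_rows and the chosen field types are likewise hypothetical.

```python
import csv
import io

# Hypothetical meta.csv columns mapped to RecBole "field:type" headers.
META_COLUMNS = [
    ("description", "description:token_seq"),
    ("release_date", "release_date:token"),
    ("runtime", "runtime:float"),
]


def build_item_rows(movies, meta):
    """Left-join movie rows with metadata rows on movieId.

    Movies without a metadata entry keep empty strings for the extra fields.
    Returns a list of rows; the first row is the atomic-file header.
    """
    meta_by_id = {row["movieId"]: row for row in csv.DictReader(meta)}
    header = ["item_id:token", "movie_title:token_seq"]
    header += [name for _, name in META_COLUMNS]
    rows = [header]
    for movie in csv.DictReader(movies):
        extra = meta_by_id.get(movie["movieId"], {})
        rows.append(
            [movie["movieId"], movie["title"]]
            + [extra.get(col, "") for col, _ in META_COLUMNS]
        )
    return rows


if __name__ == "__main__":
    # Synthetic in-memory data; real usage would pass open file handles.
    movies = io.StringIO("movieId,title,genres\n1,Movie A (2000),Drama\n")
    meta = io.StringIO("movieId,description,release_date,runtime\n1,Some text,2000-01-01,90\n")
    for row in build_item_rows(movies, meta):
        print("\t".join(row))
```

Keeping unmatched movies with empty fields, instead of dropping them, preserves every item that appears in the interaction data.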
This project is NOT an official fork; it is an independent, lightweight redesign intended to complement the original tools.
If you would like to download the converted atomic files directly, please visit the Google Drive. These atomic files are generated by running the conversion tool on the raw data downloaded from the official sources, without any additional processing such as filtering or sorting.
If you find any issues or would like to request additional dataset support, please open a GitHub issue or contact me (email: ag.wrld.s@gmail.com) directly :)