Levia-Mobius/Simplified-Conversion_Tools

Simplified Conversion Tools for RecBole Atomic Files

A simplified and improved version based on the original RecSysDatasets/conversion_tools

Overview

This repository provides a lightweight and improved version of the conversion tools originally included in RecSysDatasets. Its primary goal is to convert public recommendation datasets, such as MovieLens and Amazon, into RecBole-compatible Atomic Files.

The redesign is motivated by issues encountered when using the original conversion_tools package, especially when processing the Amazon 2023 datasets, where numerous parsing errors make direct conversion infeasible. To address these limitations, key modules were rewritten with a focus on clarity and efficiency.
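For readers new to the target format, here is a minimal sketch of what writing an interaction (.inter) Atomic File involves: a tab-separated file whose header annotates each field as name:type. It is an illustration, not the code in this repository. The function name write_inter, the INTER_FIELDS mapping, and the sample rows are assumptions, and the source column names follow the ratings.csv layout of recent MovieLens releases.

```python
import csv
import io

# Hypothetical mapping: (source column, atomic field name, RecBole field type).
INTER_FIELDS = [
    ("userId", "user_id", "token"),
    ("movieId", "item_id", "token"),
    ("rating", "rating", "float"),
    ("timestamp", "timestamp", "float"),
]


def write_inter(rows, out, fields=INTER_FIELDS):
    """Write dict-like rows to `out` as a tab-separated atomic file."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # The header carries the type of each field, e.g. "user_id:token".
    writer.writerow([f"{name}:{ftype}" for _, name, ftype in fields])
    for row in rows:
        writer.writerow([row[src] for src, _, _ in fields])


# Made-up sample input in the ratings.csv layout.
raw = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,31,2.5,1260759144\n"
    "1,1029,3.0,1260759179\n"
)
buf = io.StringIO()
write_inter(csv.DictReader(raw), buf)
print(buf.getvalue())
```

Keeping the column mapping in one table is one way to let structurally similar datasets share a single code path.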

Key Improvements

Only the following modules/.py files are modified (paths below refer to the original repository). These files can be directly dropped into the corresponding locations of the original conversion_tools package:

  1. src/base_dataset.py
    • Added general preprocessing utilities for MovieLens datasets.
  2. src/extended_data.py → src/light_extended.py
    • A newly implemented module: light_extended.py, replacing the original extended_data.py;
    • Currently ONLY supports Amazon 2023 (multiple sub-datasets with similar structure) and MovieLens (from 100k to 32M);
    • Removed nested for loops; all conversions are simplified;
    • Combined handling of structurally similar datasets to avoid redundant code;
    • Because the Amazon sub-datasets are structurally similar, testing has been conducted on only a few of them. If you encounter issues, please report them on GitHub; updates will follow promptly.
  3. src/utils.py
    • Adjusted to align with the redesigned .py files;
    • NOTE: Only MovieLens and Amazon datasets are supported at the moment!
  4. run.py
    • Integrates the new lightweight modules;
    • Supports richer movie metadata for MovieLens (meta.csv).
  5. meta.csv
    • Supplementary metadata obtained using the TMDb API, including movie descriptions, release dates, and runtimes. This improves the quality of MovieLens item features when preparing RecBole datasets.
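The metadata enrichment described for meta.csv can be pictured as a left join on the movie ID. The sketch below is one possible way to do it without nested loops, not the repository's actual implementation. The column names (movieId, overview, release_date, runtime), the helper names, and the sample rows are all hypothetical; check the real meta.csv header before adapting it.

```python
import csv
import io


def index_by(rows, key):
    """Index rows by `key` so each lookup is a dict access, not a nested scan."""
    return {row[key]: row for row in rows}


def enrich_items(movies, meta, key="movieId",
                 extra=("overview", "release_date", "runtime")):
    """Left-join `meta` onto `movies`; movies without metadata get empty strings."""
    meta_by_id = index_by(meta, key)
    for movie in movies:
        extras = meta_by_id.get(movie[key], {})
        yield {**movie, **{col: extras.get(col, "") for col in extra}}


# Made-up sample data; the second movie has no metadata row.
movies = csv.DictReader(io.StringIO(
    "movieId,title\n"
    "1,Movie A\n"
    "2,Movie B\n"
))
meta = csv.DictReader(io.StringIO(
    "movieId,overview,release_date,runtime\n"
    "1,Example overview,1995-01-01,90\n"
))
items = list(enrich_items(movies, meta))
print(items)
```

Building the index once keeps the join linear in the number of movies, which matters for the larger MovieLens releases.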

Notes

This project is NOT an official fork; it is an independent lightweight redesign intended to complement the original tools.

If you would like to download the converted atomic files directly, please visit the Google Drive. These atomic files were generated by running the conversion tool on the raw data downloaded from the official sources, without any additional processing such as filtering or sorting.
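If you download the converted files and want to sanity-check them before any filtering or sorting, a small reader that honours the typed header may be enough. The following is a rough sketch under the assumption that the files are tab-separated with name:type headers; read_atomic and CASTS are illustrative names, not part of this tool or of RecBole.

```python
import csv
import io

# Converters for the four atomic field types; sequence fields are
# assumed to be space-separated.
CASTS = {
    "token": str,
    "float": float,
    "token_seq": str.split,
    "float_seq": lambda s: [float(x) for x in s.split()],
}


def read_atomic(handle):
    """Yield one dict per row, casting each value by its header type."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [col.rsplit(":", 1) for col in next(reader)]
    for row in reader:
        yield {
            name: CASTS.get(ftype, str)(value)
            for (name, ftype), value in zip(header, row)
        }


# Made-up sample content standing in for a downloaded .inter file.
sample = io.StringIO(
    "user_id:token\titem_id:token\trating:float\n"
    "1\t31\t2.5\n"
)
rows = list(read_atomic(sample))
print(rows)
```

To inspect a real file, pass an open file handle instead of the in-memory sample.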

Contact

If you find any issues or would like to request additional dataset support, please open a GitHub issue or contact me (email: ag.wrld.s@gmail.com) directly :)
