gpt2-bert-reddit-bot

A series of scripts for fine-tuning GPT-2 and BERT models on Reddit data to generate realistic replies.

Jupyter notebooks are also available on Google Colab here.

See my blog post for a walkthrough on running the scripts.

processing training data

I use pandas read_gbq to read Reddit data from Google BigQuery. get_reddit_from_gbq.py automates the download. prep_data.py cleans the data and transforms it into a format usable by the GPT-2 and BERT fine-tuning scripts. I manually upload the output of prep_data.py to Google Drive so that the Google Colab notebooks can use it.
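As a rough illustration of the download step, here is a minimal sketch of querying BigQuery through pandas read_gbq. It is not the contents of get_reddit_from_gbq.py: the function names, the column names (id, parent_id, body, score, subreddit), the table name and the project id are all assumptions made for the example. pandas delegates read_gbq to the pandas-gbq package, and recent pandas releases deprecate it in favour of calling pandas_gbq.read_gbq directly.

```python
def build_comment_query(subreddit: str, table: str, limit: int = 1000) -> str:
    """Build a SQL query that selects comments from a single subreddit.

    `table` is a fully qualified BigQuery table name (hypothetical here),
    and the selected column names are assumptions about its schema.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Accept only letters, digits and underscores so the name is safe to inline.
    if not subreddit.replace("_", "").isalnum():
        raise ValueError(f"unexpected subreddit name: {subreddit!r}")
    return (
        "SELECT id, parent_id, body, score "
        f"FROM `{table}` "
        f"WHERE subreddit = '{subreddit}' "
        f"LIMIT {limit}"
    )


def download_comments(subreddit: str, table: str, project_id: str, limit: int = 1000):
    """Run the query and return the result as a pandas DataFrame."""
    # Imported lazily: this needs pandas, pandas-gbq and Google Cloud credentials.
    import pandas as pd

    query = build_comment_query(subreddit, table, limit)
    return pd.read_gbq(query, project_id=project_id, dialect="standard")


if __name__ == "__main__":
    # Placeholder table name; only the query string is built, nothing is downloaded.
    print(build_comment_query("books", "my-project.my_dataset.reddit_comments", limit=5))
```

Keeping the query builder separate from the network call makes it possible to inspect the SQL without BigQuery access.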

Here is a sample of the data format output by prep_data.py:

"Is there any way this could be posted as a document so it can be saved permanently, outwith reddit? [SEP] Could you not just copy and paste it yourself into a word processor document?"
"Seems like alt-history is a format that would almost *require* a detailed outline before writing [SEP] Are you aware of any good outliners or character sheets for writing novels? I like to organize and plan on the macro level and then, knowing what I want to accomplish and with which character, I can then discovery write at the micro level. "
"This is depressing [SEP] There are the books and they are excellent. There are also audiobooks which are also outstanding. Including side story novellas!

Also there is no apparent sign of James S. A. Corey (which is actually two authors: Daniel Abraham and Ty Franck) going all George R. R. Martin / Robert Jordan."
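The sample above pairs a comment with its reply, joined by a [SEP] marker, and wraps each pair in double quotes so that replies spanning several lines stay in one record. A minimal sketch of producing and reading that layout with Python's csv module follows. The helper names are hypothetical, and the use of CSV quoting is an assumption based on the sample, not a description of prep_data.py.

```python
import csv
import io
from typing import Iterable, Tuple

SEP = " [SEP] "


def format_pair(comment: str, reply: str) -> str:
    """Join a comment and its reply with the [SEP] marker."""
    return f"{comment}{SEP}{reply}"


def split_pair(text: str) -> Tuple[str, str]:
    """Split a formatted record back into (comment, reply) at the first marker."""
    comment, _, reply = text.partition(SEP)
    return comment, reply


def write_pairs(pairs: Iterable[Tuple[str, str]], handle) -> None:
    """Write one quoted record per pair; embedded newlines stay inside the quotes."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL)
    for comment, reply in pairs:
        writer.writerow([format_pair(comment, reply)])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_pairs([("This is depressing", "There are the books.\n\nAlso audiobooks.")], buffer)
    print(buffer.getvalue())
```

Splitting at the first marker only means a reply that happens to contain " [SEP] " is kept whole.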

pulling reddit comments with praw

I use praw (the Python Reddit API Wrapper) to download comments.

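import praw

# Authenticate as a script application; replace the placeholder strings with your own credentials.
reddit = praw.Reddit(client_id='client_id',
                     client_secret='client_secret',
                     password='reddit_password',
                     username='reddit_username',
                     user_agent='reddit user agent name')

...
subreddit = reddit.subreddit(subreddit_name)
# Walk the top-level comments of up to five submissions currently rising in the subreddit.
for h in subreddit.rising(limit=5):
    for c in h.comments:
        pass  # do stuff with each comment c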

See the code for more details.
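One thing the loop above could do with each comment is pair it with its direct replies, which matches the comment/reply shape of the training data. The sketch below is an assumption about a possible use, not the repository's code, and the function name is hypothetical. It relies only on the body and replies attributes that praw comments expose, and skips placeholder objects such as praw's MoreComments, which have no body. Calling h.comments.replace_more(limit=0) on a submission beforehand removes those placeholders entirely.

```python
from types import SimpleNamespace


def iter_comment_reply_pairs(comments):
    """Yield (comment_body, reply_body) for every direct reply in a comment tree."""
    for comment in comments:
        # Placeholders for unloaded comments carry no text, so skip them.
        if not hasattr(comment, "body"):
            continue
        for reply in comment.replies:
            if hasattr(reply, "body"):
                yield comment.body, reply.body
        # Descend so that replies to replies are paired as well.
        yield from iter_comment_reply_pairs(comment.replies)


if __name__ == "__main__":
    # Stand-in objects with the same attributes, so the sketch runs without Reddit access.
    leaf = SimpleNamespace(body="Thanks!", replies=[])
    reply = SimpleNamespace(body="Copy it into a document.", replies=[leaf])
    placeholder = SimpleNamespace(replies=[])
    top = SimpleNamespace(body="Can this be saved?", replies=[reply, placeholder])
    for pair in iter_comment_reply_pairs([top]):
        print(pair)
```

With real data, the argument would be h.comments from the loop above.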

