Is that supposed to be file.txt, and if so, can you provide a sample file? Perhaps the one used for the 1000 word model. I just want to know how to set up the raw text: is it just a dump of text, or does it need to be formatted in any way?
Almost any plain text file will work. Basically, you'll have a bunch of sentences in a text file without formatting (i.e., you'll want to avoid markdown, HTML, etc.). For example, the following is a good sample (note: you'll need a lot more text for good training):
Wikipedia is a multilingual online encyclopedia created and maintained as an open collaboration project by a community of volunteer editors using a wiki-based editing system. It is the largest and most popular general reference work on the World Wide Web. It is also one of the 15 most popular websites ranked by Alexa, as of June 2020. It features exclusively free content and no commercial ads and is owned and supported by the Wikimedia Foundation, a non-profit organization funded primarily through donations.
Wikipedia was launched on January 15, 2001, and was created by Jimmy Wales and Larry Sanger. Sanger coined its name as a portmanteau of the words "wiki" (Hawaiian for "quick") and "encyclopedia". Initially an English-language encyclopedia, versions of Wikipedia in other languages were quickly developed. With 6.1 million articles, the English Wikipedia is the largest of the more than 300 Wikipedia encyclopedias. Overall, Wikipedia comprises more than 54 million articles attracting 1.5 billion unique visitors per month.
The script will handle tokenizing that text into sentences and words. Document boundaries do not matter for this type of training, only sentences and words do, so you can put all of the text you want into one giant file. I find one giant file hard to manage, so you could instead keep a bunch of plain text files in a directory, and the script will work with all the files in that directory (or with a list of files).
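To make the expected input shape concrete, here is a rough sketch of what "plain text in, sentences and words out" can look like. It is not the script's actual tokenizer: the function names (`read_corpus`, `split_sentences`, `tokenize_words`, `corpus_to_sentences`), the `*.txt` file pattern, and the regex-based splitting rules are all assumptions made for illustration.

```python
import re
from pathlib import Path


def read_corpus(source):
    """Return the text of one file, a directory of .txt files, or a list of files."""
    if isinstance(source, (str, Path)):
        path = Path(source)
        # Assumption: files in a directory are picked up by the .txt extension.
        paths = sorted(path.glob("*.txt")) if path.is_dir() else [path]
    else:
        paths = [Path(p) for p in source]
    # File boundaries are irrelevant for training, so just concatenate everything.
    return "\n".join(p.read_text(encoding="utf-8") for p in paths)


def split_sentences(text):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize_words(sentence):
    """Lowercase the sentence and keep runs of letters/digits (plus simple contractions)."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", sentence.lower())


def corpus_to_sentences(source):
    """Turn raw text into a list of sentences, each a list of word tokens."""
    tokenized = (tokenize_words(s) for s in split_sentences(read_corpus(source)))
    return [words for words in tokenized if words]
```

Calling `corpus_to_sentences("file.txt")`, `corpus_to_sentences("my_text_dir")`, or `corpus_to_sentences(["a.txt", "b.txt"])` would each give a list of word lists. The splitting here is deliberately simple (abbreviations such as "Dr." would be cut too early), which is fine for seeing how the raw text should look but not a substitute for a real tokenizer.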