Parses raw Twitter JSON from stdin using Python. I'm only extracting a few fields for quick processing in Pig. Still a lot of work to do. Currently, it extracts the id, timestamp, client program, author, and tweet text. I'll add more fields, such as geo, if requested. The filenames for the output and bad tweets are currently hardcoded for my testing.…
beatgeek/tweetParser
twitter parser
2010 18Data
license license license....

Accepts Twitter JSON from stdin and extracts the tweet id, username, timestamp, client used, and tweet text. The output is kept lightweight for performance reasons and so that it can be processed quickly in map/reduce environments such as Apache Pig.

Big to-do: override the default filenames for the output file and bad file.
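The extraction described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names (`extract_fields`, `to_tsv`, `parse_stream`) are hypothetical, the JSON keys (`id`, `created_at`, `source`, `user.screen_name`, `text`) assume the classic Twitter API tweet layout, and tab-separated output is assumed because it loads easily into Pig.

```python
import json


def extract_fields(line):
    """Return (id, timestamp, client, author, text), or None for a bad line.

    The key names below are an assumption based on the classic Twitter
    API tweet layout; adjust them if the input differs.
    """
    try:
        tweet = json.loads(line)
        return (
            tweet["id"],
            tweet["created_at"],
            tweet["source"],
            tweet["user"]["screen_name"],
            tweet["text"],
        )
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, a missing field, or a non-object payload
        return None


def to_tsv(fields):
    # Collapse runs of whitespace (including tabs and newlines) in each
    # field so that every tweet stays on one tab-separated line.
    return "\t".join(" ".join(str(f).split()) for f in fields)


def parse_stream(lines, out, bad):
    """Write parsed tweets to `out` and unparseable lines to `bad`.

    `lines` is any iterable of strings (e.g. sys.stdin); `out` and `bad`
    are writable file-like objects. Returns (good_count, bad_count).
    """
    good_count = bad_count = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        fields = extract_fields(line)
        if fields is None:
            bad.write(line.rstrip("\n") + "\n")
            bad_count += 1
        else:
            out.write(to_tsv(fields) + "\n")
            good_count += 1
    return good_count, bad_count
```

A script built on this sketch could call `parse_stream(sys.stdin, out_file, bad_file)` with two files opened for writing. Taking the paths from the command line instead of hardcoding them would be one way to address the to-do item.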