If you came here from the CHUG, you might want to see the slide deck here.

#Introduction

This is an illustrative project that demonstrates an NLP algorithm implemented in Hadoop MapReduce that is simple, yet not quite as simple as Word Count. It also demonstrates my opinions on unit testing MapReduce jobs.
#Description

The project takes a dataset of around 1,400 speeches by Senators, along with an ideal-point mapping of those politicians onto the number line based on the work of Simon Jackman, and does the following:
- Find partition points separating Conservative, Moderate, and Liberal senators by dividing the probability density into even thirds (see the histogram)
- Use this partition to construct three corpora associated with Conservative, Moderate, and Liberal senators
- Preprocess with a Porter stemmer and rank the tokenized terms by inverse document frequency (IDF); a sketch of this step follows the list
- Take the ranked terms and output the terms that are important in the corpus associated with one political orientation but not the others
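To make the IDF step concrete, here is a minimal MapReduce sketch. This is not the project's actual code: the class names are hypothetical, the document count is hard-coded for brevity, and the Porter stemming is only noted in a comment.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch: emit (term, speechId) pairs, then compute
// IDF(term) = log(N / df(term)) in the reducer, where N is the number
// of speeches and df is the number of speeches containing the term.
public class IdfSketch {

  public static class TermMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text speech, Context ctx)
        throws IOException, InterruptedException {
      String speechId = Long.toString(offset.get()); // stand-in doc id
      for (String token : speech.toString().toLowerCase().split("\\W+")) {
        if (!token.isEmpty()) {
          // The real pipeline would run the Porter stemmer here.
          ctx.write(new Text(token), new Text(speechId));
        }
      }
    }
  }

  public static class IdfReducer
      extends Reducer<Text, Text, Text, DoubleWritable> {
    // In a real job this would come from the configuration rather
    // than being hard-coded.
    private static final double TOTAL_SPEECHES = 1400.0;

    @Override
    protected void reduce(Text term, Iterable<Text> speechIds, Context ctx)
        throws IOException, InterruptedException {
      Set<String> distinct = new HashSet<String>();
      for (Text id : speechIds) {
        distinct.add(id.toString());
      }
      double idf = Math.log(TOTAL_SPEECHES / distinct.size());
      ctx.write(term, new DoubleWritable(idf));
    }
  }
}
```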
##Partition

The probability space is partitioned into even thirds, as can be seen here:
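In code, the two cut points could be derived roughly as follows. This is a minimal sketch assuming the ideal points are already in memory; the class and method names are hypothetical, not part of the project.

```java
import java.util.Arrays;

// Hypothetical sketch: split the empirical distribution of ideal
// points into even thirds by sorting and reading off the values at
// the one-third and two-thirds positions.
public class PartitionSketch {
  public static double[] tercileCutPoints(double[] idealPoints) {
    double[] sorted = idealPoints.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    // Senators below the first cut fall in one wing, senators above
    // the second cut in the other, and the middle third is Moderate;
    // which wing is Liberal depends on the sign convention of the scale.
    return new double[] { sorted[n / 3], sorted[2 * n / 3] };
  }
}
```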
##Output

The results, for the curious, are here:
| LIBERAL      | MODERATE      | CONSERVATIVE  |
|--------------|---------------|---------------|
| market       | produce       | discuss       |
| power        | commission    | past          |
| institutions | strengthening | politics      |
| initiate     | provisions    | pointed       |
| implement    | subcommittee  | direct        |
| establish    | impact        | different     |
| history      | north         | account       |
| available    | managers      | failed        |
| school       | sure          | debate        |
| according    | hard          | instead       |
| risk         | sent          | rates         |
| measures     | defense       | reason        |
| children     | fiscal        | congressional |
| expand       | honorable     | budgeting     |
| trained      | ability       | question      |
#Current Caveats
- A naive split of the probability space into even thirds may be unsuitable, but I did not know of a better, politically agnostic way to do it.
- An analysis based on n-grams, or at least a more intelligent chunking algorithm, would be more appropriate, so that terms like "north korea" are not split; see the bigram sketch after this list.
- This is not as efficient as it could be. I could have used term IDs instead of the actual terms, but I wanted the MapReduce job to be as clear as possible.
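As a rough illustration of the n-gram idea, a bigram pass over each speech might look like the following. This is a hypothetical sketch, not something the project implements.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: emit adjacent token pairs (bigrams) so that
// phrases like "north korea" survive tokenization as a single term.
public class BigramSketch {
  public static List<String> bigrams(String speech) {
    String[] tokens = speech.toLowerCase().split("\\W+");
    List<String> out = new ArrayList<String>();
    for (int i = 0; i + 1 < tokens.length; i++) {
      // Skip the empty leading token split() can produce.
      if (!tokens[i].isEmpty() && !tokens[i + 1].isEmpty()) {
        out.add(tokens[i] + " " + tokens[i + 1]);
      }
    }
    return out;
  }
}
```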
This code is mostly an intellectual lark and a demonstration of NLP done using MapReduce that does not require the baggage of Mahout.
#Usage

##Prerequisites

To execute this, you must have:
- A JDK installed
- Maven 2 or higher installed
To generate the histogram, you need R. To generate the presentation from the Cleveland CHUG, you need LaTeX.
##To generate the full example

From the command line, in the political-nlp-analysis directory, execute:

    mvn integration-test
You can then use the political-nlp-analysis/generate_top_words.sh script to generate the top words for a particular political orientation, like so:

    generate_top_words.sh {LIBERAL, MODERATE, CONSERVATIVE} <number of terms>
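For example, to pull the top fifteen terms for the liberal corpus (assuming the integration test above has already produced its output):

    generate_top_words.sh LIBERAL 15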
You can reconstruct the density plot using the political-nlp-analysis/src/main/R/generate_histogram.sh script:

    generate_histogram.sh <path to ideal_points.csv>
#Contact
If anyone has comments, concerns, or criticisms, please let me hear about them at [email protected].