Adaptive-Crawler

This Twitter adaptive crawler is based on the correlation between the traffic patterns of hashtags. An enhanced version can be found in another project, CETRE.

The full details are available in: Xinyue Wang, Laurissa Tokarchuk, Félix Cuadrado, and Stefan Poslad. 2013. Exploiting hashtags for adaptive microblog crawling. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM '13). DOI=10.1145/2492517.2492624

Motivation

A Twitter crawler is software that filters a large volume of Twitter data (tweets) to select the ones of interest, i.e., those that match a set of hashtags and key terms used as search criteria. One important application is detecting unplanned events or trends of mass interest. However, most Twitter crawlers use a set of predefined keywords, which is often highly subjective and can easily lead to incomplete data: related terms that are not predefined, but that become key terms and hashtags as the event unfolds, cannot be used as search criteria. Even with expert knowledge, new keywords and specialised hashtags often arise in the midst of such events. Another issue is that identifying events and trends requires analysing a large collection of tweets, yet free access to Twitter data is rate limited, so we can typically access only 1% of the available tweets. This greatly aggravates the effect of a limited set of search terms and makes it far less likely that unforeseen events or trends are detected reliably.

We have developed software (in Java) that automatically generates a better, more comprehensive set of search terms, without requiring manual modification of the search terms, by correlating the traffic patterns of new keywords against those of the predefined keywords. We validated the Twitter crawler on the 2012 Olympics and the 2013 Glastonbury (UK) music festival. The approach brings in a high volume of additional traffic for the event of interest in real time.
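The idea of correlating traffic patterns can be illustrated with a minimal sketch. It assumes that traffic is counted per sample time slot and that similarity is measured with Pearson's correlation coefficient; the class name, the method names, and the threshold value are hypothetical, and the measure and acceptance rule actually used by the crawler are described in the paper and may differ.

```java
import java.util.Arrays;

/** Sketch only: compares per-slot tweet counts of a candidate hashtag with a base keyword. */
class HashtagCorrelationSketch {

    /**
     * Returns Pearson's r for two equally long count series.
     * Returns 0.0 when either series is constant, because r is undefined there.
     */
    static double pearson(long[] a, long[] b) {
        if (a.length != b.length || a.length < 2) {
            throw new IllegalArgumentException("series must have equal length >= 2");
        }
        double meanA = Arrays.stream(a).average().orElse(0.0);
        double meanB = Arrays.stream(b).average().orElse(0.0);
        double cov = 0.0, varA = 0.0, varB = 0.0;
        for (int i = 0; i < a.length; i++) {
            double da = a[i] - meanA;
            double db = b[i] - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        if (varA == 0.0 || varB == 0.0) {
            return 0.0;
        }
        return cov / Math.sqrt(varA * varB);
    }

    /** Hypothetical rule: adopt the candidate hashtag when r reaches the threshold. */
    static boolean isCorrelated(long[] base, long[] candidate, double threshold) {
        return pearson(base, candidate) >= threshold;
    }

    public static void main(String[] args) {
        long[] base = {10, 20, 30, 40, 50};      // counts of a predefined keyword per slot
        long[] candidate = {1, 2, 3, 4, 5};      // counts of a newly seen hashtag per slot
        System.out.println(isCorrelated(base, candidate, 0.8)); // 0.8 is an assumed threshold
    }
}
```

A candidate whose counts rise and fall together with the base keyword would be added to the search terms under this rule, while one with an unrelated pattern would be ignored.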

Run the code

### Dependencies

In order to run the program, the machine must have the following tools/jars:

Java
MySQL
MySQL JDBC driver
Twitter4j 3.0.3

### Accounts

Please change the following credentials in crawler/util/Settings.java

Twitter account
- Consumer Key -> ConsumerKey
- Consumer Secret -> ConsumerSecret
- Access Token -> AccessToken
- Access Token Secret -> AccessSecret
MySQL database
- host name -> HOSTNAME
- user name -> USER
- password -> PWD
- database name -> databaseName

### Input parameters

Parameters are initialized in crawler/util/Settings.java

Changeable from the command line
- initial keywords -> baseKeywords
- time frame -> timer
- sample time slot -> sample
Changeable through a text file
- blacklist: the blacklist can be modified while the crawler is running, but its entries must follow the format "#keys"
Others: please see the file
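As an illustration of the blacklist format, the following sketch reads blacklist entries from a text file. It assumes one entry per line, each starting with "#", and it lower-cases entries for comparison; the class and method names are hypothetical, and the crawler's own parsing may differ.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Sketch only: parses blacklist entries of the form "#key", one per line (assumed layout). */
class BlacklistSketch {

    /** Keeps trimmed lines that start with '#' and have at least one more character. */
    static Set<String> parse(List<String> lines) {
        Set<String> keys = new LinkedHashSet<>();
        for (String line : lines) {
            String key = line.trim();
            if (key.length() > 1 && key.startsWith("#")) {
                // Lower-casing is an assumption, made so that "#Ads" and "#ads" match.
                keys.add(key.toLowerCase(Locale.ROOT));
            }
        }
        return keys;
    }

    /** Re-reads the file, so edits made while the crawler runs are picked up on the next call. */
    static Set<String> load(Path file) throws IOException {
        return parse(Files.readAllLines(file));
    }
}
```

Calling `load` at the start of every time frame would be one way to honour changes made to the blacklist during crawling.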

### Outputs

All outputs are named with a prefix that indicates the date and time at which the crawler was started. For example, if the crawler is started at 12:00 on 30th June, the prefix will be "T06301200".
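The prefix in the example is consistent with the pattern "T" followed by month, day, hour, and minute. A minimal sketch of that naming scheme, assuming this pattern and using `java.time` (the class and method names are hypothetical, and the crawler may build the prefix differently):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Sketch only: derives output names from the crawler start time. */
class OutputPrefixSketch {

    // Assumed pattern: literal 'T', then month, day, hour (24h), and minute.
    private static final DateTimeFormatter PREFIX_FORMAT =
            DateTimeFormatter.ofPattern("'T'MMddHHmm");

    static String prefixFor(LocalDateTime start) {
        return start.format(PREFIX_FORMAT);
    }

    static String keywordListName(LocalDateTime start) {
        return prefixFor(start) + "KeywordList.txt";
    }

    static String blackListName(LocalDateTime start) {
        return prefixFor(start) + "BlackList.txt";
    }

    static String tableName(LocalDateTime start) {
        return prefixFor(start) + "COR";
    }

    public static void main(String[] args) {
        // The year is arbitrary here; the prefix does not include it.
        LocalDateTime start = LocalDateTime.of(2013, 6, 30, 12, 0);
        System.out.println(prefixFor(start));
        System.out.println(tableName(start));
    }
}
```

Because the year is not part of the prefix, runs started at the same minute of the same day in different years would share output names.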

Keywords List: a txt file recording all the keywords, generated under the KeyWord folder with the name [prefix]KeywordList.txt
Black List: a txt file recording all the blacklisted keys, named [prefix]BlackList.txt (this can be modified during the crawling)
MySQL table: a table storing all the collected tweets, named [prefix]COR
- MySQL table format:
```
pid bigint(50) NOT NULL,
createdAt text DEFAULT NULL,
geoLocationLat double NOT NULL,
geoLocationLong double NOT NULL,
placeInfo text,
id bigint(50) NOT NULL,
tweet longtext CHARACTER SET utf8,
source text CHARACTER SET utf8,
lang text,
screenName VARCHAR(150),
replyTo text,
rtCount bigint(50),
hashtags text,
PRIMARY KEY (pid)
```
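For readers who want to create a compatible table by hand, the column list above can be wrapped in a `CREATE TABLE` statement. The sketch below only builds the SQL string; the class name, the method name, and the table-name check are assumptions, and the crawler's own table creation code may differ.

```java
/** Sketch only: builds a CREATE TABLE statement from the documented column list. */
class TweetTableSketch {

    // Column definitions copied from the table format above.
    private static final String[] COLUMNS = {
        "pid bigint(50) NOT NULL",
        "createdAt text DEFAULT NULL",
        "geoLocationLat double NOT NULL",
        "geoLocationLong double NOT NULL",
        "placeInfo text",
        "id bigint(50) NOT NULL",
        "tweet longtext CHARACTER SET utf8",
        "source text CHARACTER SET utf8",
        "lang text",
        "screenName VARCHAR(150)",
        "replyTo text",
        "rtCount bigint(50)",
        "hashtags text",
        "PRIMARY KEY (pid)"
    };

    /** Returns the statement for a table such as "T06301200COR". */
    static String createTableSql(String tableName) {
        // Table names cannot be bound as JDBC parameters, so restrict the characters allowed.
        if (!tableName.matches("[A-Za-z0-9_]+")) {
            throw new IllegalArgumentException("unexpected table name: " + tableName);
        }
        return "CREATE TABLE IF NOT EXISTS `" + tableName + "` ("
                + String.join(", ", COLUMNS) + ")";
    }

    public static void main(String[] args) {
        System.out.println(createTableSql("T06301200COR"));
    }
}
```

The resulting string could be passed to `java.sql.Statement.executeUpdate` on a connection opened with the MySQL settings described under Accounts.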

### Entrance

The main method is in the file crawler/TwitterCrawler.java

Pros and Cons

Correlation

This version of the adaptive crawler tries to identify new keywords that refer to the event of interest. However, its performance is sometimes unstable: it can generate new noisy terms, which worsen the detection of related tweets. This happens because 1) people sometimes include hashtags that are not really relevant to the event; 2) the rate limits on free Twitter API access disturb the traffic pattern; and 3) the traffic patterns of some irrelevant hashtags occasionally show a linear relationship with those of the predefined set. The issue is especially apparent when the event becomes a trending one, and it is difficult for this version of the crawler to return to its normal state afterwards. In addition, events that produce few new hashtags are not the target application. As a result, this RKwA is strictly suited to medium-traffic events, not tranquil ones. Currently, we are working on a content-similarity-based adaptive crawler. It is intended to work for any kind of event and to achieve good accuracy, at the cost of a minor delay.
