-
Download the most recent Kafka distribution (0.8.2.0)
-
Extract the tarball somewhere convenient:
- tar zxf kafka_2.10-0.8.2.0.tgz
-
Step into the directory 'kafka_2.10-0.8.2.0'
-
Start a ZooKeeper instance:
- bin/zookeeper-server-start.sh config/zookeeper.properties
-
Open a new terminal and start the first Kafka broker in the cluster:
- bin/kafka-server-start.sh config/server.properties
-
Open a new terminal window and make a copy of the configuration file for the second Kafka broker in the cluster:
- cp config/server.properties config/server-1.properties
-
Adjust the configuration for the second broker to avoid broker id, port, and log directory clashes:
- Change 'broker.id' to 1, 'port' to 9093, and 'log.dir' to '/tmp/kafka-logs-1' in config/server-1.properties
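The three edits can also be scripted. Here is a minimal sketch run against a stand-in config containing only the three affected keys (the real config/server.properties ships with many more settings):

```shell
# Stand-in for config/server.properties with only the three clashing keys
# (placeholder file -- the real config has many more entries).
mkdir -p config
printf 'broker.id=0\nport=9092\nlog.dir=/tmp/kafka-logs\n' > config/server.properties

# Derive the second broker's config by rewriting those three keys.
sed -e 's|^broker.id=0$|broker.id=1|' \
    -e 's|^port=9092$|port=9093|' \
    -e 's|^log.dir=/tmp/kafka-logs$|log.dir=/tmp/kafka-logs-1|' \
    config/server.properties > config/server-1.properties

cat config/server-1.properties
```

The same substitutions with 2, 9094, and kafka-logs-2 produce config/server-2.properties for the third broker.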
-
Start the second Kafka broker in the cluster:
- bin/kafka-server-start.sh config/server-1.properties
-
Open a new terminal window and make a copy of the configuration file for the third Kafka broker in the cluster:
- cp config/server.properties config/server-2.properties
-
Adjust the configuration for the third broker to avoid broker id, port, and log directory clashes:
- Change 'broker.id' to 2, 'port' to 9094, and 'log.dir' to '/tmp/kafka-logs-2' in config/server-2.properties
-
Start the third Kafka broker in the cluster:
- bin/kafka-server-start.sh config/server-2.properties
-
Open a new terminal window and create a new 'github' topic that will be used by producers and consumers of GitHub events:
- bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic github
-
Verify that the topic was created successfully:
-
- bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic github
Topic:github    PartitionCount:1    ReplicationFactor:3    Configs:
    Topic: github    Partition: 0    Leader: 1    Replicas: 1,2,0    Isr: 1,2,0
-
-
We should now have a Kafka cluster with three brokers that looks something like this:

            / broker-0
  zookeeper - broker-1
            \ broker-2
-
Check out the kafka-feeder repository from GitHub:
-
We'll be working with githubarchive.org data. Create a directory for the data and step into it:
- mkdir ~/githubdata && cd ~/githubdata
-
Download a dataset from githubarchive.org that represents one week's worth of events on all public repositories:
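One way to sketch this: the hourly archives follow the URL pattern documented on githubarchive.org ('http://data.githubarchive.org/YYYY-MM-DD-H.json.gz'), so the week's URLs can be generated and fed to a downloader. The dates below are placeholders; pick whichever week you want:

```shell
# Print the 168 hourly archive URLs for the first week of January 2015
# (URL pattern per githubarchive.org; the chosen week is arbitrary).
for day in 01 02 03 04 05 06 07; do
  for hour in $(seq 0 23); do
    echo "http://data.githubarchive.org/2015-01-$day-$hour.json.gz"
  done
done > urls.txt

wc -l < urls.txt   # 168 lines, one per hour of the week
```

Feeding urls.txt to 'xargs -n1 wget -q' performs the actual download.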
-
We'll receive one gzipped file per hour of public GitHub activity, i.e. 168 files. Extract them:
- ls -1 | xargs gunzip
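To see what the extraction step does without downloading the full dataset, here is a self-contained sketch that fakes one day of hourly archives (the file names and '{}' payloads are placeholders, not real githubarchive data):

```shell
# Fake one day's worth of hourly archives with placeholder content.
mkdir -p githubdata && cd githubdata
for hour in $(seq 0 23); do
  echo '{}' | gzip > "2015-01-01-$hour.json.gz"
done

# Same extraction as above: decompress every file in the directory.
ls -1 | xargs gunzip

ls -1 *.json | wc -l   # 24 files for one day; a full week yields 168
```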
-
Step into the kafka-feeder project directory and start sbt:
-
Run the KafkaFeeder, which will start publishing (producing) events to the 'github' topic:
- run "localhost:9092,localhost:9093,localhost:9094" "github" "/path/to/githubdata" "10000"
  (arguments: broker list, topic name, data directory, pause between files in milliseconds)
-
You should see output along the lines of:
Processing file '2015-01-01-0.json', pushing events to topic 'github'
Processed file 2015-01-01-0.json, sleeping for 10000 milliseconds ...
- Step into the kafka directory
- Consume the stream of events being pushed to the 'github' topic:
- bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic github --from-beginning
- Make sure that Kafka only binds to localhost
- Uncomment '#host.name=localhost' in config/server-*.properties
- Enable the deletion of topics by adding the following entry to config/server-*.properties
- delete.topic.enable=true
- Delete a topic from Kafka:
- bin/kafka-topics.sh --delete --zookeeper localhost:2181 --topic topic-name