# Logstash Connector POC

This repo is being used to prove out Logstash and its ability to assist in replicating an RDS database in Elasticsearch.

## Installations

### Development Setup

- Set up a Python 3 virtual environment and install dependencies (a sketch follows this list)
- Store a dev.env file at the root of your project with the configurations listed under Environment Configuration below
- Configure an alias in your shell profile so that logstash executes the Logstash program on your machine. Example: alias logstash=/usr/local/bin/logstash
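
A minimal setup sketch, assuming dependencies are pinned in a requirements.txt at the repo root (that file name is an assumption) and Logstash is installed at /usr/local/bin/logstash:

```sh
# Create and activate a Python 3 virtual environment
python3 -m venv venv
source venv/bin/activate

# Install project dependencies (assumes a requirements.txt at the repo root)
pip install -r requirements.txt

# Make `logstash` resolve to the local install (add this to your shell profile)
alias logstash=/usr/local/bin/logstash
```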

### Environment Configuration

You'll need to create a local dev.env file with the following configurations (a filled-in example follows the template). This file is used to populate your environment when calling app/config.py explicitly or when running run_logstash.sh.

```sh
ELASTICSEARCH_URL=""
RDS_PASSWORD=""
RDS_USERNAME=""
JDBC_DRIVER_LIBRARY=""
JDBC_CONNECTION_STRING=""
SQL_DB_NAME=""
JDBC_DRIVER_CLASS=""
```
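
For reference, a hypothetical dev.env for a Postgres-flavored RDS instance might look like this (every value below is a placeholder, not a real endpoint or credential):

```sh
ELASTICSEARCH_URL="http://localhost:9200"
RDS_PASSWORD="example-password"
RDS_USERNAME="example_user"
JDBC_DRIVER_LIBRARY="/path/to/postgresql-42.2.5.jar"
JDBC_CONNECTION_STRING="jdbc:postgresql://your-rds-host:5432/your_db"
SQL_DB_NAME="your_db"
JDBC_DRIVER_CLASS="org.postgresql.Driver"
```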

## Running the Logstash Project

The entrypoint for this project is run_logstash.sh. If you step through the file, you will notice it does a few things (a sketch of the core steps follows the list):

- python app/config.py is called, which parses your dev.env file and produces a dev_env.sh file that exports all your variables to your primary shell process.
- source dev_env.sh sources the aforementioned file.
- Logstash's -r flag watches for changes in the .conf file that configures the pipeline. This is useful for debugging purposes.
- The -f flag indicates that the next argument is the path to the configuration file.
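
Put together, the script likely boils down to something like the following sketch (the pipeline path here is an assumption; substitute whichever .conf file you are running):

```sh
#!/bin/bash
# Parse dev.env and generate dev_env.sh
python app/config.py

# Export the generated variables into the current shell
source dev_env.sh

# -r reloads the pipeline when the .conf changes; -f points at the pipeline config
logstash -r -f pipeline.conf
```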

Things to note:

- If running a pipeline whose output is an Elasticsearch instance, make sure Elasticsearch is up and running, for example with the health check below.
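
A quick way to verify, assuming ELASTICSEARCH_URL has already been exported into your shell:

```sh
# Should return a cluster status of "green" or "yellow"
curl -s "${ELASTICSEARCH_URL}/_cluster/health?pretty"
```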

## Python Elasticsearch Interface

This is the primary manual interface for interacting with Elasticsearch.

- Mappings for each index are created in index_settings.py
- Calling python app/interface.py will instantiate the mappings in the Elasticsearch instance for each index defined in index_settings.py (the equivalent REST call is sketched below)
- search.py will eventually be used to query against the Elasticsearch instance
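
The Python interface is the source of truth for mappings, but for illustration, creating an index with a mapping reduces to a REST call like the following. The index name indexs and the task_id field come from the pipeline configuration shown further down; the mapping body itself is a hypothetical sketch, assuming Elasticsearch 7+:

```sh
curl -X PUT "${ELASTICSEARCH_URL}/indexs" \
  -H 'Content-Type: application/json' \
  -d '{"mappings": {"properties": {"task_id": {"type": "keyword"}}}}'
```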

## Pipeline Configurations

Each ls_conf file is a different pipeline configuration for use with Logstash.

Each Logstash .conf file is made up of three parts: input, filter, and output. The *_dev.conf files are a workaround in which the output dumps the results to a JSON file for quicker development, removing the need to query an Elasticsearch index; a sketch of such an output follows.
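
A *_dev.conf output along those lines might look like this, using Logstash's file output plugin (the path is a placeholder):

```
output {
  # Dump each event as a line of JSON to a local file instead of Elasticsearch
  file {
    path => "output/dev_results.json"
    codec => "json_lines"
  }
}
```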

**input**

- used to establish a connection with a database
- establishes a secure connection with the database via JDBC (Java Database Connectivity)
- runs a query against the database (queries are stored in sql_scripts)

**filter**

- parses and transforms the results of the input query and prepares them for delivery to the output

**output**

- establishes a connection with the outbound data store (here, an Elasticsearch instance)
- there are also various textual output streams that can be used

For more detail, check out the [logstash] docs.

Note the inline comments in the example below for more information.

```
input {
  # Connection via the jdbc input plugin. Each ${var} is picked up from the shell environment.
  jdbc {
    jdbc_driver_library => "${JDBC_DRIVER_LIBRARY}"
    jdbc_connection_string => "${JDBC_CONNECTION_STRING}"
    jdbc_driver_class => "${JDBC_DRIVER_CLASS}"
    jdbc_user => "${RDS_USERNAME}"
    jdbc_password => "${RDS_PASSWORD}"
    # Path to the SQL query to be executed and consequently mapped to Elasticsearch
    statement_filepath => "sql_scripts/update_index.sql"
  }
}

filter {
  # Assumes a jsonified RDS query response from the input stage
  json {
    id => "id"
    source => "rows"
    remove_field => "rows"
  }
}

output {
  # ES configuration. Notice there are no mappings here; mappings are configured by the Python interface as of now.
  elasticsearch {
    document_id => "%{task_id}"
    document_type => "index"
    index => "indexs"
    codec => "json"
    hosts => "${ELASTICSEARCH_URL}"
  }
}
```

## Other Resources

[logstash]: https://www.elastic.co/guide/en/logstash/current/index.html