# flink-connector-jdbc-elasticsearch-dialect

This module contains the Flink JDBC dialect for Elasticsearch.

## Elasticsearch Catalog

This is an implementation of
a [Flink Catalog](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/catalogs/)
for [Elastic](https://www.elastic.co/).

---

### Possible Operations

- `listDatabases` Lists databases in a catalog.
- `databaseExists` Checks whether a database exists.
- `listTables` Lists tables in a database.
- `tableExists` Checks whether a table exists.
- `getTable` Gets the metadata of a table, consisting of the table schema and table properties. Among others, the
  table properties contain the `CONNECTOR`, `BASE_URL`, `TABLE_NAME` and `SCAN_PARTITION` options.

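The operations above back the standard Flink SQL catalog statements. A hypothetical session could look like the
following (the catalog name `elastic` and table name `my_index` are illustrations, not defaults of this module):

```sql
-- Switch to the registered Elasticsearch catalog (name is an assumption):
USE CATALOG elastic;
SHOW DATABASES;     -- backed by listDatabases
SHOW TABLES;        -- backed by listTables
DESCRIBE my_index;  -- backed by getTable (schema and table properties)
```
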
---

### Scan options

If we want tables in a catalog to be partitioned by a column, we should specify scan options.
[Scan options](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/table/jdbc/#scan-partition-column:~:text=than%201%20second.-,scan.partition.column,-optional)
can be set while defining a catalog.

There are two types of scan options for the Elastic Catalog:

#### Default scan options for a catalog

We can specify default partitioning options for all tables in a catalog. If no table-specific options are provided,
these options select the column used for partitioning, and the number of partitions for each table is calculated from
the catalog default partition size.

- `catalog.default.scan.partition.column.name` The column to use for table partitioning by default. This default
  applies to all tables in the catalog; the partitioning column of an individual table can be overridden with
  table-specific scan options.
- `catalog.default.scan.partition.size` The maximum number of elements to place in a single partition. The number of
  partitions is calculated from the number of elements in the table and this partition size. If a particular table
  should have an exact number of partitions, that number can be specified with table-specific scan options.
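
For example, a catalog definition might carry defaults like these (the column name `timestamp` and the size value are
only illustrations):

```properties
catalog.default.scan.partition.column.name=timestamp
catalog.default.scan.partition.size=100000
```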

#### Table specific scan options

These options are useful when not all tables in a catalog should be partitioned in the same way. Here we can specify
partitioning options for selected tables.

- `properties.scan.{tablename}.partition.column.name` The name of the column to use for partitioning the table.
  Corresponds to the `scan.partition.column` option.
- `properties.scan.{tablename}.partition.number` The number of partitions for the table. Corresponds to the
  `scan.partition.num` option.

In both options, replace `{tablename}` with the name of the table that the options should apply to. These options can
be provided for multiple tables.
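
For instance, for a hypothetical table named `orders` (the table and column names are illustrations), the substituted
options would read:

```properties
properties.scan.orders.partition.column.name=order_date
properties.scan.orders.partition.number=4
```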

#### Index patterns

If we specify an index pattern, a Flink table is created in the catalog that, instead of targeting a single index in
Elastic, targets all indexes matching the provided pattern. This is useful when we want to write Flink SQL that reads
similar data from many similar tables instead of a single one.
The resulting Flink table contains all columns found in the matching tables and uses all the data from those tables.
The table has the same name as the pattern.

- `properties.index.patterns` The patterns for which Flink tables should be created. Multiple index patterns can be
  specified by separating them with a comma `,`.

Flink tables created this way can be partitioned just like other Flink tables, by providing catalog default scan
options or table-specific scan options.
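
As an illustration, two pattern-based tables could be declared as follows (the pattern values are invented for this
example):

```properties
properties.index.patterns=logs-*,metrics-*
```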

#### Time attributes

It is possible to add a `proctime` column to each catalog table:

```properties
catalog.add-proctime-column=true
```
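
With the `proctime` column enabled, catalog tables can take part in processing-time operations. A hypothetical
windowed aggregation (the table name `my_index` is an assumption; `TUMBLE` is the standard Flink windowing TVF) might
look like:

```sql
-- Count rows per one-minute processing-time window (table name is an illustration):
SELECT window_start, COUNT(*) AS cnt
FROM TABLE(TUMBLE(TABLE my_index, DESCRIPTOR(proctime), INTERVAL '1' MINUTES))
GROUP BY window_start;
```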

---

### Rules for overwriting catalog scan options

#### No scan options were provided

It is not necessary to provide either default catalog scan options or table-specific scan options. If no scan options
are provided, no tables in the catalog will be partitioned.

#### Only default scan options for a catalog were provided

If only default catalog scan options were provided, all tables in the catalog are partitioned in a similar way: the
same column name is used for partitioning every table, and the number of partitions for each table depends on its
number of records. All tables have the same maximum number of elements in a partition.
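
The number of partitions implied by a default partition size amounts to a ceiling division. A minimal sketch (the
function name is ours, not part of the connector; treating an empty table as one partition is our assumption):

```python
import math


def partition_count(row_count: int, partition_size: int) -> int:
    """Number of partitions so that no partition exceeds partition_size rows."""
    # Equivalent to ceil(row_count / partition_size); an empty table is
    # assumed to still get a single partition.
    return max(1, math.ceil(row_count / partition_size))


print(partition_count(1_000_000, 100_000))  # 10
print(partition_count(1_000_001, 100_000))  # 11
```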

#### Only table specific scan options were provided

If we want a specific table to be partitioned while leaving the rest of the tables non-partitioned, we have to
provide both table-specific scan options for that table.

#### Both catalog default scan options and table specific scan options were provided

Table-specific scan options take priority over catalog default scan options when deciding how to partition a table.
If both a catalog default partition column name and a table-specific partition column name are specified, the
table-specific partition column name is used.
The same happens when we specify both the catalog default scan partition size and a table-specific partition number:
instead of calculating the number of partitions from the element count, the table will have the number of partitions
provided for it.
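
The precedence rule can be sketched as a simple option-resolution function (the names and structure are ours, not the
connector's actual code):

```python
from typing import Optional


def resolve_partition_column(
    table_column: Optional[str], default_column: Optional[str]
) -> Optional[str]:
    """A table-specific partition column wins over the catalog default."""
    return table_column if table_column is not None else default_column


# The table-specific setting overrides the catalog default...
print(resolve_partition_column("order_date", "created_at"))  # order_date
# ...and the catalog default applies when no table-specific option is set.
print(resolve_partition_column(None, "created_at"))  # created_at
```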

---

### Calculation of scan partition bounds

If a table is partitioned, meaning that catalog default scan options or table-specific scan options were specified,
the upper and lower partition bounds are calculated automatically.
As specified in the Flink documentation, the `properties.scan.{tablename}.partition.column.name` option works for
numeric and temporal data types.
The `scan.partition.lower-bound` is calculated as the lowest value of the partition column in the table.
The `scan.partition.upper-bound` is calculated as the highest value of the partition column in the table.
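
Conceptually, the two bounds are just the minimum and maximum of the partition column. A minimal sketch (the sample
values are invented):

```python
def partition_bounds(values):
    """Lower/upper scan bounds are the min and max of the partition column."""
    return min(values), max(values)


lower, upper = partition_bounds([17, 3, 42, 8])
print(lower, upper)  # 3 42
```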

---

### Note that

For a table to be partitioned, we must provide both a partition column option (catalog default or table-specific) and
a partition number option (the catalog default size or a table-specific number). If only one of the two is provided,
we will receive an error.

---

### Implementation details

`com.getindata.flink.connector.jdbc.elasticsearch.database.catalog.ElasticsearchJdbcCatalogFactory` has been copied
because the default CatalogFactory does not allow passing custom catalog properties.

`com.getindata.flink.connector.jdbc.elasticsearch.database.catalog.CopiedAbstractJdbcCatalog` is a copy of
`org.apache.flink.connector.jdbc.core.database.catalog.AbstractJdbcCatalog` in which the JDBC validation is modified.