Introduce File Extension Filtering Logic for GCP Storage and AWS S3 Connectors (#64)
This PR introduces new filtering logic to the GCP Storage Source and AWS S3 Source connectors, allowing users to include or exclude files based on their extensions during the source file search process. This enhancement provides finer control over which files are processed, improving the flexibility and efficiency of data ingestion.
GCP Storage Source Connector
New Properties:
connect.gcpstorage.source.extension.excludes:
Description: A comma-separated list of file extensions to exclude from the source file search. If this property is not configured, no files are excluded.
Default: null (No filtering is enabled by default; all files are considered)
connect.gcpstorage.source.extension.includes:
Description: A comma-separated list of file extensions to include in the source file search. If this property is not configured, all files are considered.
Default: null (All extensions are included by default)
AWS S3 Source Connector
New Properties:
connect.s3.source.extension.excludes:
Description: A comma-separated list of file extensions to exclude from the source file search. If this property is not configured, no files are excluded.
Default: null (No filtering is enabled by default; all files are considered)
connect.s3.source.extension.includes:
Description: A comma-separated list of file extensions to include in the source file search. If this property is not configured, all files are considered.
Default: null (All extensions are included by default)
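As an illustration of how these properties might be set, here is a minimal sketch that expresses both connectors' settings as Scala maps. The property keys are the ones listed above; the values, the object name, and the choice to write extensions without a leading dot are assumptions made for this example only.

```scala
// Illustrative sketch: keys come from the PR description, values are assumed examples.
object ExtensionFilterConfigSketch {

  // Hypothetical S3 source settings: only CSV/JSON, never temp or backup files.
  val s3SourceProps: Map[String, String] = Map(
    "connect.s3.source.extension.includes" -> "csv,json",
    "connect.s3.source.extension.excludes" -> "tmp,bak"
  )

  // The same idea for the GCP Storage source connector.
  val gcpSourceProps: Map[String, String] = Map(
    "connect.gcpstorage.source.extension.includes" -> "csv,json",
    "connect.gcpstorage.source.extension.excludes" -> "tmp,bak"
  )
}
```

Leaving either key out corresponds to the null default, in which case that filter is not applied.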
How It Works
Include Filtering: If the source.extension.includes property is set, only files with extensions listed in this property will be considered for processing.
Exclude Filtering: If the source.extension.excludes property is set, files with extensions listed in this property will be ignored during processing.
Combined Use: When both properties are set, the connector will only include files that match the includes property and do not match the excludes property.
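The three rules above can be sketched as a single predicate. This is not the connector's actual implementation: the object and method names are hypothetical, and the case-insensitive comparison and the treatment of files without an extension (rejected when an includes list is set, accepted otherwise) are assumptions.

```scala
// Hypothetical sketch of the include/exclude rules; not taken from the connector code.
object ExtensionFilterSketch {

  // Returns the lower-cased text after the last dot of the file name, if any.
  // Only the final path segment is inspected, so dots in directory names are ignored.
  def extensionOf(path: String): Option[String] = {
    val name = path.substring(path.lastIndexOf('/') + 1)
    val dot  = name.lastIndexOf('.')
    if (dot >= 0 && dot < name.length - 1) Some(name.substring(dot + 1).toLowerCase)
    else None
  }

  // None means "property not configured", mirroring the null default.
  def accepts(
    path:     String,
    includes: Option[Set[String]],
    excludes: Option[Set[String]]
  ): Boolean = {
    val ext = extensionOf(path)
    // With no includes list every file passes; otherwise the extension must be listed.
    val included = includes.forall(inc => ext.exists(inc.contains))
    // With no excludes list nothing is rejected; otherwise a listed extension is rejected.
    val excluded = excludes.exists(exc => ext.exists(exc.contains))
    included && !excluded
  }
}
```

Under these assumptions, an extension that appears in both lists is rejected, because a file must match the includes list and not match the excludes list.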
Use Cases:
Inclusion: Users can specify certain file types to process (e.g., .csv, .json), ensuring that only relevant files are ingested.
Exclusion: Users can exclude files with extensions that should not be processed (e.g., temporary files like .tmp or backup files like .bak).
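Since both properties take a comma-separated list, the configured string has to be turned into a set of extensions before it can be matched. The sketch below shows one possible way to do that. The helper name is hypothetical, and the trimming, leading-dot stripping, and lower-casing are assumptions rather than documented connector behaviour.

```scala
// Hypothetical helper; the normalisation rules here are assumptions for illustration.
object ExtensionListSketch {

  // Turns a raw property value into a set of extensions.
  // A null, empty, or blank value yields None, i.e. "no filter configured".
  def parseExtensions(raw: String): Option[Set[String]] =
    Option(raw)
      .map(_.split(',').map(_.trim.stripPrefix(".").toLowerCase).filter(_.nonEmpty).toSet)
      .filter(_.nonEmpty)
}
```

With this helper, the inclusion example above could be written as either ".csv, .json" or "csv,json" and would produce the same set.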
* Source extension filters: part 1
* Wiring in
* Addressing review comments
* Making documentation more specific
kafka-connect-aws-s3/src/main/scala/io/lenses/streamreactor/connect/aws/s3/source/config/S3SourceConfig.scala
kafka-connect-aws-s3/src/main/scala/io/lenses/streamreactor/connect/aws/s3/source/config/S3SourceConfigDef.scala (+1)
@@ -30,5 +30,6 @@ object S3SourceConfigDef extends S3CommonConfigDef with CloudSourceSettingsKeys
kafka-connect-aws-s3/src/main/scala/io/lenses/streamreactor/connect/aws/s3/storage/AwsS3StorageInterface.scala
kafka-connect-aws-s3/src/test/scala/io/lenses/streamreactor/connect/aws/s3/config/S3ConfigSettingsTest.scala (+1 -1)
@@ -35,7 +35,7 @@ class S3ConfigSettingsTest extends AnyFlatSpec with Matchers with LazyLogging {
kafka-connect-cloud-common/src/main/scala/io/lenses/streamreactor/connect/cloud/common/config/traits/CloudConfig.scala
kafka-connect-cloud-common/src/main/scala/io/lenses/streamreactor/connect/cloud/common/source/config/CloudSourceSettings.scala
kafka-connect-cloud-common/src/main/scala/io/lenses/streamreactor/connect/cloud/common/source/config/CloudSourceSettingsKeys.scala
"Comma-separated list of file extensions to exclude from the source file search. If not configured, no files will be excluded. When used in conjunction with 'source.extension.includes', files must match the includes list and not match the excludes list to be considered."
"Comma-separated list of file extensions to include in the source file search. If not configured, all files are considered. When used in conjunction with 'source.extension.excludes', files must match the includes list and not match the excludes list to be considered."
"If you want to read to specific partitions when running the source. Options are 'hierarchical' (to match the sink's hierarchical file storage pattern) and 'regex' (supply a custom regex). Any other value will ignore original partitions and they should be evenly distributed through available partitions (Kafka dependent)."
kafka-connect-cloud-common/src/main/scala/io/lenses/streamreactor/connect/cloud/common/source/config/S3SourceBucketOptions.scala