[SPARK-54027] Kafka Source RTM support #52729
@@ -60,7 +60,11 @@ private[kafka010] class KafkaMicroBatchStream(
     metadataPath: String,
     startingOffsets: KafkaOffsetRangeLimit,
     failOnDataLoss: Boolean)
-  extends SupportsTriggerAvailableNow with ReportsSourceMetrics with MicroBatchStream with Logging {
+  extends SupportsTriggerAvailableNow
+  with SupportsRealTimeMode
+  with ReportsSourceMetrics
+  with MicroBatchStream
+  with Logging {

   private[kafka010] val pollTimeoutMs = options.getLong(
     KafkaSourceProvider.CONSUMER_POLL_TIMEOUT,
@@ -93,6 +97,11 @@ private[kafka010] class KafkaMicroBatchStream(

   private var isTriggerAvailableNow: Boolean = false

+  private var inRealTimeMode = false
+  override def prepareForRealTimeMode(): Unit = {
+    inRealTimeMode = true
+  }
+
   /**
    * Lazily initialize `initialPartitionOffsets` to make sure that `KafkaConsumer.poll` is only
    * called in StreamExecutionThread. Otherwise, interrupting a thread while running
@@ -218,6 +227,93 @@ private[kafka010] class KafkaMicroBatchStream(
     }.toArray
   }

+  override def planInputPartitions(start: Offset): Array[InputPartition] = {
+    // This function is used for real time mode. Trigger restrictions won't be supported.
+    if (maxOffsetsPerTrigger.isDefined) {
+      throw new UnsupportedOperationException(
+        "maxOffsetsPerTrigger is not compatible with real time mode")
+    }
+    if (minOffsetPerTrigger.isDefined) {
+      throw new UnsupportedOperationException(
+        "minOffsetsPerTrigger is not compatible with real time mode"
+      )
+    }
+    if (options.containsKey(KafkaSourceProvider.MIN_PARTITIONS_OPTION_KEY)) {
+      throw new UnsupportedOperationException(
+        "minpartitions is not compatible with real time mode"
+      )
+    }
+    if (options.containsKey(KafkaSourceProvider.ENDING_TIMESTAMP_OPTION_KEY)) {
+      throw new UnsupportedOperationException(
+        "endingtimestamp is not compatible with real time mode"
+      )
+    }
+    if (options.containsKey(KafkaSourceProvider.MAX_TRIGGER_DELAY)) {
+      throw new UnsupportedOperationException(
+        "maxtriggerdelay is not compatible with real time mode"
+      )
+    }
+
+    // This function is used by Low Latency Mode, where we expect 1:1 mapping between a
Suggested change:
-  // This function is used by Low Latency Mode, where we expect 1:1 mapping between a
+  // This function is used by real time mode, where we expect 1:1 mapping between a
KafkaMicroBatchStream's existing planInputPartitions calls kafkaOffsetReader.getOffsetRangesFromResolvedOffsets to handle partition offsets.
It handles the deleted-partitions case, but this new planInputPartitions doesn't. Should we do the same here?
Kafka doesn't support deleting partitions, so I am not sure that case is worth checking. If the topic was deleted and recreated, the offsets would not be valid and we would fail in that case anyway.
Hmm, this is what is currently in getOffsetRangesFromResolvedOffsets, called by KafkaMicroBatchStream.planInputPartitions:

    if (newPartitionInitialOffsets.keySet != newPartitions) {
      // We cannot get from offsets for some partitions. It means they got deleted.
      val deletedPartitions = newPartitions.diff(newPartitionInitialOffsets.keySet)
      reportDataLoss(
        s"Cannot find earliest offsets of ${deletedPartitions}. Some data may have been missed",
        () => KafkaExceptions.initialOffsetNotFoundForPartitions(deletedPartitions))
    }

The behavior of reportDataLoss is configurable. It can be a failure like what you did here, or a log warning.
I would suggest following the existing behavior instead of having two different behaviors.
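For context, a minimal self-contained sketch of the configurable behavior described above, assuming a failOnDataLoss-style flag; the class name and println logging are illustrative stand-ins, not the connector's actual helper:

```scala
// Illustrative sketch: the same data-loss event either fails the query or is only logged,
// depending on a failOnDataLoss-style switch.
class DataLossReporter(failOnDataLoss: Boolean) {
  def reportDataLoss(message: String, getException: () => Throwable): Unit = {
    if (failOnDataLoss) {
      throw getException() // strict: surface the problem as a query failure
    } else {
      println(s"WARN: $message") // lenient: warn and keep processing
    }
  }
}
```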
ok
For the case of new partitions with non-zero initial offsets, getOffsetRangesFromResolvedOffsets delegates to the reportDataLoss closure. Should we do the same?
Let me add that and make the behavior consistent.
Outdated
Hmm, I'm not sure if I'm missing something. rtmFetchLatestOffsetsTimeMs is assigned here, but it is not used anymore?
Let me remove this variable; it is not necessary.
Outdated
This comment is not needed, right?
Will revert this code based on this thread:
https://github.com/apache/spark/pull/52729/files#r2482113567
So this is not needed.
Outdated
This changes the original behavior? Previously it just used latestPartitionOffsets without fetching the latest offsets again.
This is actually fixing an issue with non-RTM queries using Kafka. The calculation here is not correct and will always result in the backlog metric being zero. "latestPartitionOffsets" is calculated when "latestOffset" is called at the beginning of a batch; it is basically the offset this batch will read up to, so for non-RTM streaming queries latestConsumedOffset will be the same as latestPartitionOffsets, resulting in zero backlog. What we should be doing is fetch the latest offsets from the source Kafka topic after the batch is processed, i.e. when metrics() is called, to calculate a useful backlog metric. I know this is not really related to RTM, so let me know if I should just create a separate PR for this.
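To make the intended fix concrete, a rough sketch of the calculation being described; estimateBacklog and the parameter names are illustrative, not fields of the actual source:

```scala
// Illustrative only: backlog = per-partition gap between what the topic currently holds
// (fetched when metrics() is called, after the batch finishes) and what the batch consumed.
def estimateBacklog(
    latestAvailableOffsets: Map[String, Long],
    latestConsumedOffsets: Map[String, Long]): Long = {
  latestAvailableOffsets.map { case (partition, available) =>
    val consumed = latestConsumedOffsets.getOrElse(partition, available)
    math.max(0L, available - consumed)
  }.sum
}
```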
Yeah, let's focus on RTM in this PR and not change/fix existing behavior here. Please open a separate PR to fix it if you think it is an issue.
ok
@@ -63,6 +63,13 @@ private[kafka010] class InternalKafkaConsumer(
   private[consumer] var kafkaParamsWithSecurity: ju.Map[String, Object] = _
   private val consumer = createConsumer()

+  def poll(pollTimeoutMs: Long): ju.List[ConsumerRecord[Array[Byte], Array[Byte]]] = {
+    val p = consumer.poll(Duration.ofMillis(pollTimeoutMs))
+    val r = p.records(topicPartition)
+    logDebug(s"Polled $groupId ${p.partitions()} ${r.size}")
+    r
+  }
+
   /**
    * Poll messages from Kafka starting from `offset` and returns a pair of "list of consumer record"
    * and "offset after poll". The list of consumer record may be empty if the Kafka consumer fetches
@@ -131,7 +138,7 @@ private[kafka010] class InternalKafkaConsumer(
     c
   }

-  private def seek(offset: Long): Unit = {
+  def seek(offset: Long): Unit = {
     logDebug(s"Seeking to $groupId $topicPartition $offset")
     consumer.seek(topicPartition, offset)
   }
@@ -228,6 +235,19 @@ private[consumer] case class FetchedRecord(
   }
 }

+/**
+ * This class keeps returning the next records. If no new record is available, it will keep
+ * polling until timeout. It is used by KafkaBatchPartitionReader.nextWithTimeout(), to reduce
+ * seeking overhead in real time mode.
+ */
+private[sql] trait KafkaDataConsumerIterator {
+  /**
+   * Return the next record
+   * @return None if no new record is available after `timeoutMs`.
+   */
+  def nextWithTimeout(timeoutMs: Long): Option[ConsumerRecord[Array[Byte], Array[Byte]]]
+}
+
 /**
  * This class helps caller to read from Kafka leveraging consumer pool as well as fetched data pool.
  * This class throws error when data loss is detected while reading from Kafka.
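For illustration, a self-contained sketch of how a KafkaDataConsumerIterator-style reader could be built on top of a plain Kafka consumer; this is not the PR's implementation, and the class name and buffering details are assumptions:

```scala
import java.time.Duration
import java.util.{ArrayDeque => JArrayDeque}

import org.apache.kafka.clients.consumer.{Consumer, ConsumerRecord}
import org.apache.kafka.common.TopicPartition

// Sketch: buffer whatever poll() returns and hand records out one at a time,
// polling again only when the buffer is empty, until the timeout expires.
// No seek is needed between calls because the consumer keeps its position.
class BufferedKafkaIterator(
    consumer: Consumer[Array[Byte], Array[Byte]],
    topicPartition: TopicPartition) {

  private val buffer = new JArrayDeque[ConsumerRecord[Array[Byte], Array[Byte]]]()

  def nextWithTimeout(timeoutMs: Long): Option[ConsumerRecord[Array[Byte], Array[Byte]]] = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (buffer.isEmpty && System.currentTimeMillis() < deadline) {
      val remaining = math.max(1L, deadline - System.currentTimeMillis())
      val records = consumer.poll(Duration.ofMillis(remaining)).records(topicPartition).iterator()
      while (records.hasNext) buffer.add(records.next())
    }
    Option(buffer.poll()) // an empty buffer yields null, which becomes None
  }
}
```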
@@ -272,6 +292,82 @@ private[kafka010] class KafkaDataConsumer(
   // Starting timestamp when the consumer is created.
   private var startTimestampNano: Long = System.nanoTime()

+  /**
+   * Get an iterator that can return the next entry. It is used exclusively for real-time
+   * mode.
+   *
+   * It is called by KafkaBatchPartitionReader.nextWithTimeout(). Unlike get(), there is no
+   * out-of-bound check in this function. Since there is no endOffset given, we assume any
+   * record is valid to return as long as it is at or after `offset`.
+   *
+   * @param startOffsets, the starting positions to read from, inclusive.
+   */
+  def getIterator(offset: Long): KafkaDataConsumerIterator = {
|
Suggested change:
-  def getIterator(offset: Long): KafkaDataConsumerIterator = {
+  def getIterator(startOffsets: Long): KafkaDataConsumerIterator = {
Could you explain more about this logging behavior? Why do we need this logging?
This tells us the semantics of the timestamp column from a Kafka record. That is, whether the timestamp for records from this topic is set to log-append time (when the record is persisted by Kafka brokers) or create time, which is either when the record is produced by a Kafka producer or is user-defined. This information is used when calculating latency, to understand what journey we are actually measuring.
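As a concrete illustration of why this matters for latency, assuming only the standard Kafka client API (describeLatency is a hypothetical helper, not part of this PR):

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.record.TimestampType

// Illustrative only: what a latency computed from record.timestamp() actually measures
// depends on the topic's timestamp type.
def describeLatency(record: ConsumerRecord[Array[Byte], Array[Byte]]): String = {
  val latencyMs = System.currentTimeMillis() - record.timestamp()
  record.timestampType() match {
    case TimestampType.LOG_APPEND_TIME =>
      // Broker-assigned timestamp: measures broker-append-to-read time.
      s"broker-append-to-read latency: $latencyMs ms"
    case TimestampType.CREATE_TIME =>
      // Producer- or user-assigned timestamp: measures create-to-read time.
      s"create-to-read latency: $latencyMs ms"
    case _ =>
      s"latency with unknown timestamp semantics: $latencyMs ms"
  }
}
```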