care partner alerts #715
base: master
Conversation
Force-pushed 374bae2 to 2deda23.
Force-pushed 8549c33 to 8367902.
Force-pushed 8367902 to 2ea9686.
Force-pushed 2ea9686 to 7246848.
To use this in QA, it must be paired with tidepool-org/hydrophone#145 and tidepool-org/go-common#71
Force-pushed c50d589 to 986106b.
Looks good overall, but the retry mechanism implemented here doesn't satisfy the latency requirements. The current implementation is fine for internal usage, but it's not production-ready. This could be handled in a separate PR if that makes the development and QA process easier.
data/events/events.go
Outdated
}
handler := asyncevents.NewSaramaConsumerGroupHandler(&asyncevents.NTimesRetryingConsumer{
	Consumer: r.Config.MessageConsumer,
	Delay:    CappedExponentialBinaryDelay(AlertsEventRetryDelayMaximum),
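For context, a capped exponential binary backoff of the kind referenced above might look roughly like the sketch below. This is illustrative only; the PR's actual CappedExponentialBinaryDelay may differ in signature and placement.

```go
package events

import "time"

// cappedExponentialBinaryDelay is a sketch: the delay doubles with each
// consecutive failure (1s, 2s, 4s, ...) until it reaches the supplied cap.
func cappedExponentialBinaryDelay(maximum time.Duration) func(tries int) time.Duration {
	return func(tries int) time.Duration {
		delay := time.Second << tries // 1s, 2s, 4s, 8s, ...
		if delay <= 0 || delay > maximum { // <= 0 guards against shift overflow
			return maximum
		}
		return delay
	}
}
```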
I don't think this is a suitable retry strategy given the latency requirements for this service. Kafka's consumer group concurrency is limited to the number of partitions of the topic. That number can't be very high, because Kafka's memory consumption grows linearly with the number of partitions. It follows that the number of partitions is much lower than the number of users we will have, so the data of multiple users will end up in the same partition. A failure to evaluate a single user's alerts for one minute, as currently set by the CappedExponentialBinaryDelay, will introduce at least a minute of delay for all of the users sharing that partition, because messages in a single partition are processed serially.
Alert notifications should be near real-time; up to 10 seconds of latency is acceptable. I think the solution proposed in this design document is how this should be handled. Other solutions that satisfy the requirements are welcome.
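For illustration, the multi-topic approach roughly means a failed message is republished to a dedicated retry topic (whose consumer waits before reprocessing) rather than being retried in place, so other users' messages in the same partition keep flowing. A hedged sketch follows, with hypothetical topic names and helper functions; this is not the design document's exact scheme.

```go
package example

import (
	"github.com/Shopify/sarama" // import path may be github.com/IBM/sarama depending on the Sarama version in use
)

// nextRetryTopic is a hypothetical mapping from a topic to its next retry tier.
func nextRetryTopic(topic string) string {
	switch topic {
	case "alerts":
		return "alerts.retry.30s"
	default:
		return "alerts.retry.5m"
	}
}

// handleWithTieredRetry is a sketch only: on failure the message is
// republished to the next retry topic instead of delaying in place, so other
// users' messages in the same partition are not blocked.
func handleWithTieredRetry(producer sarama.SyncProducer, msg *sarama.ConsumerMessage, consume func(*sarama.ConsumerMessage) error) error {
	if err := consume(msg); err == nil {
		return nil
	}
	retry := &sarama.ProducerMessage{
		Topic: nextRetryTopic(msg.Topic),
		Value: sarama.ByteEncoder(msg.Value),
	}
	_, _, err := producer.SendMessage(retry)
	return err
}
```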
This will require some more in-depth thought on my part... Will do.
Yeah, I think you're right, let's get this review merged, and I'll work on getting a multiple topic solution set up. Given the flexibility we have now, it shouldn't be too bad.
I have the multi-tier retry in the eric-alerts-multi-topic-retry branch.
This should be implemented in this branch now.
Force-pushed c08e1fc to 967c617.
Force-pushed 967c617 to cd42b19.
Force-pushed 9432468 to eaa652e.
Force-pushed d6449a1 to ee5da4a.
@toddkazakov I just removed two config env vars. I believe we talked about that before, but it slipped my mind until I was reviewing the helm chart changes today, where they came up again. So the re-review here is just around the config parsing in the most recent commit of the PR; nothing else is changed. [UPDATE] this comment is outdated.
Force-pushed 8ae1dc8 to 28fdf06.
Force-pushed 28fdf06 to b9767dc.
Requested in code review. #715 (comment) BACK-2499 BACK-2559
BACK-2499 BACK-2559
Previously, when completing a task, an available time of nil would cause the task to be marked as failed. Now, when a task completes with an available time of nil, time.Now() is substituted, which should cause the task to run again ASAP. Likewise, if the available time is in the past, it is replaced with time.Now() so that the task runs again ASAP. This supports the care partner no-communication check, which wants to run once per second; since that's not available from the task service (the smallest interval is 5 seconds), setting the value to a 1-second interval runs the task on each task service iteration. BACK-2559
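A minimal sketch of the substitution described above; the function and parameter names are illustrative, not the actual task service API.

```go
package task

import "time"

// normalizeAvailableTime replaces a nil or past available time with now, so
// the task becomes eligible to run again on the next task-service iteration.
func normalizeAvailableTime(availableTime *time.Time, now time.Time) time.Time {
	if availableTime == nil || availableTime.Before(now) {
		return now
	}
	return *availableTime
}
```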
…m duration BACK-2559
…ommunications BACK-2559
The need for this struct went away when retry delays were removed. BACK-2559
The minimum (and default) value is 5m. Validate that configs specify either 0, or a value in the range from 5m to 6h. BACK-2559
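A sketch of that validation rule, assuming a plain time.Duration setting; the function name is illustrative.

```go
package alerts

import (
	"fmt"
	"time"
)

// validateInterval accepts 0 (disabled) or a duration between 5m and 6h inclusive.
func validateInterval(d time.Duration) error {
	if d == 0 {
		return nil
	}
	if d < 5*time.Minute || d > 6*time.Hour {
		return fmt.Errorf("interval %s must be 0 or between 5m and 6h", d)
	}
	return nil
}
```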
This minimizes the Kafka load for non production environments. BACK-2559
As Todd pointed out in code review, there's no need for this to be separate from the existing DataRepository, as it only queries the deviceData collection. BACK-2559
BACK-2559
The tasks service and data service can both push alerts for care partners. This change consolidates that configuration into one set of environment variables loaded via the alerts package. I wish that alerts didn't need to know about envconfig, but at least for the moment it's the only way to consolidate the information about the configuration into a single re-usable struct. Naming of the pusher in both services is prefixed with "alerts" to communicate that this pusher is configured for care partner alerts. BACK-2559
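As a rough illustration of a single envconfig-loaded, re-usable struct shared by both services: the struct, field names, and env var names below are assumptions, not the PR's actual configuration.

```go
package alerts

import "github.com/kelseyhightower/envconfig"

// PusherConfig is a sketch of one shared configuration struct. The "ALERTS_"
// prefix mirrors the idea of naming the pusher for care partner alerts.
type PusherConfig struct {
	PushServiceURL string `envconfig:"ALERTS_PUSH_SERVICE_URL"`
	PushKeyID      string `envconfig:"ALERTS_PUSH_KEY_ID"`
}

// LoadPusherConfig populates PusherConfig from the environment so the tasks
// and data services can load identical settings.
func LoadPusherConfig() (*PusherConfig, error) {
	cfg := &PusherConfig{}
	if err := envconfig.Process("", cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}
```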
Force-pushed a1b7712 to 93a978a.
Rebased on latest master.
Force-pushed 786250a to a07d79e.
BACK-2559
Force-pushed a07d79e to 44fecf7.
This matches what's done with existing consumer groups. BACK-2559
Force-pushed dd6cbaf to 35c7de7.
Kafka/Sarama won't log it (at least not by default), and this isn't something we expect to see, so log it. BACK-2559
Previously a glucose.Glucose was used, which was fine in that we only look at the UserID and UploadID fields, but there are other incompatible fields that can "spoil" the deserialization. BACK-2559
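The idea in that commit can be illustrated with a minimal envelope struct that decodes only the fields the consumer needs, so unrelated fields in the payload cannot break deserialization. The struct name and JSON keys below are assumptions.

```go
package events

import "encoding/json"

// datumEnvelope decodes only the identifiers the alerts consumer looks at.
type datumEnvelope struct {
	UserID   string `json:"userId"`
	UploadID string `json:"uploadId"`
}

// parseEnvelope unmarshals a Kafka message payload into the minimal envelope.
func parseEnvelope(payload []byte) (*datumEnvelope, error) {
	var env datumEnvelope
	if err := json.Unmarshal(payload, &env); err != nil {
		return nil, err
	}
	return &env, nil
}
```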
Force-pushed a9eb4cd to b33531a.
This logger has some fields in it that can be useful for debugging. BACK-2559
Fields named message are quietly dropped by the platform log package. Also, logging []byte isn't super useful. BACK-2559
This used to be a series of PRs, but that didn't really work out. They're all collapsed into this one.
This shouldn't be merged until tidepool-org/go-common#71 is merged; after that, this PR's go-common dependency should be bumped.