-
Notifications
You must be signed in to change notification settings - Fork 502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-11444. Make Datanode Command metrics consistent across all commands #7191
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for continuing the work on this @jianghuazhu. I think it's good that we have a chance to look over the metrics for all datanode commands in addition to the replication ones. Here are some high level ideas of things we could do in this PR. We can discuss which of these are most useful and not too difficult to implement:
- Switch to
MutableRate
for time based metrics in all commands instead of manually calculating averages.- Much of this existing code is probably older than the
MetricUtil
class, so that may be helpful to use now as well.
- Much of this existing code is probably older than the
- Ideally all replication tasks would be required to implement the
transferredBytes
metric as aMutableStat
. Currently onlyReplicationTask
has this.- This would take a bit more work since it has to get filled in at the time of replication.
- Queue wait time would also be good to track.
- This would be most useful for replication commands, which share the same queue.
- This would also require maintaining more state about when tasks entered the queue. Probably through some sort of
queuedTime
setter inAbstractReplicationTask
.
- This would also require maintaining more state about when tasks entered the queue. Probably through some sort of
- Other commands have independent queue logic so it may not generalize well.
- This would be most useful for replication commands, which share the same queue.
@@ -240,6 +242,7 @@ public void addTask(AbstractReplicationTask task) { | |||
failureCounter.put(task.getMetricName(), new AtomicLong(0)); | |||
timeoutCounter.put(task.getMetricName(), new AtomicLong(0)); | |||
skippedCounter.put(task.getMetricName(), new AtomicLong(0)); | |||
requestTotalTime.put(task.getMetricName(), new AtomicLong(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add queued count here as well? The replication supervisor already has this information, but it is exposed via the individual commands instead. It might be more intuitive to have it here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @errose28 for the comment and review.
I will update it later.
Right now, taskCounter records the number of queues.
private final Map<Class<?>, AtomicInteger> taskCounter =
new ConcurrentHashMap<>();
I will add the QueueCount metric to ReplicationSupervisorMetrics later.
In addition, I suggest improving taskCounter so that it is consistent with requestCounter and successCounter.
private final Map<String, AtomicLong> taskCounter = new ConcurrentHashMap<>();
What do you think?
@errose28 , your suggestion is very good.
For the transferredBytes metric in ECReconstructionCoordinatorTask, we need to do some work in ECReconstructionCoordinator and ECBlockReconstructedStripeInputStream. jira 2: Shared queues seem to be only useful for replication commands, so we should add an abstract method in CommandHandler.
If other commands need to use shared queues, just implement this method. In addition, this metric also needs to be collected in CommandHandlerMetrics#getMetrics(). In addition, there is another issue worth discussing. There are timeout metrics in the replication commands and DeleteContainerCommandHandler, but the timeoutCount of DeleteContainerCommandHandler is not used in many places. In theory, all commands have timeouts and failures. Now, except for the replication commands, we only know the total number of calls for other commands. The granularity discussed here seems a bit fine. |
@errose28 , I have updated it. Can you take a look at this PR again? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should revisit the Grafana dashboards after this one gets in. There are a few new metrics added.
Merged. Thanks @jianghuazhu |
What changes were proposed in this pull request?
This PR is an additional improvement to HDDS-11376. The goal is to make Datanode command metrics consistent across all commands.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11444
How was this patch tested?
ci: https://github.com/jianghuazhu/ozone/actions/runs/10831917769
New jmx: