
HDDS-11444. Make Datanode Command metrics consistent across all commands #7191

Merged: 4 commits into apache:master on Oct 2, 2024

Conversation

jianghuazhu (Contributor):

What changes were proposed in this pull request?

This PR is an additional improvement to HDDS-11376. The goal is to make Datanode command metrics consistent across all commands.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-11444

How was this patch tested?

ci: https://github.com/jianghuazhu/ozone/actions/runs/10831917769
New JMX output: [screenshot]

@jianghuazhu (Author) commented:

@sodonnel @errose28 , can you help review this PR?
Thanks.

@errose28 self-requested a review on September 12, 2024.
@errose28 (Contributor) left a comment:

Thanks for continuing the work on this @jianghuazhu. I think it's good that we have a chance to look over the metrics for all datanode commands in addition to the replication ones. Here are some high-level ideas of things we could do in this PR. We can discuss which of these are most useful and not too difficult to implement:

  • Switch to MutableRate for time-based metrics in all commands instead of manually calculating averages (see the sketch after this list).
    • Much of this existing code is probably older than the MetricUtil class, so that may be helpful to use now as well.
  • Ideally all replication tasks would be required to implement the transferredBytes metric as a MutableStat. Currently only ReplicationTask has this.
    • This would take a bit more work since it has to get filled in at the time of replication.
  • Queue wait time would also be good to track.
    • This would be most useful for replication commands, which share the same queue.
      • This would also require maintaining more state about when tasks entered the queue. Probably through some sort of queuedTime setter in AbstractReplicationTask.
    • Other commands have independent queue logic so it may not generalize well.
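
As a rough illustration of the first point, here is a minimal sketch (class and metric names are hypothetical) of a MutableRate-backed command metric:

    import org.apache.hadoop.metrics2.annotation.Metric;
    import org.apache.hadoop.metrics2.annotation.Metrics;
    import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
    import org.apache.hadoop.metrics2.lib.MutableRate;

    // Hypothetical sketch: MutableRate tracks NumOps and AvgTime itself,
    // so handlers no longer need a manual totalTime / invocationCount pair.
    @Metrics(about = "Example datanode command metrics", context = "dfs")
    public class ExampleCommandMetrics {

      @Metric("Time taken to execute the command")
      private MutableRate commandExecutionTime;

      public static ExampleCommandMetrics create() {
        // Registering the annotated source initializes the @Metric fields.
        return DefaultMetricsSystem.instance().register(
            "ExampleCommandMetrics", "Example datanode command metrics",
            new ExampleCommandMetrics());
      }

      public void recordExecution(long elapsedMillis) {
        commandExecutionTime.add(elapsedMillis);
      }
    }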

@@ -240,6 +242,7 @@ public void addTask(AbstractReplicationTask task) {
failureCounter.put(task.getMetricName(), new AtomicLong(0));
timeoutCounter.put(task.getMetricName(), new AtomicLong(0));
skippedCounter.put(task.getMetricName(), new AtomicLong(0));
requestTotalTime.put(task.getMetricName(), new AtomicLong(0));
@errose28 (Contributor):
Should we add queued count here as well? The replication supervisor already has this information, but it is exposed via the individual commands instead. It might be more intuitive to have it here as well.

@jianghuazhu (Author):
Thanks @errose28 for the comment and review.
I will update it later.
Right now, taskCounter records the number of queued tasks per task type:

private final Map<Class<?>, AtomicInteger> taskCounter =
      new ConcurrentHashMap<>();

I will add the QueueCount metric to ReplicationSupervisorMetrics later.
In addition, I suggest improving taskCounter so that it is keyed by metric name, consistent with requestCounter and successCounter:

private final Map<String, AtomicLong> taskCounter = new ConcurrentHashMap<>();
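
For illustration, a minimal sketch (hypothetical, mirroring how the other per-command maps are maintained) of keeping the string-keyed counter up to date in addTask:

    // Hypothetical sketch: with taskCounter keyed by metric name, addTask
    // can maintain it exactly like requestCounter and successCounter.
    public void addTask(AbstractReplicationTask task) {
      taskCounter.computeIfAbsent(task.getMetricName(),
          k -> new AtomicLong(0)).incrementAndGet();
      // ... existing enqueue logic ...
    }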

What do you think?

@jianghuazhu (Author) commented on Sep 13, 2024:

@errose28, your suggestion is very good.
I will modify this PR later so that all time-based metrics in all commands use MutableRate to calculate the average time; I think the existing logic for calculating the total time should then be deleted.
In addition, I think two new subtasks should be created:
jira 1: Add a transferredBytes metric to all replication commands, including ECReconstructionCoordinatorTask and ReplicationTask. An abstract method should be added to AbstractReplicationTask:

public abstract void addTransferredBytes(long transferredBytes);

For the transferredBytes metric in ECReconstructionCoordinatorTask, we need to do some work in ECReconstructionCoordinator and ECBlockReconstructedStripeInputStream.
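
As a rough sketch (class and field names hypothetical), a concrete task could accumulate the bytes internally and expose them when the task finishes:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch: each concrete task accumulates bytes as data
    // moves, and the metrics layer reads the total when the task completes.
    class ExampleReplicationTask /* extends AbstractReplicationTask */ {
      private final AtomicLong transferredBytes = new AtomicLong();

      public void addTransferredBytes(long bytes) {
        transferredBytes.addAndGet(bytes);
      }

      public long getTransferredBytes() {
        return transferredBytes.get();
      }
    }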

jira 2: Shared queues seem to be useful only for replication commands, so we should add a default method to CommandHandler:

  default long getQueueTime() {
    return -1;
  }

If other commands need to use shared queues, they can simply override this method. In addition, this metric also needs to be collected in CommandHandlerMetrics#getMetrics().
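
As a rough sketch (all names hypothetical), the task could remember when it was enqueued, and the handler could turn that into the value returned by the proposed getQueueTime() hook:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch of queue wait tracking for shared-queue commands.
    abstract class QueuedTask {
      private volatile long queuedTimeMs;

      void setQueuedTime(long timeMs) {   // called when the task is enqueued
        this.queuedTimeMs = timeMs;
      }

      long getQueuedTime() {
        return queuedTimeMs;
      }
    }

    class QueueTimeTracker {
      private final AtomicLong lastWaitMs = new AtomicLong(-1);

      void onDequeue(QueuedTask task) {   // called when the task is picked up
        lastWaitMs.set(System.currentTimeMillis() - task.getQueuedTime());
      }

      // Mirrors the proposed default: -1 means "no shared queue".
      long getQueueTime() {
        return lastWaitMs.get();
      }
    }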

In addition, there is another issue worth discussing. There are timeout metrics in the replication commands and DeleteContainerCommandHandler, but the timeoutCount of DeleteContainerCommandHandler is not used in many places. In theory, all commands can time out and fail, yet right now, apart from the replication commands, we only know the total number of calls for the other commands. The granularity discussed here may be finer than we need.

@jianghuazhu (Author) commented:

@errose28 , I have updated it. Can you take a look at this PR again?
ci: https://github.com/jianghuazhu/ozone/actions/runs/10860807384
New JMX output: [screenshots]

@jianghuazhu (Author) commented:

@sodonnel @errose28 , do you have any new suggestions?

@jojochuang (Contributor) left a comment:

We should revisit the Grafana dashboards after this one gets in, since it adds a few new metrics.

cc @kerneltime @muskan1012

@jojochuang merged commit a0f0872 into apache:master on Oct 2, 2024. 25 checks passed.
@jojochuang (Contributor):

Merged. Thanks @jianghuazhu

sarvekshayr pushed a commit to sarvekshayr/ozone that referenced this pull request Oct 7, 2024