
HDDS-11444. Make Datanode Command metrics consistent across all commands #7191

Merged: 4 commits into apache:master on Oct 2, 2024

Conversation

jianghuazhu (Contributor):

What changes were proposed in this pull request?

This PR is an additional improvement to HDDS-11376. The goal is to make Datanode command metrics consistent across all commands.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-11444

How was this patch tested?

ci: https://github.com/jianghuazhu/ozone/actions/runs/10831917769
New JMX output: [screenshot]

@jianghuazhu (Author) commented:

@sodonnel @errose28 , can you help review this PR?
Thanks.

@errose28 self-requested a review on September 12, 2024.
@errose28 (Contributor) left a comment:

Thanks for continuing the work on this @jianghuazhu. I think it's good that we have a chance to look over the metrics for all datanode commands in addition to the replication ones. Here are some high-level ideas of things we could do in this PR. We can discuss which of these are most useful and not too difficult to implement:

  • Switch to MutableRate for time-based metrics in all commands instead of manually calculating averages (see the sketch after this list).
    • Much of this existing code is probably older than the MetricUtil class, so that may be helpful to use now as well.
  • Ideally all replication tasks would be required to implement the transferredBytes metric as a MutableStat. Currently only ReplicationTask has this.
    • This would take a bit more work since it has to get filled in at the time of replication.
  • Queue wait time would also be good to track.
    • This would be most useful for replication commands, which share the same queue.
      • This would also require maintaining more state about when tasks entered the queue. Probably through some sort of queuedTime setter in AbstractReplicationTask.
    • Other commands have independent queue logic so it may not generalize well.
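
As a rough illustration of the first point, here is a minimal sketch (class and metric names are hypothetical) of a MutableRate-backed command metric:

    import org.apache.hadoop.metrics2.annotation.Metric;
    import org.apache.hadoop.metrics2.annotation.Metrics;
    import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
    import org.apache.hadoop.metrics2.lib.MutableRate;

    // Hypothetical sketch: MutableRate tracks NumOps and AvgTime itself,
    // so handlers no longer need a manual totalTime / invocationCount pair.
    @Metrics(about = "Example datanode command metrics", context = "dfs")
    public class ExampleCommandMetrics {

      @Metric("Time taken to execute the command")
      private MutableRate commandExecutionTime;

      public static ExampleCommandMetrics create() {
        // Registering the annotated source initializes the @Metric fields.
        return DefaultMetricsSystem.instance().register(
            "ExampleCommandMetrics", "Example datanode command metrics",
            new ExampleCommandMetrics());
      }

      public void recordExecution(long elapsedMillis) {
        commandExecutionTime.add(elapsedMillis);
      }
    }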

@@ -240,6 +242,7 @@ public void addTask(AbstractReplicationTask task) {
failureCounter.put(task.getMetricName(), new AtomicLong(0));
timeoutCounter.put(task.getMetricName(), new AtomicLong(0));
skippedCounter.put(task.getMetricName(), new AtomicLong(0));
requestTotalTime.put(task.getMetricName(), new AtomicLong(0));
@errose28 (Contributor):
Should we add queued count here as well? The replication supervisor already has this information, but it is exposed via the individual commands instead. It might be more intuitive to have it here as well.

@jianghuazhu (Author):
Thanks @errose28 for the comment and review.
I will update it later.
Right now, taskCounter records the number of queued tasks per task type:

private final Map<Class<?>, AtomicInteger> taskCounter =
      new ConcurrentHashMap<>();

I will add the QueueCount metric to ReplicationSupervisorMetrics later.
In addition, I suggest improving taskCounter so that it is keyed by metric name, consistent with requestCounter and successCounter:

private final Map<String, AtomicLong> taskCounter = new ConcurrentHashMap<>();
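
For illustration, a minimal sketch (hypothetical, mirroring how the other per-command maps are maintained) of keeping the string-keyed counter up to date in addTask:

    // Hypothetical sketch: with taskCounter keyed by metric name, addTask
    // can maintain it exactly like requestCounter and successCounter.
    public void addTask(AbstractReplicationTask task) {
      taskCounter.computeIfAbsent(task.getMetricName(),
          k -> new AtomicLong(0)).incrementAndGet();
      // ... existing enqueue logic ...
    }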

What do you think?

@jianghuazhu (Author) commented on Sep 13, 2024:

@errose28, your suggestion is very good.
I will modify this PR later so that all time-based metrics in all commands use MutableRate to calculate the average time; I think the existing logic for calculating the total time should then be deleted.
In addition, I think two new subtasks should be created:
jira 1: Add a transferredBytes metric to all replication commands, including ECReconstructionCoordinatorTask and ReplicationTask. An abstract method should be added to AbstractReplicationTask:

public abstract void addTransferredBytes(long transferredBytes);

For the transferredBytes metric in ECReconstructionCoordinatorTask, we need to do some work in ECReconstructionCoordinator and ECBlockReconstructedStripeInputStream.
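
As a rough sketch (class and field names hypothetical), a concrete task could accumulate the bytes internally and expose them when the task finishes:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch: each concrete task accumulates bytes as data
    // moves, and the metrics layer reads the total when the task completes.
    class ExampleReplicationTask /* extends AbstractReplicationTask */ {
      private final AtomicLong transferredBytes = new AtomicLong();

      public void addTransferredBytes(long bytes) {
        transferredBytes.addAndGet(bytes);
      }

      public long getTransferredBytes() {
        return transferredBytes.get();
      }
    }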

jira 2: Shared queues seem to be useful only for replication commands, so we should add a default method to CommandHandler:

  default long getQueueTime() {
    return -1;
  }

If other commands need to use shared queues, they can simply override this method. In addition, this metric also needs to be collected in CommandHandlerMetrics#getMetrics().
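
As a rough sketch (all names hypothetical), the task could remember when it was enqueued, and the handler could turn that into the value returned by the proposed getQueueTime() hook:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch of queue wait tracking for shared-queue commands.
    abstract class QueuedTask {
      private volatile long queuedTimeMs;

      void setQueuedTime(long timeMs) {   // called when the task is enqueued
        this.queuedTimeMs = timeMs;
      }

      long getQueuedTime() {
        return queuedTimeMs;
      }
    }

    class QueueTimeTracker {
      private final AtomicLong lastWaitMs = new AtomicLong(-1);

      void onDequeue(QueuedTask task) {   // called when the task is picked up
        lastWaitMs.set(System.currentTimeMillis() - task.getQueuedTime());
      }

      // Mirrors the proposed default: -1 means "no shared queue".
      long getQueueTime() {
        return lastWaitMs.get();
      }
    }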

In addition, there is another issue worth discussing. There are timeout metrics in the replication commands and DeleteContainerCommandHandler, but the timeoutCount of DeleteContainerCommandHandler is not used in many places. In theory, all commands can time out and fail, yet right now, apart from the replication commands, we only know the total number of calls for the other commands. The granularity discussed here may be finer than we need.

@jianghuazhu (Author) commented:

@errose28 , I have updated it. Can you take a look at this PR again?
ci: https://github.com/jianghuazhu/ozone/actions/runs/10860807384
New JMX output: [screenshots]

@jianghuazhu (Author) commented:

@sodonnel @errose28 , do you have any new suggestions?

@jojochuang (Contributor) left a comment:

We should revisit the Grafana dashboards after this one gets in, since it adds a few new metrics.

cc @kerneltime @muskan1012

@jojochuang merged commit a0f0872 into apache:master on Oct 2, 2024. 25 checks passed.
@jojochuang (Contributor):

Merged. Thanks @jianghuazhu

sarvekshayr pushed a commit to sarvekshayr/ozone that referenced this pull request Oct 7, 2024