
@felipemello1 felipemello1 commented Oct 9, 2025

  1. Changes the logging mode config so it's clearer:

Before:

reduce_across_ranks: bool
share_run_id: bool

After:

logging_mode: Enum[GLOBAL_REDUCE, PER_RANK_REDUCE, PER_RANK_NO_REDUCE]
per_rank_share_run: bool
  1. Adds class:
class LoggingMode(Enum):
    GLOBAL_REDUCE = "global_reduce"
    PER_RANK_REDUCE = "per_rank_reduce"
    PER_RANK_NO_REDUCE = "per_rank_no_reduce"
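
As a minimal sketch of how the string values could be consumed, assuming the mode arrives as a plain string in a config dict: the enum can be built by value. The `parse_logging_mode` helper and the `backend_config` layout are hypothetical and only for illustration; the PR may wire the config differently.

```python
from enum import Enum


class LoggingMode(Enum):
    GLOBAL_REDUCE = "global_reduce"
    PER_RANK_REDUCE = "per_rank_reduce"
    PER_RANK_NO_REDUCE = "per_rank_no_reduce"


def parse_logging_mode(raw: str) -> LoggingMode:
    """Hypothetical helper: turn a config string into a LoggingMode."""
    try:
        # Enum lookup by value; normalize whitespace and case first.
        return LoggingMode(raw.strip().lower())
    except ValueError:
        valid = ", ".join(m.value for m in LoggingMode)
        raise ValueError(
            f"Unknown logging_mode {raw!r}; expected one of: {valid}"
        ) from None


# Hypothetical config layout, for illustration only.
backend_config = {"logging_mode": "per_rank_no_reduce", "per_rank_share_run": False}
mode = parse_logging_mode(backend_config["logging_mode"])
```

Failing fast on an unknown string keeps a typo in the config from silently falling back to a different mode.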
  1. Introduces the "PER_RANK_NO_REDUCE" mode. In this mode we call backend.log_stream(metric) as soon as we receive the metric, without any reduction.

Before, MetricLogger.push(metric) only collected the metric. Now it also logs it immediately to every backend in "PER_RANK_NO_REDUCE" mode.

def push(self, metric: Metric) -> None:
    # Flush immediately in "PER_RANK_NO_REDUCE" mode
    for backend in self.per_rank_no_reduce_backends:
        backend.log_stream(metric=metric, global_step=self.global_step)

    # Always accumulate for reduction and state return
    key = metric.key
    if key not in self.accumulators:
        self.accumulators[key] = metric.reduction.accumulator_class(
            metric.reduction
        )
    self.accumulators[key].append(metric.value)
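
The stream-then-accumulate behavior can be sketched in isolation. This is a simplified stand-in, not the PR's implementation: `Metric` is reduced to two fields, `ListBackend` and `ToyCollector` are hypothetical names, and plain lists replace the reduction accumulators.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    """Stand-in for the PR's Metric; only the fields needed here."""
    key: str
    value: float


class ListBackend:
    """Toy backend (hypothetical) that records every streamed metric."""

    def __init__(self) -> None:
        self.streamed: list[tuple[str, float, int]] = []

    def log_stream(self, metric: Metric, global_step: int) -> None:
        self.streamed.append((metric.key, metric.value, global_step))


@dataclass
class ToyCollector:
    """Simplified collector: plain lists replace the reduction accumulators."""
    per_rank_no_reduce_backends: list[ListBackend]
    global_step: int = 0
    accumulators: dict[str, list[float]] = field(default_factory=dict)

    def push(self, metric: Metric) -> None:
        # Stream right away to every PER_RANK_NO_REDUCE backend.
        for backend in self.per_rank_no_reduce_backends:
            backend.log_stream(metric=metric, global_step=self.global_step)
        # Always accumulate too, so a later reduction still sees the value.
        self.accumulators.setdefault(metric.key, []).append(metric.value)


backend = ListBackend()
collector = ToyCollector(per_rank_no_reduce_backends=[backend])
collector.push(Metric("loss", 0.5))
collector.push(Metric("loss", 0.25))
```

After the two pushes, the backend has seen both values individually, and the collector still holds both for any backend that reduces later.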

Notice how the x-axis is the timestamp:
image

  1. Main design change: logger backends now have async def log_batch and def log_stream. It's not totally clear to me whether both should be async (or both sync), or whether I should try to unify them.
class LoggerBackend(ABC):
    """Abstract logger_backend for metric logging, e.g. wandb, jsonl, etc."""

    def __init__(self, logger_backend_config: dict[str, Any]) -> None:
        self.logger_backend_config = logger_backend_config

    @abstractmethod
    async def init(
        self,
        role: BackendRole,
        primary_logger_metadata: dict[str, Any] | None = None,
        process_name: str | None = None,
    ) -> None:
        """Initializes backend, e.g. wandb.run.init()."""
        pass

    @abstractmethod
    async def log_batch(
        self, metrics: list[Metric], global_step: int, *args, **kwargs
    ) -> None:
        """Log batch of accumulated metrics to backend"""
        pass

    def log_stream(self, metric: Metric, global_step: int, *args, **kwargs) -> None:
        """Stream single metric to backend immediately."""
        pass

    async def finish(self) -> None:
        pass

    def get_metadata_for_secondary_ranks(self) -> dict[str, Any] | None:
        """Return sharable state after primary init (e.g., for shared modes). Called only on globals."""
        return None
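
To make the async/sync split concrete, here is a minimal sketch of a backend implementing both paths. It is an illustration under assumptions, not code from the PR: `InMemoryBackend` is hypothetical, `Metric` is a two-field stand-in, and the interface is trimmed (`init` takes only `process_name`, with no `BackendRole`).

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Metric:
    """Stand-in for the PR's Metric type."""
    key: str
    value: float


class LoggerBackend(ABC):
    """Trimmed copy of the interface above (init arguments simplified)."""

    def __init__(self, logger_backend_config: dict[str, Any]) -> None:
        self.logger_backend_config = logger_backend_config

    @abstractmethod
    async def init(self, process_name: Optional[str] = None) -> None: ...

    @abstractmethod
    async def log_batch(self, metrics: list[Metric], global_step: int) -> None: ...

    def log_stream(self, metric: Metric, global_step: int) -> None:
        pass

    async def finish(self) -> None:
        pass


class InMemoryBackend(LoggerBackend):
    """Hypothetical backend that stores records in a list."""

    async def init(self, process_name: Optional[str] = None) -> None:
        self.prefix = process_name or "proc"
        self.records: list[tuple[str, str, float, int]] = []

    async def log_batch(self, metrics: list[Metric], global_step: int) -> None:
        for m in metrics:
            self.records.append(("batch", f"{self.prefix}/{m.key}", m.value, global_step))

    def log_stream(self, metric: Metric, global_step: int) -> None:
        self.records.append(("stream", f"{self.prefix}/{metric.key}", metric.value, global_step))


async def main() -> InMemoryBackend:
    backend = InMemoryBackend({})
    await backend.init(process_name="trainer_0")
    # Sync path: callable from a plain (non-async) push() without awaiting.
    backend.log_stream(Metric("loss", 0.5), global_step=1)
    # Async path: awaited when accumulated metrics are flushed.
    await backend.log_batch([Metric("loss_mean", 0.4)], global_step=1)
    await backend.finish()
    return backend


backend = asyncio.run(main())
```

In this sketch the sync `log_stream` lets a synchronous `push` log without an event loop, while `log_batch` stays awaitable for the flush path. Whether that asymmetry is worth keeping is the open question raised above.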

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 9, 2025
@felipemello1 felipemello1 changed the title Metric Logging updates 5/N [draft] Metric Logging updates 5/N Oct 9, 2025
@felipemello1 felipemello1 marked this pull request as ready for review October 9, 2025 20:56
@felipemello1 felipemello1 changed the title [draft] Metric Logging updates 5/N Metric Logging updates 5/N Oct 9, 2025
