Aggregator shouldn't refer to persistence DB in case of evaluation (#1632)

rahulga1 · web-flow · commit 90996c05eb40 · 2025-05-16T14:04:22.000+02:00
* Enhance Collaborator module with improved error handling and documentation.

Refactor the `run` method to include detailed error logging and a new `_execute_collaborator_rounds` method for better task management. Update the `start_` function to handle exceptions during collaborator initialization and execution, ensuring critical errors are logged and communicated to the user.

Additionally, improve type hints and docstrings for better code clarity and maintainability.

Signed-off-by: Rahul Garg <rahul.garg@intel.com>

* Improve error logging in Collaborator class by ensuring consistent formatting of log messages. This change enhances the clarity of critical error messages and maintains a uniform style across the logging functionality.

Signed-off-by: Rahul Garg <rahul.garg@intel.com>

* .DS_Store banished!

* Refine persistent checkpoint handling in Aggregator class to prevent recovery during evaluation mode. Update logging messages for clarity on checkpoint status. This change ensures that the system behaves correctly in different operational contexts.

Signed-off-by: Rahul Garg <rahul.garg@intel.com>

* Refine logging message in Aggregator class to clarify conditions for persistent checkpoint usage. The updated message improves understanding by specifying when checkpoints are disabled or when the experiment is in evaluation mode.

Signed-off-by: Rahul Garg <rahul.garg@intel.com>

---------

Signed-off-by: Rahul Garg <rahul.garg@intel.com>
diff --git a/openfl/component/aggregator/aggregator.py b/openfl/component/aggregator/aggregator.py
@@ -150,15 +150,17 @@ def __init__(
         self.quit_job_sent_to = []
 
         self.tensor_db = TensorDB()
-        if persist_checkpoint:
+        if persist_checkpoint and not self.assigner.is_task_group_evaluation():
             persistent_db_path = persistent_db_path or "tensor.db"
             logger.info(
                 "Persistent checkpoint is enabled, setting persistent db at path %s",
                 persistent_db_path,
             )
             self.persistent_db = PersistentTensorDB(persistent_db_path)
         else:
-            logger.info("Persistent checkpoint is disabled")
+            logger.info(
+                "Either persistent checkpoint is disabled or the experiment is in evaluation mode"
+            )
             self.persistent_db = None
         # FIXME: I think next line generates an error on the second round
         # if it is set to 1 for the aggregator.
@@ -225,8 +227,10 @@ def __init__(
 
             self.secagg = SecAggSetup(self.uuid, self.authorized_cols, self.tensor_db)
 
-        if self.persistent_db and self._recover():
-            logger.info("Recovered state of aggregator")
+        # Only recover from persistent DB if not in evaluation mode
+        if self.persistent_db and not self.assigner.is_task_group_evaluation():
+            if self._recover():
+                logger.info("Recovered state of aggregator")
 
         # TODO: Aggregator has no concrete notion of round_begin.
         # https://github.com/securefederatedai/openfl/pull/1195#discussion_r1879479537
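
The guard added to `__init__` can be exercised in isolation. The following is a minimal sketch, not the OpenFL implementation: `StubAssigner`, `StubPersistentDB`, and `select_persistent_db` are hypothetical names introduced here for illustration. Only `is_task_group_evaluation()`, the `"tensor.db"` default, and the two log messages are taken from the diff.

```python
import logging

logger = logging.getLogger(__name__)


class StubAssigner:
    """Hypothetical stand-in for the assigner; exposes only the method used in the diff."""

    def __init__(self, evaluation: bool):
        self._evaluation = evaluation

    def is_task_group_evaluation(self) -> bool:
        return self._evaluation


class StubPersistentDB:
    """Hypothetical stand-in for PersistentTensorDB; it only records its path."""

    def __init__(self, path: str):
        self.path = path


def select_persistent_db(assigner, persist_checkpoint, persistent_db_path=None):
    """Mirror the new condition: open a persistent DB only outside evaluation mode."""
    if persist_checkpoint and not assigner.is_task_group_evaluation():
        path = persistent_db_path or "tensor.db"
        logger.info(
            "Persistent checkpoint is enabled, setting persistent db at path %s", path
        )
        return StubPersistentDB(path)
    logger.info(
        "Either persistent checkpoint is disabled or the experiment is in evaluation mode"
    )
    return None


# Training run with checkpoints enabled: a DB is created at the default path.
train_db = select_persistent_db(StubAssigner(evaluation=False), persist_checkpoint=True)

# Evaluation run: no DB is created even though checkpoints are enabled.
eval_db = select_persistent_db(StubAssigner(evaluation=True), persist_checkpoint=True)
```

Under this sketch, `persistent_db` is already `None` in evaluation mode, so the extra `is_task_group_evaluation()` check in the second hunk looks like a defensive duplicate. That reading assumes nothing else assigns `self.persistent_db` between the two hunks, which the diff does not show.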