From what I understand, the evaluation metrics are intended to assess how well each integration method performs batch correction and overall data integration. However, these integration methods are typically designed to correct technical variation (batch effects) rather than biological differences. Would it be appropriate to use these evaluation metrics, particularly the bio-conservation score, when integrating data from different biological conditions, such as disease vs. control or stimulated vs. unstimulated samples?