
Conversation

@zihugithub
Collaborator

Refactoring the CI trigger mechanism for Metax

@zihugithub requested a review from a team as a code owner on October 14, 2025 at 09:28
@gemini-code-assist
Contributor

Summary of Changes

Hello @zihugithub, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Continuous Integration (CI) framework for Metax by refactoring its trigger mechanism. The changes introduce a more structured and reliable approach to functional testing, expanding coverage for various large language models and ensuring stable test execution through pre-checks for GPU resource availability. The new modular configuration and robust result validation utilities aim to improve the overall efficiency and reliability of the CI pipeline.

Highlights

  • Expanded Inference Test Coverage: Introduced new functional inference test cases for various large language models, including DeepSeek R1 Distill Qwen, OPI Llama 3.1 Instruct, Qwen3, and RoboBrain2, with configurations for both standard and "flaggems" versions.
  • Robust CI Trigger Mechanism: Refactored the CI trigger logic for Metax, incorporating new shell scripts (_gpu_check.sh, test_all.sh, test_task.sh) to manage test execution, including a pre-check for GPU resource availability.
  • Modular Test Configuration and Validation: Implemented a modular system using config.yml and parse_config.py for defining and parsing test cases, alongside enhanced pytest utilities (conftest.py, test_result.py) for comprehensive result comparison against gold standards (a sketch of how such fixtures are commonly wired up follows this list).
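For readers skimming this summary, the test_path/test_type/test_task/test_case fixtures referenced throughout the review below are presumably wired up in conftest.py from command-line options. A minimal sketch of one common way to do that follows; the option names and defaults are assumptions, not necessarily what this PR implements.

# conftest.py -- illustrative sketch; option names and defaults are assumptions.
import pytest


def pytest_addoption(parser):
    # Expose the test-case coordinates as command-line options.
    parser.addoption("--test-path", action="store", default="tests/functional_tests/test_cases")
    parser.addoption("--test-type", action="store", default="inference")
    parser.addoption("--test-task", action="store", default=None)
    parser.addoption("--test-case", action="store", default=None)


@pytest.fixture
def test_path(request):
    return request.config.getoption("--test-path")


@pytest.fixture
def test_type(request):
    return request.config.getoption("--test-type")


@pytest.fixture
def test_task(request):
    return request.config.getoption("--test-task")


@pytest.fixture
def test_case(request):
    return request.config.getoption("--test-case")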
Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/all-tests-metax.yml
    • .github/workflows/functional-tests-metax.yml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a comprehensive set of functional tests and a CI trigger mechanism for Metax devices. The changes are extensive, adding numerous configuration files, test scripts, and utility code. While the overall structure is a good starting point, I've identified several areas for improvement regarding robustness, maintainability, and correctness in the test scripts and configurations. My review includes suggestions to address brittle path constructions, anti-patterns like using sleep for synchronization, inconsistencies in configuration files, and opportunities for refactoring to reduce code duplication. Addressing these points will significantly improve the quality and reliability of the new testing framework.



@pytest.mark.usefixtures("test_path", "test_type", "test_task", "test_case")
def test_equal(test_path, test_type, test_task, test_case, monkeypatch):

high

The test function test_equal is decorated with pytest.mark.usefixtures and accepts several fixtures (test_path, test_type, test_task, test_case), but these are not used within the function body. The test makes a request with a hardcoded URL and data, which makes it inflexible and its purpose unclear. The test should be refactored to utilize the provided fixtures to run a meaningful, dynamic test case.
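For illustration only, here is a rough sketch of how the fixtures could drive a dynamic request; the per-case request.json file, the use of the requests client, and the exact-match check against results_gold are assumptions, not this PR's actual implementation:

# Illustrative sketch -- request.json and the exact-match comparison are assumed, not from the PR.
import json
import os

import requests  # assumed HTTP client, standing in for the hardcoded request


def test_equal(test_path, test_type, test_task, test_case):
    case_dir = os.path.join(test_path, test_type, test_task)

    # Hypothetical per-case payload instead of a hardcoded URL and body.
    with open(os.path.join(case_dir, "request.json"), encoding="utf-8") as f:
        payload = json.load(f)

    response = requests.post(payload["url"], json=payload["data"], timeout=60)
    response.raise_for_status()

    # Compare against a per-case gold result; the results_gold layout mirrors the
    # inference checks elsewhere in this PR.
    gold_path = os.path.join(case_dir, "results_gold", test_case)
    with open(gold_path, encoding="utf-8") as f:
        assert response.json() == json.load(f)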


try:
    result = parse_config(args.config, args.type, args.task)
    print(result)

high

The current method for parsing test cases from config.yml is fragile. It relies on the shell script to parse the string representation of a Python list. A more robust approach would be to:

  1. Use a standard list format in config.yml (e.g., - 7b-tp2).
  2. Modify this Python script to print the list elements joined by a space.

This will make the configuration clearer and the parsing logic in the shell script more reliable.
Suggested change
-    print(result)
+    result = parse_config(args.config, args.type, args.task)
+    if isinstance(result, list):
+        print(" ".join(result))
+    else:
+        print(result)
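As a concrete illustration of the suggested scheme (the type, task, and case names below are placeholders, not taken from this PR), config.yml could hold a native YAML list:

# Hypothetical config.yml fragment -- keys and names are placeholders.
inference:
  deepseek_r1_distill_qwen:
    - 7b-tp2
    - 7b-tp2-flaggems

With parse_config.py printing the cases space-separated, the shell script could then loop with a plain for _case in $(python parse_config.py --config ... --type ... --task ...) and avoid parsing a Python list literal (the argument names are inferred from args.config/args.type/args.task in the hunk above).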

fi

if [ "${_type}" = "inference" ]; then
run_command "python $FLAG_DIR/run.py --config-path tests/${PWD##*/}/tests/functional_tests/test_cases/${_type}/${_task}/conf --config-name ${_case} action=test" $attempt_i $_task $_type $_case

high

The path construction tests/${PWD##*/}/... is brittle as it depends on the name of the current working directory. This can cause tests to fail when run from different locations. You should use the ${device} variable, which is reliably determined by the get_device_type function, to ensure the path is always correct.

Suggested change
run_command "python $FLAG_DIR/run.py --config-path tests/${PWD##*/}/tests/functional_tests/test_cases/${_type}/${_task}/conf --config-name ${_case} action=test" $attempt_i $_task $_type $_case
run_command "python $FLAG_DIR/run.py --config-path tests/${device}/tests/functional_tests/test_cases/${_type}/${_task}/conf --config-name ${_case} action=test" $attempt_i $_task $_type $_case

fi

# Ensure that pytest check is completed before deleting the folder
sleep 10s

high

Using sleep to wait for an operation to complete is an anti-pattern that can lead to either flaky tests (if the sleep is too short) or inefficient execution (if it's too long). The script should implement a reliable synchronization mechanism to ensure that the test results are fully written before pytest is executed. The run_command function appears synchronous, so if run.py spawns a background process, the script must explicitly wait for it.
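For example, a minimal sketch of explicit synchronization, assuming the asynchronous step is launched from this script and its PID is available (the variable names here are hypothetical):

# Illustrative sketch only -- variable names are hypothetical.
pytest "${pytest_args[@]}" &   # hypothetical asynchronous pytest invocation
pytest_pid=$!
wait "${pytest_pid}"           # block until the check has actually finished
rm -rf "${test_result_dir}"    # hypothetical cleanup that currently follows the sleep

If the results are instead produced by a detached process spawned by run.py, polling for the expected output file with a bounded timeout would be a comparable alternative to a fixed sleep.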

Comment on lines +20 to +22
CUDNN_BENCHMARK: "false"
CUDNN_DETERMINISTIC: "true"
USE_FLAGGEMS: "true"

medium

There is an inconsistent use of boolean values for environment variables in this and other YAML configuration files. Some are defined as strings (e.g., "true"), while others use YAML's native boolean type (e.g., true on line 42). For consistency and to prevent potential parsing issues, it's best to use native YAML booleans throughout.

    CUDNN_BENCHMARK: false
    CUDNN_DETERMINISTIC: true
    USE_FLAGGEMS: true


assert os.path.exists(result_path), f"Failed to find 'stdout.log' at {result_path}"

with open(result_path, "r") as file:

medium

Opening files without specifying an encoding can lead to inconsistent behavior across different operating systems or locales, and may cause a UnicodeDecodeError if the file contains non-ASCII characters. It's a best practice to always specify the encoding. This applies to all open() calls throughout this file.

Suggested change
with open(result_path, "r") as file:
with open(result_path, "r", encoding="utf-8") as file:

Comment on lines +70 to +122
def test_inference_equal(test_path, test_type, test_task, test_case):
    # Construct the test_result_path using the provided fixtures
    test_result_path = os.path.join(test_path, test_type, test_task, "results_test", test_case)
    result_path = os.path.join(test_result_path, "inference_logs/host_0_localhost.output")

    print("result_path:", result_path)

    assert os.path.exists(result_path), f"Failed to find 'host_0_localhost.output' at {result_path}"

    with open(result_path, "r") as file:
        lines = file.readlines()

    result_lines = []
    output = False
    for line in lines:
        assert "Failed to import 'flag_gems'" not in line, "Failed to import 'flag_gems''"
        if line == "**************************************************\n":
            output = True
        if line == "##################################################\n":
            output = False
        if output == True:
            result_lines.append(line)

    gold_value_path = os.path.join(test_path, test_type, test_task, "results_gold", test_case)
    assert os.path.exists(gold_value_path), f"Failed to find gold result at {gold_value_path}"

    with open(gold_value_path, "r") as file:
        gold_value_lines = file.readlines()

    # Remove the blank line at the end.
    if gold_value_lines:
        last_non_empty = len(gold_value_lines) - 1
        while last_non_empty >= 0 and not gold_value_lines[last_non_empty].strip():
            last_non_empty -= 1
        if last_non_empty >= 0:
            gold_value_lines = gold_value_lines[: last_non_empty + 1]
        else:
            gold_value_lines = []

    print("\nResult checking")
    print("Result: ", result_lines)
    print("Gold Result: ", gold_value_lines)

    print("len(result_lines), (gold_value_lines): ", len(result_lines), len(gold_value_lines))
    assert len(result_lines) == len(gold_value_lines)

    for result_line, gold_value_line in zip(result_lines, gold_value_lines):
        print(result_line, gold_value_line)
        assert result_line.rstrip('\n') == gold_value_line.rstrip('\n')


@pytest.mark.usefixtures("test_path", "test_type", "test_task", "test_case")
def test_inference_pipeline(test_path, test_type, test_task, test_case):

medium

The functions test_inference_equal and test_inference_pipeline contain a significant amount of duplicated code for locating and reading the inference output file. This redundancy makes the code harder to maintain. This common logic should be extracted into a shared helper function to improve modularity and reduce code duplication.
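A sketch of the kind of helper the comment has in mind, based on the logic in test_inference_equal above (the helper name is illustrative, not from the PR):

# Illustrative helper sketch -- the name read_inference_output is not from the PR.
import os


def read_inference_output(test_path, test_type, test_task, test_case):
    """Locate the inference log and return the lines between the marker rows."""
    result_path = os.path.join(
        test_path, test_type, test_task, "results_test", test_case,
        "inference_logs/host_0_localhost.output",
    )
    assert os.path.exists(result_path), f"Failed to find 'host_0_localhost.output' at {result_path}"

    with open(result_path, "r", encoding="utf-8") as file:
        lines = file.readlines()

    result_lines, capture = [], False
    for line in lines:
        assert "Failed to import 'flag_gems'" not in line, "Failed to import 'flag_gems'"
        if line == "**************************************************\n":
            capture = True
        if line == "##################################################\n":
            capture = False
        if capture:
            result_lines.append(line)
    return result_lines

Both test_inference_equal and test_inference_pipeline could then call this helper and keep only their case-specific comparison logic.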


# Check the next three lines for equality before the '=' character
for j in range(1, 4):
    result_parts = result_group[j].split('=')

medium

Using split('=') to parse key-value pairs from the log file is not robust, as it will fail if the value itself contains an = character. To handle such cases correctly, you should split only on the first occurrence of the delimiter.

Suggested change
result_parts = result_group[j].split('=')
result_parts = result_group[j].split('=', 1)


# Wait and show current status
echo "Waiting for Metax GPU memory usage to drop below 50% (current max usage: ${max_usage_percent}%)"
sleep 1m

medium

A 1m sleep interval between GPU availability checks can be too long and may unnecessarily slow down the CI/CD pipeline. Consider reducing this to a shorter duration (e.g., 15s) to make the script more responsive to changes in GPU status.

Suggested change
sleep 1m
sleep 15s


for cmd in "${commands[@]}"; do
# Execute the command
$cmd

medium

Executing commands directly from a variable ($cmd) can be unsafe and lead to issues with word splitting if arguments contain spaces or special characters. While the current commands are simple, using eval provides a safer and more robust way to execute commands stored in strings.

Suggested change
$cmd
eval "$cmd"

@gemini-code-assist

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.
