16 changes: 12 additions & 4 deletions README.md
@@ -69,11 +69,11 @@ To run tests: `mvn test`

 To build binary: `mvn package`

-- which will produce `tool/target/datacommons-import-tool-0.1-jar-with-dependencies.jar`
+- which will produce `tool/target/datacommons-import-tool-<version>-jar-with-dependencies.jar`
 - and you can run it with

 ```bash
-java -jar tool/target/datacommons-import-tool-0.1-jar-with-dependencies.jar
+java -jar tool/target/datacommons-import-tool-<version>-jar-with-dependencies.jar
 ```

 > To run the above maven commands on M1 macs ([details][m1]), use the `-Dos.arch=x86_64` option.
@@ -92,11 +92,11 @@ To run tests: `mvn test`

 To build binary: `mvn package`

-- which will produce `server/target/datacommons-server-0.1.jar`
+- which will produce `server/target/datacommons-server-<version>.jar`
 - and you can run it with

 ```bash
-java -jar server/target/datacommons-server-0.1.jar <file1.tmcf> <file2.csv>
+java -jar server/target/datacommons-server-<version>.jar <file1.tmcf> <file2.csv>
 ```

 Send a request:
@@ -149,6 +149,14 @@ steps in [contributing.md](contributing.md).

 Wait for approval of the Pull Request and merge the change.

+### Creating a Release
+
+1. Update the version in `pom.xml` to the release version (remove `-SNAPSHOT`).
+2. Build the project: `mvn clean package`.
+3. Create a new Release on GitHub.
+4. Upload the built artifacts (e.g., `tool/target/datacommons-import-tool-<version>-jar-with-dependencies.jar`) to the GitHub release.
+5. Update the version in `pom.xml` to the next snapshot version (e.g., `0.3.2-SNAPSHOT`) and raise a Pull Request.
+
 ## License

 Apache 2.0
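
The version arithmetic in steps 1 and 5 of the "Creating a Release" section above can be sketched in bash as follows. This is an illustration only and not part of the README change: both helper names are hypothetical, and the sketch assumes a dotted version whose last component is a plain decimal number.

```bash
#!/usr/bin/env bash
# Sketch only: both helper names are hypothetical, not part of the repository.

# Step 1: strip a trailing -SNAPSHOT to get the release version.
release_version() {
  printf '%s\n' "${1%-SNAPSHOT}"
}

# Step 5: bump the last numeric component and re-append -SNAPSHOT.
# Assumes a dotted version whose last component is a plain decimal number.
next_snapshot_version() {
  local release=$1
  local prefix=${release%.*}   # everything before the last dot
  local last=${release##*.}    # the last component
  printf '%s.%d-SNAPSHOT\n' "$prefix" "$((last + 1))"
}
```

With the version from this PR, `release_version 0.3.1-SNAPSHOT` prints `0.3.1` and `next_snapshot_version 0.3.1` prints `0.3.2-SNAPSHOT`, which matches the example in step 5.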
9 changes: 8 additions & 1 deletion cloudbuild.yaml
@@ -6,7 +6,14 @@ steps:
 - AUTOPUSH_DC_API_KEY

 - name: 'gcr.io/cloud-builders/gsutil'
-args: ['cp', 'tool/target/datacommons-import-tool-0.1-SNAPSHOT-jar-with-dependencies.jar', 'gs://datacommons_public/import_tools/import-tool.jar']
+entrypoint: 'bash'
+args:
+- '-c'
+- |
+# Find the JAR file using wildcard
+JAR_FILE=$(ls tool/target/datacommons-import-tool-*-jar-with-dependencies.jar | head -n 1)
+echo "Found JAR: ${JAR_FILE}"
+gsutil cp "${JAR_FILE}" "gs://datacommons_public/import_tools/import-tool.jar"
Comment on lines +13 to +16

high

The use of ls ... | head -n 1 to find the JAR file is not robust. If the build directory is not clean, multiple JAR files with different versions might exist. ls sorts alphabetically, so this command could select an older version, leading to the deployment of an incorrect artifact. A safer approach is to ensure that exactly one JAR file matches the pattern and fail the build otherwise.

        # Find the JAR file, ensuring there is exactly one.
        JAR_FILES=(tool/target/datacommons-import-tool-*-jar-with-dependencies.jar)
        if [ "${#JAR_FILES[@]}" -ne 1 ]; then
          echo "Error: Expected 1 JAR file, but found ${#JAR_FILES[@]}. Aborting." >&2
          exit 1
        fi
        JAR_FILE="${JAR_FILES[0]}"
        echo "Found JAR: ${JAR_FILE}"
        gsutil cp "${JAR_FILE}" "gs://datacommons_public/import_tools/import-tool.jar"
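
One caveat with the array check above: in bash's default mode, a glob that matches nothing is left as the literal pattern string, so the array still holds one element and the check passes with a nonexistent path. A minimal sketch of a stricter variant follows, assuming bash is the shell. `find_single_match` is a hypothetical helper name, not something in this PR.

```bash
#!/usr/bin/env bash
# Sketch only: find_single_match is a hypothetical helper, not part of the PR.
# Prints the single path matching a glob pattern, or fails if the pattern
# matches zero or several paths. Assumes the pattern contains no whitespace.
find_single_match() (
  # The function body runs in a subshell, so nullglob does not leak out.
  # With nullglob, a pattern that matches nothing expands to an empty list
  # instead of being left as the literal pattern string.
  shopt -s nullglob
  # Intentionally unquoted so that the shell expands the glob.
  matches=( $1 )
  if [ "${#matches[@]}" -ne 1 ]; then
    echo "Error: expected 1 match for '$1', found ${#matches[@]}." >&2
    exit 1
  fi
  printf '%s\n' "${matches[0]}"
)
```

A build step could then call it as `JAR_FILE=$(find_single_match 'tool/target/datacommons-import-tool-*-jar-with-dependencies.jar') || exit 1`, which aborts both when the directory holds stale JARs and when the build produced none.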


Can we consider passing the version as a substitution variable (similar to API key)?
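
If the version were supplied explicitly rather than discovered with a wildcard, it would have to come from somewhere. One possible source, sketched below, is the `<revision>` property that this PR edits in `pom.xml`. The helper names are hypothetical, and the sketch assumes the property sits on a single line and appears once.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the PR.

# Prints the value of the first single-line <revision>...</revision> element.
read_pom_revision() {
  local pom=${1:-pom.xml}
  sed -n 's:.*<revision>\(.*\)</revision>.*:\1:p' "$pom" | head -n 1
}

# Builds the import tool JAR path from the version found in the given pom.
import_tool_jar_path() {
  local version
  version=$(read_pom_revision "${1:-pom.xml}")
  printf 'tool/target/datacommons-import-tool-%s-jar-with-dependencies.jar\n' "$version"
}
```

Because the path is computed rather than globbed, a stale JAR from another version in `tool/target` would not be picked up by accident.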


 - name: 'gcr.io/cloud-builders/gcloud'
 args: ['builds', 'triggers', 'run', 'dc-import-executor', '--branch=master', '--substitutions', '_DOCKER_IMAGE=us-docker.pkg.dev/datcom-ci/gcr.io/dc-import-executor']
6 changes: 5 additions & 1 deletion pipeline/differ/template.sh
@@ -4,13 +4,17 @@ OPERATION=$1

 if [ "$OPERATION" == "deploy" ]; then
 echo "Deploying Dataflow Flex Template..."
+# Find the JAR file using wildcard
+JAR_FILE=$(ls target/differ-bundled-*.jar | head -n 1)
+echo "Found JAR: ${JAR_FILE}"
Comment on lines +7 to +9

high

The use of ls ... | head -n 1 to find the JAR file is not robust. If the build directory is not clean, multiple JAR files with different versions might exist. ls sorts alphabetically, so this command could select an older version. A safer approach is to ensure that exactly one JAR file matches the pattern.

Suggested change
-# Find the JAR file using wildcard
-JAR_FILE=$(ls target/differ-bundled-*.jar | head -n 1)
-echo "Found JAR: ${JAR_FILE}"
+# Find the JAR file, ensuring there is exactly one.
+JAR_FILES=(target/differ-bundled-*.jar)
+if [ "${#JAR_FILES[@]}" -ne 1 ]; then
+echo "Error: Expected 1 JAR file, but found ${#JAR_FILES[@]}. Aborting." >&2
+exit 1
+fi
+JAR_FILE="${JAR_FILES[0]}"
+echo "Found JAR: ${JAR_FILE}"


 gcloud dataflow flex-template build \
 "gs://datcom-templates/templates/flex/differ.json" \
 --image-gcr-path "gcr.io/datcom-ci/dataflow-templates/differ:latest" \
 --sdk-language "JAVA" \
 --flex-template-base-image JAVA17 \
 --metadata-file "metadata.json" \
---jar "target/differ-bundled-0.1-SNAPSHOT.jar" \
+--jar "${JAR_FILE}" \
 --env FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.datacommons.pipeline.differ.DifferPipeline"
 elif [ "$OPERATION" == "run" ]; then
 echo "Running Dataflow Flex Template..."
35 changes: 17 additions & 18 deletions pipeline/ingestion/cloudbuild.yaml
@@ -19,24 +19,23 @@ steps:
 # 2. Build the Dataflow Flex Template
 # This step uses the built JAR to create the Flex Template image and spec file.
 - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
-entrypoint: 'gcloud'
+entrypoint: 'bash'
 args:
-- 'dataflow'
-- 'flex-template'
-- 'build'
-- 'gs://$_TEMPLATE_BUCKET/templates/flex/ingestion.json'
-- '--image-gcr-path'
-- '${_IMAGE_GCR_PATH}:$SHORT_SHA'
-- '--sdk-language'
-- 'JAVA'
-- '--flex-template-base-image'
-- 'JAVA17'
-- '--metadata-file'
-- 'pipeline/ingestion/metadata.json'
-- '--jar'
-- 'pipeline/ingestion/target/ingestion-bundled-${_VERSION}.jar'
-- '--env'
-- 'FLEX_TEMPLATE_JAVA_MAIN_CLASS=org.datacommons.ingestion.pipeline.ImportGroupPipeline'
+- '-c'
+- |
+# Find the JAR file using wildcard
+# Note: We are in the root of the workspace context for the build.
+JAR_FILE=$(ls pipeline/ingestion/target/ingestion-bundled-*.jar | head -n 1)
+echo "Found JAR: ${JAR_FILE}"
+
+gcloud dataflow flex-template build \
+gs://$_TEMPLATE_BUCKET/templates/flex/ingestion.json \
+--image-gcr-path "${_IMAGE_GCR_PATH}:$SHORT_SHA" \
Comment on lines +32 to +33

security-high high

This section introduces a command injection vulnerability due to the insecure use of Cloud Build substitutions ($_TEMPLATE_BUCKET, ${_IMAGE_GCR_PATH}) within a bash -c block. Cloud Build performs string replacement before passing the script to the shell, which can lead to command injection if an attacker can influence these substitutions. Specifically, on line 32, $_TEMPLATE_BUCKET is unquoted, allowing command breakout, and on line 33, ${_IMAGE_GCR_PATH} inside double quotes still allows command substitution (e.g., $(id)). To remediate this, pass the substitutions as environment variables to the build step and use them within the shell script using shell variable expansion (e.g., "$${TEMPLATE_BUCKET}").
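
The difference between the two approaches can be reproduced locally. The sketch below is a plain bash illustration and not Cloud Build itself: the first function mimics textual substitution by splicing a value into the script source, and the second passes the same value through the environment. The function names and the sample value are assumptions made for the demonstration.

```bash
#!/usr/bin/env bash
# Sketch only: function names and the sample value are illustrative.

# Mimics textual substitution: the value becomes part of the script source
# before bash parses it, so shell metacharacters in it are interpreted.
run_with_text_substitution() {
  local value=$1
  bash -c "echo gs://${value}/templates/flex/ingestion.json"
}

# Passes the value as an environment variable. The single-quoted script
# contains no external data, and bash expands the variable only at run time,
# so the value is treated as data rather than as code.
run_with_env_var() {
  local value=$1
  TEMPLATE_BUCKET="$value" bash -c 'echo "gs://${TEMPLATE_BUCKET}/templates/flex/ingestion.json"'
}
```

Calling `run_with_text_substitution 'datcom-templates; echo INJECTED #'` runs the injected `echo` as a second command, while `run_with_env_var` with the same argument prints the whole value as one literal string. In `cloudbuild.yaml`, the second form corresponds to the remediation described above: set the step's environment from the substitutions and refer to it in the script with the `$$` escape.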


Any thoughts on this?

+--sdk-language JAVA \
+--flex-template-base-image JAVA17 \
+--metadata-file pipeline/ingestion/metadata.json \
+--jar "${JAR_FILE}" \
+--env FLEX_TEMPLATE_JAVA_MAIN_CLASS=org.datacommons.ingestion.pipeline.ImportGroupPipeline
 id: 'build-flex-template'

 availableSecrets:
@@ -45,6 +44,6 @@ availableSecrets:
 env: AUTOPUSH_DC_API_KEY

 substitutions:
-_VERSION: "0.1-SNAPSHOT"
+
 _TEMPLATE_BUCKET: "datcom-templates"
 _IMAGE_GCR_PATH: "gcr.io/datcom-ci/dataflow-templates/ingestion"
6 changes: 5 additions & 1 deletion pipeline/ingestion/template.sh
@@ -4,13 +4,17 @@ OPERATION=$1

 if [ "$OPERATION" == "deploy" ]; then
 echo "Deploying Dataflow Flex Template..."
+# Find the JAR file using wildcard
+JAR_FILE=$(ls target/ingestion-bundled-*.jar | head -n 1)
+echo "Found JAR: ${JAR_FILE}"
Comment on lines +7 to +9

high

The use of ls ... | head -n 1 to find the JAR file is not robust. If the build directory is not clean, multiple JAR files with different versions might exist. ls sorts alphabetically, so this command could select an older version. A safer approach is to ensure that exactly one JAR file matches the pattern.

Suggested change
-# Find the JAR file using wildcard
-JAR_FILE=$(ls target/ingestion-bundled-*.jar | head -n 1)
-echo "Found JAR: ${JAR_FILE}"
+# Find the JAR file, ensuring there is exactly one.
+JAR_FILES=(target/ingestion-bundled-*.jar)
+if [ "${#JAR_FILES[@]}" -ne 1 ]; then
+echo "Error: Expected 1 JAR file, but found ${#JAR_FILES[@]}. Aborting." >&2
+exit 1
+fi
+JAR_FILE="${JAR_FILES[0]}"
+echo "Found JAR: ${JAR_FILE}"


 gcloud dataflow flex-template build \
 "gs://datcom-templates/templates/flex/ingestion.json" \
 --image-gcr-path "gcr.io/datcom-ci/dataflow-templates/ingestion:latest" \
 --sdk-language "JAVA" \
 --flex-template-base-image JAVA17 \
 --metadata-file "metadata.json" \
---jar "target/ingestion-bundled-0.1-SNAPSHOT.jar" \
+--jar "${JAR_FILE}" \
 --env FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.datacommons.ingestion.pipeline.ImportGroupPipeline"
 elif [ "$OPERATION" == "run" ]; then
 echo "Running Dataflow Flex Template..."
2 changes: 1 addition & 1 deletion pom.xml
@@ -13,7 +13,7 @@
 <maven.compiler.source>17</maven.compiler.source>
 <maven.compiler.target>17</maven.compiler.target>
 <!-- Dependency versions -->
-<revision>0.1-SNAPSHOT</revision>
+<revision>0.3.1-SNAPSHOT</revision>

Curious why we go from 0.1 to 0.3.1. Shall we do 0.1.1 instead?

 <beam.version>2.67.0</beam.version>
 <gson.version>2.10.1</gson.version>
 <os.maven.plugin.version>1.7.1</os.maven.plugin.version>