V10.3 #249

Merged: 10 commits, Mar 6, 2025
13 changes: 0 additions & 13 deletions .github/workflows/check-autobuilder.yml

This file was deleted.

13 changes: 8 additions & 5 deletions .github/workflows/vale-tdbx.yml
@@ -12,23 +12,26 @@ jobs:
- name: checkout
uses: actions/checkout@master

- name: Install docutils
run: sudo apt-get install -y docutils

- id: files
uses: masesgroup/retrieve-changed-files@v2
with:
format: 'csv'
format: "csv"

- name: checkout-latest-rules
uses: actions/checkout@master
with:
repository: mongodb/mongodb-vale-action
path: './tdbx-vale-rules'
path: "./tdbx-vale-rules"
token: ${{secrets.GITHUB_TOKEN}}

- name: move-files-for-vale-action
run: |
cp tdbx-vale-rules/.vale.ini .vale.ini
mkdir -p .github/styles/
cp -rf tdbx-vale-rules/.github/styles/ .github/
cp tdbx-vale-rules/.vale.ini .vale.ini
mkdir -p .github/styles/
cp -rf tdbx-vale-rules/.github/styles/ .github/

- name: run-vale
uses: errata-ai/vale-action@reviewdog
7 changes: 7 additions & 0 deletions build.sh
@@ -0,0 +1,7 @@
# ensures that we always use the latest version of the script
if [ -f build-site.sh ]; then
rm build-site.sh
fi

curl https://raw.githubusercontent.com/mongodb/docs-worker-pool/netlify-poc/scripts/build-site.sh -o build-site.sh
sh build-site.sh
6 changes: 6 additions & 0 deletions netlify.toml
@@ -0,0 +1,6 @@
[[integrations]]
name = "snooty-cache-plugin"

[build]
publish = "snooty/public"
command = ". ./build.sh"
10 changes: 10 additions & 0 deletions package-lock.json

Some generated files are not rendered by default.

7 changes: 7 additions & 0 deletions package.json
@@ -0,0 +1,7 @@
{
"name": "docs-spark-connector",
"lockfileVersion": 3,
"requires": true,
"packages": {}
}

4 changes: 2 additions & 2 deletions source/batch-mode.txt
@@ -10,8 +10,8 @@ Batch Mode

.. toctree::

/batch-mode/batch-read
/batch-mode/batch-write
Read </batch-mode/batch-read>
Write </batch-mode/batch-write>

Overview
--------
27 changes: 18 additions & 9 deletions source/batch-mode/batch-read-config.txt
@@ -10,6 +10,13 @@ Batch Read Configuration Options
:depth: 1
:class: singlecol

.. facet::
:name: genre
:values: reference

.. meta::
:keywords: partitioner, customize, settings

.. _spark-batch-input-conf:

Overview
@@ -107,12 +114,11 @@ You can configure the following properties when reading data from MongoDB in batch mode

[{"$match": {"closed": false}}, {"$project": {"status": 1, "name": 1, "description": 1}}]

.. important::

Custom aggregation pipelines must be compatible with the
partitioner strategy. For example, aggregation stages such as
``$group`` do not work with any partitioner that creates more than
one partition.
:gold:`IMPORTANT:` Custom aggregation pipelines must be
compatible with the partitioner strategy. For example,
aggregation stages such as
``$group`` do not work with any partitioner that creates more
than one partition.

* - ``aggregation.allowDiskUse``
- | Specifies whether to allow storage to disk when running the
@@ -212,9 +218,12 @@ based on your shard configuration.
To use this configuration, set the ``partitioner`` configuration option to
``com.mongodb.spark.sql.connector.read.partitioner.ShardedPartitioner``.

.. warning::

This partitioner is not compatible with hashed shard keys.
.. important:: ShardedPartitioner Restrictions

1. In MongoDB Server v6.0 and later, the sharding operation creates one large initial
chunk to cover all shard key values, making the sharded partitioner inefficient.
We do not recommend using the sharded partitioner when connected to MongoDB v6.0 and later.
2. The sharded partitioner is not compatible with hashed shard keys.

.. _conf-mongopaginatebysizepartitioner:
.. _conf-paginatebysizepartitioner:
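
For context, the following is a minimal PySpark sketch of how the batch read options discussed in this file might be passed. It is illustrative only and not part of this diff; the connection URI, database, collection, and pipeline values are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("batch-read-example").getOrCreate()

    df = (
        spark.read.format("mongodb")
        # Placeholder connection details.
        .option("connection.uri", "mongodb://localhost:27017")
        .option("database", "sales")
        .option("collection", "orders")
        # Filter and project documents on the server before Spark reads them.
        .option(
            "aggregation.pipeline",
            '[{"$match": {"closed": false}}, '
            '{"$project": {"status": 1, "name": 1, "description": 1}}]',
        )
        # The pipeline must stay compatible with the chosen partitioner;
        # stages such as $group do not work with multi-partition partitioners.
        .option(
            "partitioner",
            "com.mongodb.spark.sql.connector.read.partitioner.ShardedPartitioner",
        )
        .load()
    )
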
2 changes: 1 addition & 1 deletion source/batch-mode/batch-read.txt
@@ -7,7 +7,7 @@ Read from MongoDB in Batch Mode
.. toctree::
:caption: Batch Read Configuration Options

/batch-mode/batch-read-config
Configuration </batch-mode/batch-read-config>

.. contents:: On this page
:local:
4 changes: 2 additions & 2 deletions source/batch-mode/batch-write.txt
@@ -7,7 +7,7 @@ Write to MongoDB in Batch Mode
.. toctree::
:caption: Batch Write Configuration Options

/batch-mode/batch-write-config
Configuration </batch-mode/batch-write-config>

Overview
--------
@@ -48,7 +48,7 @@ Overview
- Time-series collections

To learn more about save modes, see the
`Spark SQL Guide <https://spark.apache.org/docs/3.2.0/sql-data-sources-load-save-functions.html#save-modes>`__.
`Spark SQL Guide <https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#save-modes>`__.

.. important::

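
For context, the following is a minimal PySpark sketch of a batch write that uses one of the Spark SQL save modes mentioned above. It is illustrative only and not part of this diff; the DataFrame contents and connection values are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("batch-write-example").getOrCreate()

    # Hypothetical DataFrame to persist.
    people = spark.createDataFrame([("Ada", 36), ("Grace", 45)], ["name", "age"])

    (
        people.write.format("mongodb")
        .mode("append")  # Spark SQL save mode
        # Placeholder connection details.
        .option("connection.uri", "mongodb://localhost:27017")
        .option("database", "people_db")
        .option("collection", "people")
        .save()
    )
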
32 changes: 32 additions & 0 deletions source/getting-started.txt
@@ -45,6 +45,38 @@ Getting Started

.. include:: /scala/api.rst

Integrations
------------

The following sections describe some popular third-party platforms with which you can
integrate Spark and the {+connector-long+}.

Amazon EMR
~~~~~~~~~~

Amazon EMR is a managed cluster platform that you can use to run big data frameworks like Spark. To install Spark on an EMR cluster, see
`Getting Started with Amazon EMR <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html>`__ in the AWS documentation.

Databricks
~~~~~~~~~~

Databricks is an analytics platform for building, deploying, and sharing enterprise-level data and analytics solutions. To integrate the {+connector-long+} with Databricks,
see `MongoDB <https://docs.databricks.com/aws/en/connect/external-systems/mongodb>`__ in the Databricks documentation.

Docker
~~~~~~

Docker is an open-source platform that helps developers build, share, and run applications in containers.

- To start Spark in a Docker container, see `Apache Spark <https://hub.docker.com/r/apache/spark#!>`__ in the Docker documentation and follow the steps provided.
- To learn how to deploy Atlas on Docker, see `Create a Local Atlas Deployment with Docker <https://www.mongodb.com/docs/atlas/cli/current/atlas-cli-deploy-docker/>`__.

Kubernetes
~~~~~~~~~~

Kubernetes is an open-source platform for automating the deployment and management of containerized applications. To run Spark on Kubernetes,
see `Running Spark on Kubernetes <https://spark.apache.org/docs/3.5.4/running-on-kubernetes.html>`__ in the Spark documentation.
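
Whichever platform hosts Spark, the connector is typically attached to the Spark session itself. The following is a minimal PySpark sketch; it is illustrative only and not part of this diff, and the package coordinate and connection strings are assumptions.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("getting-started")
        # Assumed connector coordinate; match your Scala and connector versions.
        .config(
            "spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.12:10.3.0",
        )
        # Placeholder connection strings.
        .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017/test.coll")
        .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017/test.coll")
        .getOrCreate()
    )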

Tutorials
---------

5 changes: 0 additions & 5 deletions source/includes/data-source.rst

This file was deleted.

4 changes: 0 additions & 4 deletions source/includes/note-trigger-method.rst

This file was deleted.

13 changes: 0 additions & 13 deletions source/includes/scala-java-explicit-schema.rst

This file was deleted.

24 changes: 12 additions & 12 deletions source/index.txt
@@ -2,6 +2,18 @@
MongoDB Connector for Spark
===========================

.. toctree::
:titlesonly:

Get Started <getting-started>
Configure Spark <configuration>
Configure TLS/SSL <tls>
Batch Mode </batch-mode>
Streaming Mode </streaming-mode>
FAQ <faq>
Release Notes <release-notes>
API Documentation <api-docs>

The `MongoDB Connector for Spark
<https://www.mongodb.com/products/spark-connector>`_ provides
integration between MongoDB and Apache Spark.
@@ -41,15 +53,3 @@ versions of Apache Spark and MongoDB:
* - **{+current-version+}**
- **3.1 through 3.5**
- **4.0 or later**

.. toctree::
:titlesonly:

Getting Started <getting-started>
configuration
tls
/batch-mode
/streaming-mode
faq
release-notes
api-docs
4 changes: 2 additions & 2 deletions source/streaming-mode.txt
@@ -12,8 +12,8 @@ Streaming Mode

.. toctree::

/streaming-mode/streaming-read
/streaming-mode/streaming-write
Read </streaming-mode/streaming-read>
Write </streaming-mode/streaming-write>

Overview
--------
36 changes: 14 additions & 22 deletions source/streaming-mode/streaming-read-config.txt
@@ -82,12 +82,10 @@ You can configure the following properties when reading data from MongoDB in streaming mode

[{"$match": {"closed": false}}, {"$project": {"status": 1, "name": 1, "description": 1}}]

.. important::

Custom aggregation pipelines must be compatible with the
partitioner strategy. For example, aggregation stages such as
``$group`` do not work with any partitioner that creates more than
one partition.
Custom aggregation pipelines must be compatible with the
partitioner strategy. For example, aggregation stages such as
``$group`` do not work with any partitioner that creates more than
one partition.

* - ``aggregation.allowDiskUse``
- | Specifies whether to allow storage to disk when running the
@@ -135,14 +133,12 @@ You can configure the following properties when reading a change stream from MongoDB
original document and updated document, but it also includes a copy of the
entire updated document.

For more information on how this change stream option works,
see the MongoDB server manual guide
:manual:`Lookup Full Document for Update Operation </changeStreams/#lookup-full-document-for-update-operations>`.

**Default:** "default"

.. tip::

For more information on how this change stream option works,
see the MongoDB server manual guide
:manual:`Lookup Full Document for Update Operation </changeStreams/#lookup-full-document-for-update-operations>`.

* - ``change.stream.micro.batch.max.partition.count``
- | The maximum number of partitions the {+connector-short+} divides each
micro-batch into. Spark workers can process these partitions in parallel.
@@ -151,11 +147,9 @@ You can configure the following properties when reading a change stream from MongoDB
|
| **Default**: ``1``

.. warning:: Event Order

Specifying a value larger than ``1`` can alter the order in which
the {+connector-short+} processes change events. Avoid this setting
if out-of-order processing could create data inconsistencies downstream.
:red:`WARNING:` Specifying a value larger than ``1`` can alter the order in which
the {+connector-short+} processes change events. Avoid this setting
if out-of-order processing could create data inconsistencies downstream.

* - ``change.stream.publish.full.document.only``
- | Specifies whether to publish the changed document or the full
@@ -174,12 +168,10 @@ You can configure the following properties when reading a change stream from MongoDB
- If you don't specify a schema, the connector infers the schema
from the change stream document.

**Default**: ``false``
This setting overrides the ``change.stream.lookup.full.document``
setting.

.. note::

This setting overrides the ``change.stream.lookup.full.document``
setting.
**Default**: ``false``

* - ``change.stream.startup.mode``
- | Specifies how the connector starts up when no offset is available.
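
For context, the following is a minimal PySpark sketch of a streaming read that uses one of the change stream options described in this file. It is illustrative only and not part of this diff; the connection URI, namespace, and checkpoint path are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("streaming-read-example").getOrCreate()

    stream_df = (
        spark.readStream.format("mongodb")
        # Placeholder connection details.
        .option("connection.uri", "mongodb://localhost:27017")
        .option("database", "sales")
        .option("collection", "orders")
        # Publish only the changed document instead of the full change event,
        # letting the connector infer the schema from the changed documents.
        .option("change.stream.publish.full.document.only", "true")
        .load()
    )

    query = (
        stream_df.writeStream.format("console")
        .option("checkpointLocation", "/tmp/checkpoint-dir")  # assumed path
        .outputMode("append")
        .start()
    )
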
2 changes: 1 addition & 1 deletion source/streaming-mode/streaming-read.txt
@@ -7,7 +7,7 @@ Read from MongoDB in Streaming Mode
.. toctree::
:caption: Streaming Read Configuration Options

/streaming-mode/streaming-read-config
Configuration </streaming-mode/streaming-read-config>

.. contents:: On this page
:local: