32 changes: 32 additions & 0 deletions source/release-notes.txt
@@ -2,6 +2,38 @@
Release Notes
=============

MongoDB Connector for Spark 10.3
--------------------------------

The 10.3 connector release includes the following new features:

- Added support for reading multiple collections when using micro-batch or
continuous streaming modes.

.. warning:: Breaking Change

Support for reading multiple collections introduces the following breaking
changes:

- If the name of a collection used in your ``collection`` configuration
option contains a comma, the
{+connector-short+} treats it as two different collections. To avoid
this, you must escape the comma by preceding it with a backslash (\\).

- If the name of a collection used in your ``collection`` configuration
option is "*", the {+connector-short+} interprets it as a specification
to scan all collections. To avoid this, you must escape the asterisk by preceding it
with a backslash (\\).

- If the name of a collection used in your ``collection`` configuration
option contains a backslash (\\), the
{+connector-short+} treats the backslash as an escape character, which
might change how it interprets the value. To avoid this, you must escape
the backslash by preceding it with another backslash.

To learn more about scanning multiple collections, see the :ref:`collection
configuration property <spark-streaming-input-conf>` description.

MongoDB Connector for Spark 10.2
--------------------------------

67 changes: 65 additions & 2 deletions source/streaming-mode/streaming-read-config.txt
@@ -46,6 +46,10 @@ You can configure the following properties when reading data from MongoDB in streaming mode
* - ``collection``
- | **Required.**
| The collection name configuration.
| You can specify multiple collections by separating the collection names
with a comma.
|
| To learn more about specifying multiple collections, see :ref:`spark-specify-multiple-collections`.

* - ``comment``
- | The comment to append to the read operation. Comments appear in the
@@ -168,7 +172,7 @@ You can configure the following properties when reading a change stream from MongoDB
omit the ``fullDocument`` field and publishes only the value of the
field.
- If you don't specify a schema, the connector infers the schema
from the change stream document.

**Default**: ``false``

@@ -203,4 +207,63 @@ You can configure the following properties when reading a change stream from MongoDB
Specifying Properties in ``connection.uri``
-------------------------------------------

.. include:: /includes/connection-read-config.rst

.. _spark-specify-multiple-collections:

Specifying Multiple Collections in the ``collection`` Property
--------------------------------------------------------------

You can specify multiple collections in the ``collection`` change stream
configuration property by separating the collection names
with a comma. Do not add a space between the collection names unless the
space is part of a collection name.

.. note:: Performance Considerations

   When the connector infers the schema of a stream from multiple
   collections, it samples each collection sequentially. Inferring the
   schema of a large number of collections can take noticeable time.
   After the connector infers the schema, the sequential sampling stops
   and has no further effect on streaming performance.

Specify multiple collections as shown in the following example:

.. code-block:: java

...
.option("spark.mongodb.collection", "collectionOne,collectionTwo")

.. note::

If a collection name is "*", or if the name includes a comma or a backslash (\\),
you must escape the character as follows (see the example after this list):

- If the name of a collection used in your ``collection`` configuration
option contains a comma, the {+connector-short+} treats it as two different
collections. To avoid this, you must escape the comma by preceding it with
a backslash (\\).

- If the name of a collection used in your ``collection`` configuration
option is "*", the {+connector-short+} interprets it as a specification
to scan all collections. To avoid this, you must escape the asterisk by preceding it
with a backslash (\\).

- If the name of a collection used in your ``collection`` configuration
option contains a backslash (\\), the
{+connector-short+} treats the backslash as an escape character, which
might change how it interprets the value. To avoid this, you must escape
the backslash by preceding it with another backslash.
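
For example, the following sketch shows how escaped collection names appear
in Java string literals. The collection names are hypothetical, and each
literal backslash must itself be written as ``\\`` in Java source:

.. code-block:: java

   ...
   // Reads the two collections named "sales,2024" and "inventory";
   // the comma in the first name is escaped with a backslash:
   .option("spark.mongodb.collection", "sales\\,2024,inventory")

   ...
   // Reads the single collection literally named "*":
   .option("spark.mongodb.collection", "\\*")

   ...
   // Reads the collection named "temp\data"; the backslash in the name
   // is escaped with another backslash:
   .option("spark.mongodb.collection", "temp\\\\data")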

You can stream from all collections in the database by passing an
asterisk (*) as a string for the collection name.

Specify all collections as shown in the following example:

.. code-block:: java

...
.option("spark.mongodb.collection", "*")

If you create a collection while streaming from all collections, the new
collection is automatically included in the stream.

You can drop collections at any time while streaming from multiple collections.
20 changes: 14 additions & 6 deletions source/streaming-mode/streaming-read.txt
@@ -15,6 +15,13 @@ Read from MongoDB in Streaming Mode
:depth: 1
:class: singlecol

.. facet::
:name: genre
:values: reference

.. meta::
:keywords: change stream

Overview
--------

@@ -344,12 +351,13 @@ The following example shows how to stream data from MongoDB to your console.

.. important:: Inferring the Schema of a Change Stream

If you set the ``change.stream.publish.full.document.only``
option to ``true``, the {+connector-short+} infers the schema of a ``DataFrame``
by using the schema of the scanned documents. If you set the option to
``false``, you must specify a schema.

Schema inference happens at the beginning of streaming and does not take
into account collections that are created during streaming.
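
For example, the following sketch (connection details are placeholders, and
the option name follows the ``spark.mongodb.`` prefix pattern used in the
examples above) publishes only full documents so that the connector infers
the schema from the scanned documents:

.. code-block:: java

   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;

   // Assumes an existing SparkSession named "spark".
   // Publish only the fullDocument field of each change event; the
   // connector infers the schema of the DataFrame from the scanned
   // documents.
   Dataset<Row> streamingDf = spark.readStream()
       .format("mongodb")
       .option("spark.mongodb.connection.uri", "<connection string>")
       .option("spark.mongodb.database", "<database name>")
       .option("spark.mongodb.collection", "<collection name>")
       .option("spark.mongodb.change.stream.publish.full.document.only", "true")
       .load();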
Member:

The information from the "Performance considerations" section of the
"Documentation Changes Summary" is included neither in this note nor
anywhere else in the PR.

Contributor Author:

Added to the section here.

Member:

We should change the new text added in the PR:

   When streaming from multiple collections, the connector reads from each
   collection sequentially. Streaming from a large number of collections
   can cause slower performance.

I should clarify the "Performance considerations" section from the
"Documentation Changes Summary"; I apologize for not expressing this
information more verbosely and clearly on the first attempt.

The "Performance considerations" section is a subsection of the "Schema
inference" section; that is, it covers performance considerations with
regard to schema inference. Note also that the term "sampling" is used
alongside the term "scanning" (a.k.a. "reading"): "When scanning multiple
collections, each collection is sampled sequentially." When the connector
infers a schema, it runs $sample against the collections and then infers
the schema from the sampled documents. When multiple collections are
involved, they are sampled sequentially ($sample works only with a single
collection). If the connector is configured to read from multiple
collections and to infer the schema, sampling them all sequentially may
take noticeable time; that is, schema inference may take noticeable time.
However, once the schema is ready, there is no more sequential sampling
and, consequently, no performance implications caused by it.

Contributor Author:

Thanks for the clarification here. I think I've got it in the right spot,
and adjusted the wording a bit. I put the admonition in the new "multiple
collections" section and reverted the original schema inference note back
to (almost) what it was previously (since that note isn't specifically
about multiple collections).


For more information about this setting, and to see a full list of change stream
configuration options, see the