DOCSP-36546 Scan Multiple Collections #193

jordan-smith721 · 2024-02-13T21:49:15Z

Pull Request Info

PR Reviewing Guidelines

JIRA - https://jira.mongodb.org/browse/DOCSP-36546
Staging -
Release Notes
Streaming Read
Streaming Read Config

Self-Review Checklist

Is this free of any warnings or errors in the RST?
Did you run a spell-check?
Did you run a grammar-check?
Are all the links working?
Are the facets and meta keywords accurate?

mongoKart

hard to judge some of the technical points, but the copy looks good. a few suggestions for clarity

source/release-notes.txt

source/streaming-mode/streaming-read-config.txt

source/streaming-mode/streaming-read.txt

mongoKart

lgtm!

stIncMale · 2024-02-20T17:13:42Z

source/streaming-mode/streaming-read.txt

 .. important:: Inferring the Schema of a Change Stream

-   When the {+connector-short+} infers the schema of a DataFrame
-   read from a change stream, by default,
-   it uses the schema of the underlying collection rather than that
-   of the change stream. If you set the ``change.stream.publish.full.document.only``
-   option to ``true``, the connector uses the schema of the 
-   change stream instead.
+   If you set the ``change.stream.publish.full.document.only``
+   option to ``true``, the {+connector-short+} infers the schema of a ``DataFrame``
+   by using the schema of the scanned documents. If you set the option to
+   ``false``, you must specify a schema.
+
+   Schema inference happens at the beginning of streaming, and does not take into
+   account collections that are created during streaming.


The information from the "Performance considerations" section of the "Documentation Changes Summary" is included neither in this note nor anywhere else in the PR.

Added to the section here

We should change the new text added in the PR:

When streaming from multiple collections, the connector reads from each collection sequentially. Streaming from a large number of collections can cause slower performance.

I should clarify the "Performance considerations" section from the "Documentation Changes Summary". I apologize for not expressing this information more verbosely and clearly on the first attempt.

The "Performance considerations" section is a subsection of the "Schema inference" section, i.e., it discusses performance considerations with regard to schema inference. Note also that the term "sampling" is used alongside the term "scanning" (a.k.a. "reading"): "When scanning multiple collections, each collection is sampled sequentially." When the connector infers a schema, it samples collections by using $sample, and then infers the schema from the sampled documents. When multiple collections are involved, they are sampled sequentially ($sample works only with a single collection). If the connector is configured to read from multiple collections and to infer the schema, sampling them all sequentially may take noticeable time, i.e., schema inference may take noticeable time. However, once the schema is ready, there is no more sequential sampling and, consequently, no performance implications caused by it.
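
To make the sampling step concrete, here is a toy Python model of the behavior described above. It is only a sketch under stated assumptions, not the connector's code: the in-memory `collections` dict and the `sample_documents` and `infer_schema` helpers are hypothetical stand-ins, and a real `$sample` stage picks documents at random rather than taking the first few.

```python
from typing import Any

# Hypothetical in-memory stand-in for two MongoDB collections.
collections: dict[str, list[dict[str, Any]]] = {
    "orders": [{"_id": 1, "total": 9.5}, {"_id": 2, "total": 3.0, "coupon": "X1"}],
    "users": [{"_id": 10, "name": "Ada"}],
}


def sample_documents(name: str, size: int) -> list[dict[str, Any]]:
    """Stand-in for a $sample stage, which targets exactly one collection.

    Assumption: takes the first `size` documents to stay deterministic,
    whereas a real $sample selects documents at random.
    """
    return collections[name][:size]


def infer_schema(names: list[str], sample_size: int = 100) -> dict[str, set[str]]:
    """Sample each collection one after another, then merge the field types."""
    schema: dict[str, set[str]] = {}
    for name in names:  # sequential: one sampling pass per collection
        for doc in sample_documents(name, sample_size):
            for field, value in doc.items():
                schema.setdefault(field, set()).add(type(value).__name__)
    return schema


schema = infer_schema(["orders", "users"])
print(sorted(schema))
```

In this model the number of sampling passes grows with the number of collections, which is the cost the comment attributes to schema inference. Once `infer_schema` has returned, nothing samples again.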

Thanks for the clarification here. I think I've got it in the right spot, and adjusted the wording a bit. I put the admonition in the new "multiple collections" section and reverted the original schema inference note back to (almost) what it was previously (since that note isn't specifically about multiple collections).

source/streaming-mode/streaming-read-config.txt

stIncMale

.

stIncMale · 2024-02-28T01:22:58Z

source/streaming-mode/streaming-read-config.txt

+.. important:: Inferring the Schema of a Change Stream
+
+   If you set the ``change.stream.publish.full.document.only``
+   option to ``true``, the {+connector-short+} infers the schema of a ``DataFrame``
+   by using the schema of the scanned documents. If you set the option to
+   ``false``, you must specify a schema.
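
The rule in this note can be expressed as a small validation check. The following Python sketch is illustrative only: `uses_schema_inference` is a hypothetical helper, not part of the connector, and it assumes that the option defaults to `false` when unset and that an explicitly supplied schema always takes precedence over inference.

```python
from typing import Optional

# Option name taken from the note above.
FULL_DOCUMENT_ONLY = "change.stream.publish.full.document.only"


def uses_schema_inference(options: dict[str, str], schema: Optional[str]) -> bool:
    """Return True if the schema would be inferred from scanned documents.

    Hypothetical helper. Assumptions: the option defaults to "false",
    and an explicit schema always wins over inference.
    """
    full_document_only = options.get(FULL_DOCUMENT_ONLY, "false").lower() == "true"
    if schema is not None:
        return False  # an explicit schema makes inference unnecessary
    if not full_document_only:
        # Mirrors the note: with the option set to false, a schema is required.
        raise ValueError(f"{FULL_DOCUMENT_ONLY}=false requires an explicit schema")
    return True


print(uses_schema_inference({FULL_DOCUMENT_ONLY: "true"}, None))
```

Failing early in a check like this surfaces a missing schema before the stream starts, rather than during streaming.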


[optional]

This part of the note duplicates the note in streaming-read.txt. While I find it weird to have such duplication, it's up to the docs team. However, the source code for both notes is also duplicated, which opens opportunities to update one of the notes and not the other, thus causing confusion.

There were a few minor differences in the two different admonitions, so unfortunately I couldn't single-source them. If this does begin causing issues we can look at rewording/moving things (or maybe just removing one of the duplicates)

source/streaming-mode/streaming-read-config.txt

stIncMale · 2024-02-28T02:25:33Z

source/streaming-mode/streaming-read-config.txt

+   When streaming from multiple collections, the connector samples
+   each collection sequentially. Streaming from a large number of


[optional]

My original wording in the "Documentation Changes Summary" proved to be unclear, and now I am afraid of leaving users confused. I am suggesting the following additions:

Suggested change

-   When streaming from multiple collections, the connector samples
-   each collection sequentially. Streaming from a large number of
+   When streaming from multiple collections, and inferring the schema,
+   the connector samples each collection sequentially
+   as part of the schema inference. Streaming from a large number of

Added to this to make it extra clear

stIncMale

Approving because the only outstanding suggestions are optional.

jordan-smith721 added 5 commits February 13, 2024 12:38

first draft

a508720

clarifications and rendering issues

2f9efbd

typos

ada0f3e

typo

4b725e8

taxonomy

a2f5b51

mongoKart requested changes Feb 13, 2024

Mike feedback

eddf93c

jordan-smith721 requested a review from mongoKart February 14, 2024 15:14

mongoKart approved these changes Feb 14, 2024

rozza requested a review from stIncMale February 20, 2024 15:56

stIncMale suggested changes Feb 20, 2024

jordan-smith721 added 4 commits February 20, 2024 10:57

Add missing info and move collection info to its own section

fa8bd86

indentation errors

e81bc1e

wording

6f80c56

typo

b3e46c1

jordan-smith721 requested a review from stIncMale February 20, 2024 19:29

stIncMale suggested changes Feb 23, 2024

jordan-smith721 added 2 commits February 26, 2024 08:59

adding additional information and examples

55ed099

fix escaping examples

2553f13

jordan-smith721 requested a review from stIncMale February 26, 2024 17:25

stIncMale reviewed Feb 28, 2024

stIncMale suggested changes Feb 28, 2024

stIncMale approved these changes Feb 28, 2024

small fixes

c737f06

jordan-smith721 merged commit 23deda1 into mongodb:master Feb 28, 2024

jordan-smith721 deleted the DOCSP-36546-multiple-collections-support branch February 28, 2024 17:35

jordan-smith721 added a commit that referenced this pull request Feb 28, 2024

DOCSP-36546 Scan Multiple Collections (#193)

e206f09

(cherry picked from commit 23deda1)

jordan-smith721 added a commit that referenced this pull request Feb 28, 2024

Revert "DOCSP-36546 Scan Multiple Collections (#193)"

79babb8

This reverts commit e206f09.

3 participants