Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOCSP-36546 Scan Multiple Collections #193

Conversation

jordan-smith721
Copy link
Contributor

Pull Request Info

PR Reviewing Guidelines

JIRA - https://jira.mongodb.org/browse/DOCSP-36546
Staging -
Release Notes
Streaming Read
Streaming Read Config

Self-Review Checklist

  • Is this free of any warnings or errors in the RST?
  • Did you run a spell-check?
  • Did you run a grammar-check?
  • Are all the links working?
  • Are the facets and meta keywords accurate?

Copy link
Contributor

@mongoKart mongoKart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hard to judge some of the technical points, but the copy looks good. a few suggestions for clarity

Copy link
Contributor

@mongoKart mongoKart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@rozza rozza requested a review from stIncMale February 20, 2024 15:56
Comment on lines 352 to 360
.. important:: Inferring the Schema of a Change Stream

When the {+connector-short+} infers the schema of a DataFrame
read from a change stream, by default,
it uses the schema of the underlying collection rather than that
of the change stream. If you set the ``change.stream.publish.full.document.only``
option to ``true``, the connector uses the schema of the
change stream instead.
If you set the ``change.stream.publish.full.document.only``
option to ``true``, the {+connector-short+} infers the schema of a ``DataFrame``
by using the schema of the scanned documents. If you set the option to
``false``, you must specify a schema.

Schema inference happens at the beginning of streaming, and does not take into
account collections that are created during streaming.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The information from the "Performance considerations" section from "Documentation Changes Summary" is included neither to this note, not anywhere else in the PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to the section here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should change the new text added in the PR:

When streaming from multiple collections, the connector reads from each collection sequentially. Streaming from a large number of collections cause slower performance.

I should clarify the "Performance considerations" section from from "Documentation Changes Summary", I apologize for not expressing this information more verbousely and clearly on the first attempt.

The "Performance considerations" section is a subsection of the "Schema inference" section, i.e., this section talks about the performance consideration with regard to schema inference. Note also that the term "sampling" is used alongside the term "scanning" (a.k.a. "reading"): "When scanning multiple collections, each collection is sampled sequentially." When the connector infers a schema, it $samples collections, and then infers the schema based on the sample documents. When multiple collections are involved, they are sampled sequentially ($sample works only with a single collection). If the connector is configured to read from multiple collections and to infer the schema, sampling them all sequentially may take noticeable time, i.e., schema inference may take noticeable time. However, once the schema is ready, there is no more sequential sampling and, consequently, no performance implications caused by it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification here. I think I've got it in the right spot, and adjusted the wording a bit. I put the admonition in the new "multiple collections" section and reverted the original schema inference note back to (almost) what it was previously (since that note isn't specifically about multiple collections).

Copy link
Member

@stIncMale stIncMale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

Comment on lines 285 to 290
.. important:: Inferring the Schema of a Change Stream

If you set the ``change.stream.publish.full.document.only``
option to ``true``, the {+connector-short+} infers the schema of a ``DataFrame``
by using the schema of the scanned documents. If you set the option to
``false``, you must specify a schema.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[optional]

This part of the note duplicates the note in streaming-read.txt. While I find it weird to have such duplication, it's up to the docs team. However, the source code for both notes is also duplicated, which opens opportunities to update one of the notes and not the other, thus causing confusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were a few minor differences in the two different admonitions, so unfortunately I couldn't single-source them. If this does begin causing issues we can look at rewording/moving things (or maybe just removing one of the duplicates)

Comment on lines 295 to 296
each collection sequentially. Streaming from a large number of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[optional]

My original wording in the "Documentation Changes Summary" was proven to be unclear, and now I am afraid to leave users confused. I am suggesting the following additions:

Suggested change
When streaming from multiple collections, the connector samples
each collection sequentially. Streaming from a large number of
When streaming from multiple collections, and inferring the schema,
the connector samples each collection sequentially
as part of the schema inference. Streaming from a large number of

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to this to make it extra clear

Copy link
Member

@stIncMale stIncMale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving because the only outstanding suggestions are optional.

@jordan-smith721 jordan-smith721 merged commit 23deda1 into mongodb:master Feb 28, 2024
1 of 2 checks passed
@jordan-smith721 jordan-smith721 deleted the DOCSP-36546-multiple-collections-support branch February 28, 2024 17:35
jordan-smith721 added a commit that referenced this pull request Feb 28, 2024
jordan-smith721 added a commit that referenced this pull request Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants