-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOCSP-36546 Scan Multiple Collections #193
DOCSP-36546 Scan Multiple Collections #193
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hard to judge some of the technical points, but the copy looks good. a few suggestions for clarity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
.. important:: Inferring the Schema of a Change Stream | ||
|
||
When the {+connector-short+} infers the schema of a DataFrame | ||
read from a change stream, by default, | ||
it uses the schema of the underlying collection rather than that | ||
of the change stream. If you set the ``change.stream.publish.full.document.only`` | ||
option to ``true``, the connector uses the schema of the | ||
change stream instead. | ||
If you set the ``change.stream.publish.full.document.only`` | ||
option to ``true``, the {+connector-short+} infers the schema of a ``DataFrame`` | ||
by using the schema of the scanned documents. If you set the option to | ||
``false``, you must specify a schema. | ||
|
||
Schema inference happens at the beginning of streaming, and does not take into | ||
account collections that are created during streaming. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The information from the "Performance considerations" section from "Documentation Changes Summary" is included neither to this note, not anywhere else in the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added to the section here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should change the new text added in the PR:
When streaming from multiple collections, the connector reads from each collection sequentially. Streaming from a large number of collections cause slower performance.
I should clarify the "Performance considerations" section from from "Documentation Changes Summary", I apologize for not expressing this information more verbousely and clearly on the first attempt.
The "Performance considerations" section is a subsection of the "Schema inference" section, i.e., this section talks about the performance consideration with regard to schema inference. Note also that the term "sampling" is used alongside the term "scanning" (a.k.a. "reading"): "When scanning multiple collections, each collection is sampled sequentially." When the connector infers a schema, it $sample
s collections, and then infers the schema based on the sample documents. When multiple collections are involved, they are sampled sequentially ($sample
works only with a single collection). If the connector is configured to read from multiple collections and to infer the schema, sampling them all sequentially may take noticeable time, i.e., schema inference may take noticeable time. However, once the schema is ready, there is no more sequential sampling and, consequently, no performance implications caused by it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification here. I think I've got it in the right spot, and adjusted the wording a bit. I put the admonition in the new "multiple collections" section and reverted the original schema inference note back to (almost) what it was previously (since that note isn't specifically about multiple collections).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
.. important:: Inferring the Schema of a Change Stream | ||
|
||
If you set the ``change.stream.publish.full.document.only`` | ||
option to ``true``, the {+connector-short+} infers the schema of a ``DataFrame`` | ||
by using the schema of the scanned documents. If you set the option to | ||
``false``, you must specify a schema. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[optional]
This part of the note duplicates the note in streaming-read.txt
. While I find it weird to have such duplication, it's up to the docs team. However, the source code for both notes is also duplicated, which opens opportunities to update one of the notes and not the other, thus causing confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were a few minor differences in the two different admonitions, so unfortunately I couldn't single-source them. If this does begin causing issues we can look at rewording/moving things (or maybe just removing one of the duplicates)
each collection sequentially. Streaming from a large number of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[optional]
My original wording in the "Documentation Changes Summary" was proven to be unclear, and now I am afraid to leave users confused. I am suggesting the following additions:
When streaming from multiple collections, the connector samples | |
each collection sequentially. Streaming from a large number of | |
When streaming from multiple collections, and inferring the schema, | |
the connector samples each collection sequentially | |
as part of the schema inference. Streaming from a large number of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added to this to make it extra clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving because the only outstanding suggestions are optional.
(cherry picked from commit 23deda1)
This reverts commit e206f09.
Pull Request Info
PR Reviewing Guidelines
JIRA - https://jira.mongodb.org/browse/DOCSP-36546
Staging -
Release Notes
Streaming Read
Streaming Read Config
Self-Review Checklist