DOCSP-36546 Scan Multiple Collections #193
mongoKart
left a comment
hard to judge some of the technical points, but the copy looks good. a few suggestions for clarity
.. important:: Inferring the Schema of a Change Stream

   When the {+connector-short+} infers the schema of a DataFrame
   read from a change stream, by default,
   it uses the schema of the underlying collection rather than that
   of the change stream. If you set the ``change.stream.publish.full.document.only``
   option to ``true``, the connector uses the schema of the
   change stream instead.
   If you set the ``change.stream.publish.full.document.only``
   option to ``true``, the {+connector-short+} infers the schema of a ``DataFrame``
   by using the schema of the scanned documents. If you set the option to
   ``false``, you must specify a schema.

   Schema inference happens at the beginning of streaming, and does not take into
   account collections that are created during streaming.
The information from the "Performance considerations" section of the "Documentation Changes Summary" is included neither in this note nor anywhere else in the PR.
We should change the new text added in the PR:
When streaming from multiple collections, the connector reads from each collection sequentially. Streaming from a large number of collections cause slower performance.
I should clarify the "Performance considerations" section from the "Documentation Changes Summary"; I apologize for not expressing this information more verbosely and clearly on the first attempt.
The "Performance considerations" section is a subsection of the "Schema inference" section, i.e., this section talks about the performance consideration with regard to schema inference. Note also that the term "sampling" is used alongside the term "scanning" (a.k.a. "reading"): "When scanning multiple collections, each collection is sampled sequentially." When the connector infers a schema, it $samples collections, and then infers the schema based on the sample documents. When multiple collections are involved, they are sampled sequentially ($sample works only with a single collection). If the connector is configured to read from multiple collections and to infer the schema, sampling them all sequentially may take noticeable time, i.e., schema inference may take noticeable time. However, once the schema is ready, there is no more sequential sampling and, consequently, no performance implications caused by it.
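To make the cost model concrete, here is a toy sketch in plain Python. It is not the connector's code: `sample` and `infer_schema` are hypothetical names, in-memory lists stand in for collections, and `random.sample` stands in for a `$sample` stage. It only shows that sampling happens once per collection, in sequence, before the schema is ready.

```python
import random

def sample(docs, size, rng):
    """Hypothetical stand-in for $sample: pick up to `size` docs from ONE collection."""
    return rng.sample(docs, min(size, len(docs)))

def infer_schema(collections, sample_size=2, seed=0):
    """Sample each collection in turn and merge the observed field types."""
    rng = random.Random(seed)
    schema = {}
    sampled_order = []
    # One sampling pass per collection, strictly one after another.
    for name, docs in collections.items():
        sampled_order.append(name)
        for doc in sample(docs, sample_size, rng):
            for field, value in doc.items():
                schema.setdefault(field, set()).add(type(value).__name__)
    return schema, sampled_order

# Illustrative data only.
collections = {
    "orders": [{"_id": 1, "total": 9.5}, {"_id": 2, "total": 3.0}],
    "users": [{"_id": 1, "name": "Ada"}],
}
schema, order = infer_schema(collections)
```

In this model the number of sampling passes grows with the number of collections, but the loop runs only once, up front. After `infer_schema` returns, nothing is sampled again.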
Thanks for the clarification here. I think I've got it in the right spot, and adjusted the wording a bit. I put the admonition in the new "multiple collections" section and reverted the original schema inference note back to (almost) what it was previously (since that note isn't specifically about multiple collections).
.. important:: Inferring the Schema of a Change Stream

   If you set the ``change.stream.publish.full.document.only``
   option to ``true``, the {+connector-short+} infers the schema of a ``DataFrame``
   by using the schema of the scanned documents. If you set the option to
   ``false``, you must specify a schema.
[optional]
This part of the note duplicates the note in streaming-read.txt. While I find it weird to have such duplication, it's up to the docs team. However, the source code for both notes is also duplicated, which opens opportunities to update one of the notes and not the other, thus causing confusion.
There were a few minor differences between the two admonitions, so unfortunately I couldn't single-source them. If this does begin causing issues, we can look at rewording/moving things (or maybe just removing one of the duplicates).
   When streaming from multiple collections, the connector samples
   each collection sequentially. Streaming from a large number of
[optional]
My original wording in the "Documentation Changes Summary" proved to be unclear, and now I am afraid of leaving users confused. I am suggesting the following additions:
-   When streaming from multiple collections, the connector samples
-   each collection sequentially. Streaming from a large number of
+   When streaming from multiple collections, and inferring the schema,
+   the connector samples each collection sequentially
+   as part of the schema inference. Streaming from a large number of
Added to this to make it extra clear
stIncMale
left a comment
Approving because the only outstanding suggestions are optional.
(cherry picked from commit 23deda1)
This reverts commit e206f09.
Pull Request Info
PR Reviewing Guidelines
JIRA - https://jira.mongodb.org/browse/DOCSP-36546
Staging -
Release Notes
Streaming Read
Streaming Read Config
Self-Review Checklist