DOCSP-47835 Clarify batchsize behavior #215
Conversation
✅ Deploy Preview for docs-php-library ready!
I have left some comments that should give context to revise this PR
If batchSize is set, ``aggregate`` and any subsequent ``getMore`` commands return the smaller of 16 megabytes of data or batchSize documents. Operations of type ``find``, ``aggregate``, ``listIndexes``, and ``listCollections`` return a maximum of ``16`` megabytes per batch.
S: these middle sentences don't communicate the information effectively.
My understanding is that this description should explain the following points:
- Specifies the batch size for the cursor
- Applies to the initial command and requests for more batches
- Cannot increase the batch limit beyond 16mb and can only decrease/limit the batch size
- If you set the batch size to a number of documents whose combined size exceeds 16 MB in a batch, the option has no effect.
Question: to my knowledge, the default size that applies to the initial command (101 documents) and the default size for requests for more batches (up to 16 MB) are different. Should I clarify that as well? I was a little confused by the sections about batchSize on the find() and aggregate() pages especially, because those, to my knowledge, are only initial commands (getMore is the request for more batches). So on the find() and aggregate() pages, do you think it makes sense to talk only about the initial command and not the requests for more batches?
I think it works like this -
- default setting - 101 documents for first batch, 16 mb limit for subsequent
- custom setting (let's say 60) - 60 documents in first batch, 60 documents in subsequent batches as long as the total data size does not exceed 16 mb
When using the driver, users do not need to manually run a getMore
command. The driver takes care of this internally. Thus, even though it is under the hood, the batch size would still apply to both the first and subsequent batches AFAIK. @jmikola do you have any insight into this clarification?
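The rule sketched above can be illustrated with a small simulation (plain Python, illustrative only, not driver code; the `next_batch` helper and its signature are invented for clarity):

```python
# Illustrative simulation of the batching rules described above:
#   - default: 101 documents in the first batch, 16 MiB cap per batch
#   - custom batchSize N: at most N documents per batch, still capped at 16 MiB
# Edge cases (e.g. a single oversized document) are ignored for simplicity.

MAX_BATCH_BYTES = 16 * 1024 * 1024  # 16 MiB per-batch cap
DEFAULT_FIRST_BATCH = 101           # server default for the first batch

def next_batch(docs, batch_size=None, first=False):
    """Return one batch from `docs`, a list of (doc, size_in_bytes) pairs.

    A batch_size of 0 yields an empty batch (cursor established, no docs).
    """
    if batch_size is not None:
        limit = batch_size
    else:
        limit = DEFAULT_FIRST_BATCH if first else float("inf")
    batch, total = [], 0
    for doc, size in docs:
        if len(batch) >= limit or total + size > MAX_BATCH_BYTES:
            break
        batch.append(doc)
        total += size
    return batch
```

With 200 small documents, the default first batch holds 101 documents, a custom setting of 60 holds 60, and very large documents hit the 16 MiB cap before the document-count limit.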
Oh, interesting, so getMore is not applicable at all in driver contexts? Good to know.
Also, by total data size, do you mean added up the total size of all of the batches, or that the size of each individual batch has to be under 16 mb but the total size of batches combined can be > 16 mb? I interpreted it as each individual batch has to be under 16 mb but the total size doesn't matter.
I also thought that if we had a custom setting it would apply only to documents after the first batch, given this line from the engineer discussing issues in the batchSize docs in the slack thread linked in the ticket: "batchSize is about setting a ceiling for all batches after the first (which is 101)". However, would love clarification from Jeremy as well.
Also, by total data size, do you mean added up the total size of all of the batches, or that the size of each individual batch has to be under 16 mb but the total size of batches combined can be > 16 mb? I interpreted it as each individual batch has to be under 16 mb but the total size doesn't matter.
AFAIK, there is no documented restriction on the total size of all batches. The 16 MiB limit applies to each batch, and is discussed in the batchSize docs for the getMore command.
I can confirm that applications do not manually run getMore commands. Drivers take care of that when iterating a cursor.
@rustagir's understanding about the default and custom setting in this comment is correct.
Relevant docs within drivers for this would be batchSize within the AggregateOptions and FindOptions sections of our CRUD spec, which talk about inheriting the option for subsequent getMore commands.
Additionally, our find/getMore/killCursors spec goes into more detail about how batchSize interacts with limit and skip. For example, when a limit is specified drivers may use a calculated value for batchSize to ensure we don't return more documents than necessary and guarantee that the server closes the cursor. This isn't something we need to rehash in driver docs, though, since it's handled internally. Furthermore, the server docs don't seem to talk about that either (in cursor.limit() and cursor.batchSize()).
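The limit/batchSize interaction described above could be sketched like this (an invented helper for illustration only, not the actual driver internals):

```python
# Illustrative sketch: when a limit is set, a driver can cap the batch size
# requested on the next command by the number of documents still needed, so
# the final batch exhausts the cursor and the server can close it.
# The function name and signature are invented for this example.

def effective_batch_size(batch_size, limit, returned_so_far):
    """Batch size to request on the next command, or None for the server default."""
    if limit is None:
        return batch_size          # no limit: use the configured batch size as-is
    remaining = max(limit - returned_so_far, 0)
    if batch_size is None:
        return remaining           # no batch size: just ask for what's left
    return min(batch_size, remaining)
```

For example, with `batchSize=100`, `limit=250`, and 200 documents already returned, the driver would request only the 50 remaining documents.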
Thanks Jeremy! Follow-up about MiB vs MB: is that difference salient in the docs? The internal engineering Slack convo/docs refer to MiB rather than MB, but the PHP driver docs all refer to MB (16 MB vs 16 MiB). In your knowledge, would that difference be important to engineers using the batchSize option?
I expect the references to "MB" in the existing PHP docs are just old, and could pre-date when the server manual started using "MiB" consistently. There is no reason these can't all be changed to "MiB", although that needn't be done in this PR.
Anything relevant to this PR should use "MiB", though, so we stop introducing more instances of "MB".
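For reference, the numeric gap between the two units is about 5% at this size (plain unit arithmetic, unrelated to any driver behavior):

```python
# Megabyte (SI, decimal) vs. mebibyte (IEC, binary).
MB = 10**6    # 1 megabyte  = 1,000,000 bytes
MiB = 2**20   # 1 mebibyte  = 1,048,576 bytes

print(16 * MB)   # 16000000
print(16 * MiB)  # 16777216
```

So "16 MB" understates the actual 16 MiB per-batch limit by 777,216 bytes, which is why the server manual standardized on "MiB".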
source/read/retrieve.txt
Outdated
- | The limit for each batch returned in a query result. The ``find`` method has an initial batch size of ``101`` documents by default and returns a maximum of 16 megabytes per batch. Subsequent ``getMore`` operations issued against the resulting cursor have no default batch size, so they are limited only by the ``16`` megabyte message size. batchSize can enforce a smaller limit than ``16`` megabytes, but not a larger one.
S: this revision seems to address the issue most clearly. Please also install the VS Code extension for line wrapping.
- | The limit for each batch returned in a query result. The ``find`` method has an initial batch size of ``101`` documents by default and returns a maximum of 16 megabytes per batch. Subsequent ``getMore`` operations issued against the resulting cursor have no default batch size, so they are limited only by the ``16`` megabyte message size. batchSize can enforce a smaller limit than ``16`` megabytes, but not a larger one.

Suggested change:

- | The size limit for the batches returned in a query result. By default, the ``find()`` method has an initial batch size of ``101`` documents and a maximum of 16 megabytes for each subsequent batch. This option can enforce a smaller limit than 16 megabytes, but not a larger one. If you set ``batchSize`` to a limit that results in batches larger than 16 MB, this option has no effect.
and a maximum of 16 megabytes for each subsequent batch. This option can enforce a smaller limit than 16 megabytes, but not a larger one. If you set ``batchSize`` to a limit that results in batches larger than 16 MB, this option has no effect.
Change streams have nothing to do with the find() method. I think we should preserve the original wording that referred to aggregate and "change events", since this is a snippet for watch() methods.
These use the aggregate command and $changeStream stage. A batch size for aggregate is expressed via the cursor.batchSize option on the outgoing command. For getMore commands (common to both aggregate and find), batchSize is a top-level option.
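The two command shapes described above can be illustrated as plain dictionaries (a sketch only; the collection name, pipeline, and cursor id are placeholders):

```python
# Where batchSize appears on the wire, per the comment above:
# nested under `cursor` for aggregate, top-level for getMore.

aggregate_cmd = {
    "aggregate": "mycoll",                 # placeholder collection name
    "pipeline": [{"$changeStream": {}}],   # $changeStream stage for watch()
    "cursor": {"batchSize": 60},           # cursor.batchSize option
}

get_more_cmd = {
    "getMore": 12345,                      # placeholder cursor id from the server
    "collection": "mycoll",
    "batchSize": 60,                       # top-level option on getMore
}
```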
In the interest of correctness and consistency (with other docs), we should refer to "mebibytes" or "MiB" instead of "megabytes" and "MB".
Adding on to this, I think it'd be more technically correct to refer to "aggregate command" instead of "watch() method". That helps communicate to users that this driver method is actually running an aggregate command on the server, and I think it's helpful since they can then cross-reference with the server's own API docs for commands.
I understand this might not be desirable in prose-style docs (e.g. read/retrieve.txt), but this snippet is included in our API reference.
Sorry, left a question above about MB vs MiB but saw you already answered it here! Feel free to ignore.
@jmikola For clarification, in this comment "Adding on to this, I think it'd be more technically correct to refer to "aggregate command" instead of "watch() method". That helps communicate to users that this driver method is actually running an aggregate command on the server, and I think it's helpful since they can then cross-reference with the server's own API docs for commands," when you say the "watch() method" did you mean the "find() method"? I don't think this doc refers to the watch() method. Just wanted to clarify to be sure I'm understanding correctly!
This file is extracts-watch-option.yaml and included in the docs for watch() methods on Client, Database, and Collection objects.
Ahh, I think I misunderstood the comment and understand what you're saying now. Thanks for the help!
@@ -52,9 +52,12 @@ Parameters

* - batchSize
  - integer
  - The number of documents to return in the first batch. Defaults to ``101``. A batchSize of ``0`` means that the cursor will be established, but no documents will be returned in the first batch.
The special behavior of batchSize: 0 is notable, and also discussed in the find command docs. Particularly with aggregate operations, there is a legitimate use case for wanting to start the aggregation on the server without waiting for the first result to become available. Allowing a cursor to be created immediately and iterated later (via batchSize: 0) satisfies that.
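A minimal sketch of the batchSize: 0 behavior described here (illustrative only; the helper is invented and models the documented semantics, not any driver implementation):

```python
# batchSize: 0 establishes the cursor but returns no documents in the
# first reply, letting the server start the operation (and surface any
# failure) immediately; documents are fetched on later iterations.

def first_batch(docs, batch_size):
    """Simulate the first server reply for a given batchSize."""
    if batch_size == 0:
        return []                  # cursor created; firstBatch is empty
    return docs[:batch_size]
```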
- Specifies the batch size for the cursor, which will apply to both the initial ``aggregate`` command and any subsequent ``getMore`` commands. This determines the maximum number of documents to return in each response from the server.
I think we can discuss batchSize: 0 here as we do for the find() option, since there is a legitimate use case for doing so. I'm OK with keeping this in the API reference and out of the prose docs (e.g. read/retrieve.txt), since it's more technical.
There is also a special directive in the CRUD spec for drivers to intentionally omit batchSize for pipelines that include $out or $merge, since it might prevent the pipeline from executing (and writing output). I don't think we need to mention that, though, as it is enforced internally (see: Aggregate.php).
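The $out/$merge carve-out could be modeled like this (a sketch of the behavior described above, not the actual Aggregate.php source; the function name is invented):

```python
# Illustrative: drop batchSize when the pipeline contains an $out or
# $merge stage, since capping the batch could prevent the write stage
# from executing. ($out/$merge must be the final stage in a pipeline.)

def cursor_options(pipeline, batch_size):
    last = pipeline[-1] if pipeline else {}
    if "$out" in last or "$merge" in last:
        return {}                          # intentionally omit batchSize
    return {"batchSize": batch_size}
```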
For additional context, I want to cite the docs from Cursor Batches, which the OP in DOCSP-47826 suggested was most clear on the subject.
A few small fixes, but I think the changes look good!
the maximum number of change events to return in each response from the

The maximum number of documents within each batch returned in a query result, which applies to the ``aggregate`` command. By default, the ``aggregate`` command has an initial batch size of ``101`` documents and a maximum size of 16 mebibytes for each subsequent batch. This
``101`` documents and a maximum size of 16 mebibytes for each subsequent batch. This

Suggested change:

``101`` documents and a maximum size of 16 mebibytes (MiB) for each subsequent batch. This
The maximum number of documents within each batch returned in a query result, which applies to the ``aggregate`` command. By default, the ``aggregate`` command has an initial batch size of ``101`` documents and a maximum size of 16 mebibytes for each subsequent batch. This option can enforce a smaller limit than 16 mebibytes, but not a larger
option can enforce a smaller limit than 16 mebibytes, but not a larger

Suggested change:

option can enforce a smaller limit than 16 MiB, but not a larger
This command determines the maximum number of change events to return in each response from the server.
Q: should you also remove this sentence?
Good call out, I'll remove!
source/read/retrieve.txt
Outdated
and a maximum size of 16 mebibytes for each subsequent batch. This option can enforce a smaller limit than 16 mebibytes, but not a larger one. If you set ``batchSize`` to a limit that results in batches larger than 16 MiB, this option has no effect.
and a maximum size of 16 mebibytes for each subsequent batch. This option can enforce a smaller limit than 16 mebibytes, but not a larger one. If you set ``batchSize`` to a limit that results in batches larger than 16 MiB, this option has no effect.

Suggested change:

and a maximum size of 16 mebibytes (MiB) for each subsequent batch. This option can enforce a smaller limit than 16 MiB, but not a larger one. If you set ``batchSize`` to a limit that results in batches larger than 16 MiB, this option has no effect.
command has an initial batch size of ``101`` documents. This option can enforce a smaller limit than 16 mebibytes, but not a larger one. If you set ``batchSize`` to a limit that results in batches larger than 16 MiB, this option has no effect.
command has an initial batch size of ``101`` documents. This option can enforce a smaller limit than 16 mebibytes, but not a larger one. If you set ``batchSize`` to a limit that results in batches larger than 16 MiB, this option has no effect.

Suggested change:

command has an initial batch size of ``101`` documents. This option can enforce a smaller limit than 16 mebibytes (MiB), but not a larger one. If you set ``batchSize`` to a limit that results in batches larger than 16 MiB, this option has no effect.
A batchSize of ``0`` means that the cursor will be established, but no documents will be returned in the first batch. This may be useful for quickly returning a cursor or failure from ``aggregate`` without doing significant server-side work.
I: remove aggregate from this description as it's about the find method. I think JM meant that setting this option is useful for aggregation contexts but that users might use it for finds too.
``101``. A batchSize of ``0`` means that the cursor will be established, but no documents will be returned in the first batch.

- The maximum number of documents to return in the first batch. By default, the ``find()`` command has an initial batch size of ``101`` documents. This
This should definitely specify:
...and a maximum size of 16 mebibytes for each subsequent batch
As-is, it's inconsistent with the other docs snippets and could mislead users into thinking this has no effect on getMore for subsequent batches.
source/reference/method/MongoDBCollection-listSearchIndexes.txt
Outdated
Irrespective of the ``batchSize`` option, the initial ``aggregate`` command response for a change stream generally does not include any documents unless another option is used to configure its starting point (e.g. ``startAfter``).
I didn't notice this earlier, but this paragraph seems like a relevant bit of information to preserve.
- The maximum number of documents within each batch returned in a query result, which applies to the ``aggregate`` command. By default, the ``aggregate`` command has an initial batch size of ``101`` documents and a maximum size of 16 mebibytes (MiB) for each subsequent batch. This option can enforce a smaller limit than 16 MiB, but not a larger one. If you set ``batchSize`` to a limit that results in batches larger than 16 MiB, this option has no effect.
- The maximum number of documents within each batch returned in a query result, which applies to the ``aggregate`` command. By default, the ``aggregate`` command has an initial batch size of ``101`` documents and a maximum size of 16 mebibytes (MiB) for each subsequent batch. This option can enforce a smaller limit than 16 MiB, but not a larger one. If you set ``batchSize`` to a limit that results in batches larger than 16 MiB, this option has no effect.

Suggested change:

- The maximum number of documents within each batch returned in the indexes list, which applies to the ``aggregate`` command. By default, the ``aggregate`` command has an initial batch size of ``101`` documents and a maximum size of 16 mebibytes (MiB) for each subsequent batch. This option can enforce a smaller limit than 16 MiB, but not a larger one. If you set ``batchSize`` to a limit that results in batches larger than 16 MiB, this option has no effect.
LGTM! Good work on this technical update.
source/read/retrieve.txt
Outdated
@@ -180,7 +180,12 @@ you can set in the array:

- Description

* - ``batchSize``
- | The number of documents to return per batch. The default value is ``101``.
- | The maximum number of documents within each batch returned in a query result. By default, the ``find()`` method has an initial batch size of ``101`` documents
Since this section is talking about both find() and findOne(), I think this can just refer to "the find command" (used by both helpers).
Therefore, this snippet will just duplicate the first paragraph from the batchSize docs in MongoDBCollection-find.txt.
Defer to you if "find() method" is better changed to "find command".
I'll leave a preemptive LGTM.
The backport to
To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-v1.20 v1.20
# Navigate to the new working tree
cd .worktrees/backport-v1.20
# Create a new branch
git switch --create backport-215-to-v1.20
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 f3cce81f8daeda749df7524e003b1e89c9e2d615
# Push it to GitHub
git push --set-upstream origin backport-215-to-v1.20
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-v1.20

Then, create a pull request where the
The backport to
To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-v1.21 v1.21
# Navigate to the new working tree
cd .worktrees/backport-v1.21
# Create a new branch
git switch --create backport-215-to-v1.21
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 f3cce81f8daeda749df7524e003b1e89c9e2d615
# Push it to GitHub
git push --set-upstream origin backport-215-to-v1.21
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-v1.21

Then, create a pull request where the
…chsize DOCSP-47835 Clarify batchsize behavior
Merge pull request #215 from shuangela/DOCSP-47835-clarify-batchsize
Pull Request Info
PR Reviewing Guidelines
JIRA - https://jira.mongodb.org/browse/DOCSP-47835
Staging - https://deploy-preview-215--docs-php-library.netlify.app/reference/method/MongoDBCollection-listSearchIndexes/#parameters
https://deploy-preview-215--docs-php-library.netlify.app/reference/method/MongoDBDatabase-aggregate/#parameters
https://deploy-preview-215--docs-php-library.netlify.app/reference/method/MongoDBCollection-find/#parameters
https://deploy-preview-215--docs-php-library.netlify.app/reference/method/MongoDBDatabase-watch/#parameters
https://deploy-preview-215--docs-php-library.netlify.app/read/retrieve/#modify-find-behavior
Note: This topic is a bit over my head in terms of technicalities of the batchsize parameter. I read the slack thread provided in the ticket as well as found the doc that I linked in the ticket about batchsize, but any suggestions about clarity or accuracy are greatly appreciated!
Self-Review Checklist