DOCSP-41049 schemaHint option #205

7 changes: 7 additions & 0 deletions source/batch-mode/batch-read-config.txt
@@ -135,6 +135,13 @@ You can configure the following properties when reading data from MongoDB in bat
       |
       | **Default:** ``false``

   * - ``schemaHint``
     - | Specifies a partial schema of known field types to use when inferring
         the schema for the collection. To learn more about the ``schemaHint``
         option, see the :ref:`spark-schema-hint` section.
       |
       | **Default:** None

.. _partitioner-conf:

Partitioner Configurations
71 changes: 71 additions & 0 deletions source/batch-mode/batch-read.txt
@@ -57,6 +57,77 @@ Schema Inference

.. include:: /scala/schema-inference.rst

.. _spark-schema-hint:

Specify Known Fields with Schema Hints
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To supply known field names and types for the connector to use during
schema inference, set the ``schemaHint`` configuration option. You can
specify the ``schemaHint`` value in any of the following Spark formats:

.. list-table::
   :header-rows: 1
   :widths: 35 65

   * - Type
     - Format

   * - DDL
     - ``<field one name> <FIELD ONE TYPE>, <field two name> <FIELD TWO TYPE>``

   * - SQL DDL
     - ``STRUCT<<field one name>: <FIELD ONE TYPE>, <field two name>: <FIELD TWO TYPE>>``

   * - JSON
     - .. code-block:: json
          :copyable: false

          { "type": "struct", "fields": [
            { "name": "<field name>", "type": "<field type>", "nullable": <true/false> },
            { "name": "<field name>", "type": "<field type>", "nullable": <true/false> }]}
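As a rough illustration of the templates above, the following Python sketch
fills them in for a hypothetical schema with a string field ``name`` and an
integer field ``age``. The field names, the ``fields`` list, and the variable
names are assumptions made for this example. The type names follow Spark's
usual spellings: ``STRING`` and ``INT`` in the DDL formats, ``string`` and
``integer`` in JSON.

```python
import json

# Hypothetical fields for illustration: (name, DDL type name, JSON type name)
fields = [("name", "STRING", "string"), ("age", "INT", "integer")]

# DDL format: "<field name> <FIELD TYPE>" pairs separated by commas
ddl = ", ".join(f"{name} {ddl_type}" for name, ddl_type, _ in fields)

# SQL DDL format: the same pairs written as "name: TYPE" inside STRUCT<...>
sql_ddl = "STRUCT<" + ", ".join(
    f"{name}: {ddl_type}" for name, ddl_type, _ in fields
) + ">"

# JSON format: a struct object that lists each field's name, type, and nullability
json_hint = json.dumps({
    "type": "struct",
    "fields": [
        {"name": name, "type": json_type, "nullable": True}
        for name, _, json_type in fields
    ],
})

print(ddl)        # name STRING, age INT
print(sql_ddl)    # STRUCT<name: STRING, age: INT>
print(json_hint)
```

Building the strings by hand like this is only meant to show their shape. In
practice, generating them from a ``StructType`` is less error-prone.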

The following example uses the Spark shell to generate a ``schemaHint`` value
in each format. The schema contains a string-valued field named ``"value"``
and an integer-valued field named ``"count"``.

.. code-block:: scala

   import org.apache.spark.sql.types._

   val mySchema = StructType(Seq(
     StructField("value", StringType),
     StructField("count", IntegerType)))

   // Generate DDL format
   mySchema.toDDL

   // Generate SQL DDL format
   mySchema.sql

   // Generate Simple String DDL format
   mySchema.simpleString

   // Generate JSON format
   mySchema.json

By using PySpark, you can also generate the ``schemaHint`` value in the Simple
String DDL format or in JSON format, as shown in the following example:

.. code-block:: python

   from pyspark.sql.types import StructType, StructField, StringType, IntegerType

   mySchema = StructType([
     StructField('value', StringType(), True),
     StructField('count', IntegerType(), True)])

   # Generate Simple String DDL format
   mySchema.simpleString()

   # Generate JSON format
   mySchema.json()
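Once you have a schema string, you pass it to the reader as the value of the
``schemaHint`` option. The following is a minimal sketch, not a definitive
implementation: the helper name ``read_with_schema_hint`` is hypothetical, the
``mongodb`` data source name is assumed, and the connection URI, database, and
collection are assumed to be configured on the Spark session already.

```python
def read_with_schema_hint(spark, schema_hint):
    """Load a DataFrame from MongoDB, passing a partial schema as schemaHint.

    Assumes `spark` is an existing SparkSession whose MongoDB connection
    settings are configured elsewhere; the "mongodb" format name is an
    assumption about the installed connector.
    """
    return (
        spark.read.format("mongodb")
        .option("schemaHint", schema_hint)
        .load()
    )


# Simple String DDL form of the example schema, which is what
# mySchema.simpleString() returns for the "value" and "count" fields.
SCHEMA_HINT = "struct<value:string,count:int>"
```

For example, ``read_with_schema_hint(spark, mySchema.simpleString())`` would
load the collection with the known field types applied during inference.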

Filters
-------

7 changes: 7 additions & 0 deletions source/streaming-mode/streaming-read-config.txt
@@ -109,6 +109,13 @@ You can configure the following properties when reading data from MongoDB in str
       |
       | **Default:** ``false``

   * - ``schemaHint``
     - | Specifies a partial schema of known field types to use when inferring
         the schema for the collection. To learn more about the ``schemaHint``
         option, see the :ref:`spark-schema-hint` section.
       |
       | **Default:** None

.. _change-stream-conf:

Change Stream Configuration