diff --git a/source/batch-mode/batch-read-config.txt b/source/batch-mode/batch-read-config.txt
index d97de93..0127fec 100644
--- a/source/batch-mode/batch-read-config.txt
+++ b/source/batch-mode/batch-read-config.txt
@@ -135,6 +135,13 @@ You can configure the following properties when reading data from MongoDB in bat
        |
        | **Default:** ``false``
 
+   * - ``schemaHint``
+     - | Specifies a partial schema of known field types to use when inferring
+         the schema for the collection. To learn more about the ``schemaHint``
+         option, see the :ref:`spark-schema-hint` section.
+       |
+       | **Default:** None
+
 .. _partitioner-conf:
 
 Partitioner Configurations
diff --git a/source/batch-mode/batch-read.txt b/source/batch-mode/batch-read.txt
index bc59ba9..74c2fc2 100644
--- a/source/batch-mode/batch-read.txt
+++ b/source/batch-mode/batch-read.txt
@@ -57,6 +57,77 @@ Schema Inference
 
 .. include:: /scala/schema-inference.rst
 
+.. _spark-schema-hint:
+
+Specify Known Fields with Schema Hints
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can supply a partial schema of known field types to use during
+schema inference by specifying the ``schemaHint`` configuration option. You can
+specify the ``schemaHint`` option in any of the following Spark formats:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 35 65
+
+   * - Type
+     - Format
+
+   * - DDL
+     - ``<field-name> <field-type>, <field-name> <field-type>``
+
+   * - SQL DDL
+     - ``STRUCT<<field-name>: <field-type>, <field-name>: <field-type>>``
+
+   * - JSON
+     - .. code-block:: json
+          :copyable: false
+
+          { "type": "struct", "fields": [
+            { "name": "<field-name>", "type": "<field-type>", "nullable": <true|false> },
+            { "name": "<field-name>", "type": "<field-type>", "nullable": <true|false> }]}
+
+The following example shows how to generate the ``schemaHint`` value in each
+format by using the Spark shell. The example specifies a string-valued field
+named ``"value"`` and an integer-valued field named ``"count"``.
+
+.. code-block:: scala
+
+   import org.apache.spark.sql.types._
+
+   val mySchema = StructType(Seq(
+     StructField("value", StringType),
+     StructField("count", IntegerType)))
+
+   // Generate DDL format
+   mySchema.toDDL
+
+   // Generate SQL DDL format
+   mySchema.sql
+
+   // Generate Simple String format
+   mySchema.simpleString
+
+   // Generate JSON format
+   mySchema.json
+
+You can also specify the ``schemaHint`` option in the Simple String format
+or in JSON format by using PySpark, as shown in the following example:
+
+.. code-block:: python
+
+   from pyspark.sql.types import StructType, StructField, StringType, IntegerType
+
+   mySchema = StructType([
+     StructField('value', StringType(), True),
+     StructField('count', IntegerType(), True)])
+
+   # Generate Simple String format
+   mySchema.simpleString()
+
+   # Generate JSON format
+   mySchema.json()
+
 Filters
 -------
diff --git a/source/streaming-mode/streaming-read-config.txt b/source/streaming-mode/streaming-read-config.txt
index 997d175..57f89a6 100644
--- a/source/streaming-mode/streaming-read-config.txt
+++ b/source/streaming-mode/streaming-read-config.txt
@@ -109,6 +109,13 @@ You can configure the following properties when reading data from MongoDB in str
        |
        | **Default:** ``false``
 
+   * - ``schemaHint``
+     - | Specifies a partial schema of known field types to use when inferring
+         the schema for the collection. To learn more about the ``schemaHint``
+         option, see the :ref:`spark-schema-hint` section.
+       |
+       | **Default:** None
+
 .. _change-stream-conf:
 
 Change Stream Configuration
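
A minimal usage sketch of the new option, not taken from the patched files: it assumes the connector's ``mongodb`` source name and hypothetical ``database`` and ``collection`` values, and passes the same DDL string that ``mySchema.toDDL`` produces in the example above.

.. code-block:: scala

   // Sketch: pass a partial schema hint (DDL format) on a batch read.
   // The database and collection names here are hypothetical.
   val df = spark.read
     .format("mongodb")
     .option("database", "test")
     .option("collection", "items")
     .option("schemaHint", "value STRING, count INT")
     .load()

   // Fields named in the hint keep their declared types;
   // remaining fields are still inferred by sampling.
   df.printSchema()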