A list of most operations in SpatialHadoop

SpatialHadoop ships with many spatial operations that work efficiently on a cluster and scale up to terabytes of data. This page describes these operations along with their parameters and customizations. You can list all supported operations by typing bin/shadoop. You can also get a quick description of a specific operation and its arguments by typing bin/shadoop <operation>.

Index

The index command is used to construct a spatial index for a given input file. Currently, there are three spatial indexes which can be constructed in SpatialHadoop, namely, Grid, R-tree and R+-tree.

A grid index partitions the data according to a uniform grid whose dimensions are based on the input file size. For uniform data, this index is expected to partition the input file into equally sized partitions, each of size 64MB. If an input record (e.g., a rectangle) overlaps multiple partitions, it is replicated to all overlapping partitions. R-tree and R+-tree are both used for skewed input data where a uniform grid does not work well. In an R-tree, each input record is stored in only one partition. The boundaries of each partition may need to expand to enclose all contained records, which means that partitions might overlap. In contrast, in an R+-tree, partitions are kept disjoint while records are replicated to all overlapping partitions.
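The replication behavior of the grid index can be illustrated with a small, single-machine sketch. This is not SpatialHadoop code: the function names, the fixed 2x2 grid, and the rule that a shape touching a cell boundary is assigned to both cells are assumptions made for illustration only.

```python
def grid_cells_for_rect(rect, space, columns, rows):
    """Return the (column, row) cells of a uniform grid that a rectangle overlaps.

    rect and space are (x1, y1, x2, y2) tuples; space is the area covered by the grid.
    """
    sx1, sy1, sx2, sy2 = space
    cell_w = (sx2 - sx1) / columns
    cell_h = (sy2 - sy1) / rows
    x1, y1, x2, y2 = rect
    # Clamp indexes so shapes on the upper boundary of the space stay in the last cell.
    c1 = min(int((x1 - sx1) / cell_w), columns - 1)
    c2 = min(int((x2 - sx1) / cell_w), columns - 1)
    r1 = min(int((y1 - sy1) / cell_h), rows - 1)
    r2 = min(int((y2 - sy1) / cell_h), rows - 1)
    return [(c, r) for c in range(c1, c2 + 1) for r in range(r1, r2 + 1)]


def partition(rects, space, columns, rows):
    """Assign each rectangle to every grid cell it overlaps (replication)."""
    partitions = {}
    for rect in rects:
        for cell in grid_cells_for_rect(rect, space, columns, rows):
            partitions.setdefault(cell, []).append(rect)
    return partitions


space = (0.0, 0.0, 100.0, 100.0)
rects = [(10, 10, 20, 20), (45, 45, 55, 55)]
parts = partition(rects, space, columns=2, rows=2)
# The first rectangle lands in one cell; the second straddles the grid
# center and is replicated to all four cells.
```

The replication shown here is what a grid index and an R+-tree have in common; an R-tree would instead store each record once and grow the partition boundary around it.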

  bin/shadoop index <input> <output> shape:<input format>
     sindex:<index> blocksize:<size> -overwrite

Arguments

input and output: Path to input and output files
shape:input format: Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
sindex:index: Type of index to construct, either grid, rtree or r+tree.
blocksize:size: Block size for generated file. By default, it uses the default block size in the output directory.
-overwrite: If provided, the output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

Range query

The rangequery command is used to run a spatial range query for an input file. If the input file is indexed, the index will be accessed to speed up the execution of the query. Otherwise, the whole file will be scanned to retrieve matching records.

  bin/shadoop rangequery <input> <output> shape:<input format>
     rect:<rectangle> -overwrite

Arguments

input and output: paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. If output path is not provided, the output will be discarded which is useful for benchmarking if the actual output is not important.
shape:input format: Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
rect:rectangle: Dimensions of the query rectangle in the format x1,y1,x2,y2 where (x1,y1) is the coordinate of the minimum corner of the query rectangle and (x2,y2) is the coordinate of its maximum corner.
-overwrite: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.
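As a rough illustration of the non-indexed case, the following sketch parses a rectangle in the x1,y1,x2,y2 format and scans every record for overlap. The helper names are hypothetical, and treating rectangles that merely touch as a match is an assumption, not documented SpatialHadoop behavior.

```python
def parse_rect(text):
    """Parse 'x1,y1,x2,y2' into a tuple of floats, minimum corner first."""
    x1, y1, x2, y2 = (float(v) for v in text.split(","))
    if x1 > x2 or y1 > y2:
        raise ValueError("expected the minimum corner first: x1 <= x2 and y1 <= y2")
    return (x1, y1, x2, y2)


def overlaps(a, b):
    """True if two (x1, y1, x2, y2) rectangles intersect (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def range_query(records, query):
    """Full scan: keep every record whose rectangle overlaps the query."""
    return [r for r in records if overlaps(r, query)]


query = parse_rect("10,20,30,40")
records = [(0, 0, 5, 5), (25, 35, 50, 60), (12, 22, 14, 24)]
matches = range_query(records, query)
```

With an index, the same overlap test would first be applied to partition boundaries so that only partitions intersecting the query rectangle are read.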

k Nearest Neighbor

The knn operation is used to run a k nearest neighbor query for an input file. Given a query point Q, this operation returns the k closest records to Q. If the input file is indexed, the index will be accessed to speed up the execution of the query. Otherwise, the whole file will be scanned to retrieve the correct answer.

  bin/shadoop knn <input> <output> shape:<input format>
     point:<x,y> k:<k> -overwrite

Arguments

input and output: paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. If output path is not provided, the output will be discarded which is useful for benchmarking if the actual output is not important.
shape:input format: Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
point:x,y: Coordinates of the query point.
k:k: Query parameter k, i.e., maximum number of records to return.
-overwrite: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.
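The full-scan fallback can be pictured as selecting the k smallest distances to the query point. The sketch below assumes point records and Euclidean distance; neither the function name nor the distance measure is taken from SpatialHadoop itself.

```python
import heapq
import math


def knn(points, query, k):
    """Return up to k points closest to the query point, nearest first."""
    qx, qy = query
    return heapq.nsmallest(
        k, points, key=lambda p: math.hypot(p[0] - qx, p[1] - qy)
    )


points = [(0, 0), (3, 4), (1, 1), (10, 10)]
nearest = knn(points, (0, 0), k=2)
```

If the file holds fewer than k records, all of them are returned, which matches the description of k as a maximum.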

Distributed Join

The dj operation performs a spatial join between two files. If both files are indexed, the distributed join algorithm joins every pair of overlapping partitions, which is very scalable and efficient for large files. If at least one of the files is not indexed, the algorithm reduces to a simple block nested loop join which runs in a distributed environment in SpatialHadoop.

If both files are indexed, the algorithm might automatically choose to repartition one of the files to match the partitions used by the other file which, in some cases, might improve the performance of the operation by reducing the total number of map tasks. This choice can be overridden with the repartition parameter.

  bin/shadoop dj <input1> <input2> <output> shape:<input format>
     repartition:<yes|no|auto> -overwrite

Arguments

input and output: paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. If output path is not provided, the output will be discarded which is useful for benchmarking if the actual output is not important.
shape:input format: Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
repartition:yes|no|auto: Whether to repartition one of the files to speed up the execution or not. yes will always repartition the smaller file. no will never repartition any of the files. auto, which is the default, will automatically decide whether to repartition or not based on a simple cost model.
-overwrite: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.
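For the non-indexed case, a block nested loop join can be sketched on a single machine as follows. This is only an illustration of the general technique: the block size, the rectangle-overlap join predicate, and all names are assumptions, and the distributed scheduling done by SpatialHadoop is not modeled.

```python
def overlaps(a, b):
    """True if two (x1, y1, x2, y2) rectangles intersect (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def block_nested_loop_join(left, right, block_size):
    """Compare every block of `left` against every block of `right`."""
    results = []
    for i in range(0, len(left), block_size):
        left_block = left[i:i + block_size]
        for j in range(0, len(right), block_size):
            right_block = right[j:j + block_size]
            # Each pair of blocks is an independent unit of work.
            for a in left_block:
                for b in right_block:
                    if overlaps(a, b):
                        results.append((a, b))
    return results


left = [(0, 0, 2, 2), (5, 5, 6, 6)]
right = [(1, 1, 3, 3), (10, 10, 11, 11)]
pairs = block_nested_loop_join(left, right, block_size=1)
```

Because every block pair is compared, the work grows with the product of the two file sizes, which is why joining only overlapping partitions of indexed files scales better.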

SJMR

SJMR is the MapReduce implementation of the Partition Based Spatial Merge Join by Jignesh Patel and David DeWitt. This operation is designed to perform spatial join efficiently for non-indexed files. This algorithm partitions both files according to a uniform grid and joins every pair of matching grid cells. The dimensions of the partitioning grid are automatically determined based on input file sizes such that average partition size is equal to HDFS block size (e.g., 64MB). If one or both of the input files are skewed, the performance of this algorithm may deteriorate.

  bin/shadoop sjmr <input1> <input2> <output> shape:<input format>
     -overwrite

Arguments

input and output: paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. If output path is not provided, the output will be discarded which is useful for benchmarking if the actual output is not important.
shape:input format: Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
-overwrite: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.
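The partition-then-join idea can be sketched in a few lines. The code below is a simplified, in-memory illustration and not the SJMR implementation: the fixed cell size, the overlap predicate, and the use of a set to drop pairs reported by more than one cell are all assumptions.

```python
from collections import defaultdict


def overlaps(a, b):
    """True if two (x1, y1, x2, y2) rectangles intersect (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def cells(rect, cell_size):
    """Grid cells (column, row) covered by a rectangle on a uniform grid."""
    x1, y1, x2, y2 = rect
    return [
        (c, r)
        for c in range(int(x1 // cell_size), int(x2 // cell_size) + 1)
        for r in range(int(y1 // cell_size), int(y2 // cell_size) + 1)
    ]


def grid_join(left, right, cell_size):
    """Partition both inputs by grid cell, then join within each cell."""
    grid = defaultdict(lambda: ([], []))
    for rect in left:
        for cell in cells(rect, cell_size):
            grid[cell][0].append(rect)
    for rect in right:
        for cell in cells(rect, cell_size):
            grid[cell][1].append(rect)
    results = set()  # a pair spanning several cells is found once per cell
    for lefts, rights in grid.values():
        for a in lefts:
            for b in rights:
                if overlaps(a, b):
                    results.add((a, b))
    return results


left = [(0, 0, 15, 15)]
right = [(5, 5, 12, 12), (30, 30, 31, 31)]
joined = grid_join(left, right, cell_size=10)
```

The sketch also hints at why skew hurts: if most records fall into a few cells, those cells carry most of the comparisons.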

Minimum Bounding Rectangle

The mbr operation determines the minimal bounding rectangle (MBR) of an input file. If the file is indexed, the MBR is retrieved directly from the index. Otherwise, the whole file is scanned and the MBR is calculated as a simple aggregate function.

  bin/shadoop mbr <input> shape:<input format>

Arguments

input: path to input file. No output file is provided as the result is directly written to standard output. If the input is a directory, the output is cached by writing a master file in the same directory.
shape:input format: Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
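Computing the MBR by scanning is a plain aggregate: start from the first record and keep expanding the rectangle. A minimal sketch, assuming records are (x1, y1, x2, y2) tuples and using hypothetical function names:

```python
from functools import reduce


def expand(mbr, rect):
    """Grow an MBR so that it also encloses rect."""
    return (
        min(mbr[0], rect[0]),
        min(mbr[1], rect[1]),
        max(mbr[2], rect[2]),
        max(mbr[3], rect[3]),
    )


def file_mbr(rects):
    """MBR of a non-empty collection of rectangles."""
    return reduce(expand, rects)


result = file_mbr([(1, 2, 3, 4), (0, 5, 2, 6), (2, 1, 7, 3)])
```

Since expand is associative and commutative, partial MBRs computed on separate blocks can be merged in any order, which is what makes the operation easy to distribute.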

Read file

Reads an indexed file and writes some simple information about the index.

  bin/shadoop readfile <input>

Arguments

input: path to input file. No output file is provided as the result is directly written to standard output.

Sample

Reads a random sample from an input file.

  bin/shadoop sample <input> <output>  shape:<input format>
    outshape:<output format>
    (count:<c> | size:<s> | ratio:<r>) seed:<sd>

Arguments

input and output: paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. If output path is not provided, the output will be written to standard output which is useful for debugging or when the output is very small (a few records).
shape:input format: Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
outshape:output format: If the output is desired to be in a different format than input, this parameter can be provided to do a simple transformation on sampled data. The two possible options are rectangle and point. If the output is rectangle, the minimal bounding rectangle (MBR) of input shape is returned. If the output is point, the center point of the MBR of each sampled record is returned.
count:c: read a random sample that contains c records
size:s: read a random sample with a maximum of s bytes
ratio:r: read a random sample whose number of records is the given ratio of the number of records in the input file
seed:sd: use the given seed for randomization. This can be useful to replicate some experiments
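The effect of count, ratio, seed and the point output shape can be mimicked in memory as shown below. This is a hedged sketch: the function names are hypothetical, the size option is left out, and drawing exactly round(n * ratio) records is an assumption about how a ratio could be applied rather than a description of SpatialHadoop's sampler.

```python
import random


def sample_records(records, count=None, ratio=None, seed=None):
    """Draw a random sample by record count or by ratio of the input size."""
    if (count is None) == (ratio is None):
        raise ValueError("provide exactly one of count or ratio")
    rng = random.Random(seed)  # the same seed reproduces the same sample
    if ratio is not None:
        count = round(len(records) * ratio)
    return rng.sample(records, min(count, len(records)))


def to_center_point(rect):
    """Center of a rectangle's MBR, as described for the point output shape."""
    x1, y1, x2, y2 = rect
    return ((x1 + x2) / 2, (y1 + y2) / 2)


records = [(i, i, i + 2, i + 2) for i in range(100)]
first = sample_records(records, count=5, seed=42)
again = sample_records(records, count=5, seed=42)
centers = [to_center_point(r) for r in first]
```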

Spatial Generator

The generate operation is used to generate a file with spatial data of an arbitrary size. This can be very useful for benchmarking and stress tests.

  bin/shadoop generate <output> shape:<output format> mbr:<x1,y1,x2,y2>
    blocksize:<B> sindex:<grid> seed:<sd> rectsize:<rs> -overwrite

Arguments

output: path to the output file.
shape:output format: shapes to generate in output file. For now, the only shapes that are supported are point and rectangle.
mbr:x1,y1,x2,y2: bounding rectangle of the generated file. All generated objects must lie within this bounding box.
sindex:grid: generate the output file already indexed using a grid index. Notice that you can only specify a grid index. R-tree indexes can only be constructed after the file is generated because it depends on data distribution.
seed:sd: seed to use when generating data. This is useful to replicate some experiments using the same exact data.
rectsize:rs: maximum edge size for generated rectangles. The width and height of each generated rectangle are picked uniformly at random in the range (0, rs).
-overwrite: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.
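How mbr, rectsize and seed interact can be sketched with a small in-memory generator. The names and the record count n are hypothetical, and placing each rectangle by sampling its lower corner so that it fits inside the bounding box is an assumption; the actual generator may enforce the bound differently.

```python
import random


def generate_rectangles(n, mbr, rectsize, seed=None):
    """Generate n random rectangles that lie inside mbr = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = mbr
    if rectsize > x2 - x1 or rectsize > y2 - y1:
        raise ValueError("rectsize must not exceed the MBR width or height")
    rng = random.Random(seed)  # same seed, same data
    rects = []
    for _ in range(n):
        w = rng.uniform(0, rectsize)
        h = rng.uniform(0, rectsize)
        # Pick the minimum corner so the rectangle stays within the MBR.
        x = rng.uniform(x1, x2 - w)
        y = rng.uniform(y1, y2 - h)
        # min() guards against floating-point rounding at the upper edge.
        rects.append((x, y, min(x + w, x2), min(y + h, y2)))
    return rects


data = generate_rectangles(50, (0, 0, 100, 100), rectsize=10, seed=7)
```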

Polygon Union

Computes the union of a set of polygons in an input file. This operation only works for input files of type JTSShape. It takes one input file and computes the union of all polygons contained in this file.

  bin/shadoop union <input> <output> -overwrite

Arguments

input and output: paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution.
-overwrite: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.
