-
Notifications
You must be signed in to change notification settings - Fork 99
A list of most operations in SpatialHadoop
SpatialHadoop ships with many spatial operations that work efficiently on a cluster and scales up to terabytes of data. This page describes these operations with all possible parameters and customizations. You can list down all supported operations by types bin/shadoop. You can also get a quick description and arguments for a specific operation by typing bin/shadoop <operation>
The index command is used to construct a spatial index for a given input file. Currently, there are three spatial indexes which can be constructed in SpatialHadoop, namely, Grid, R-tree and R+-tree.
A grid index partitions the data according to a uniform grid based on the input file size. For uniform data, this index is expected to partition the input file, into equally sized partition, each of size 64MB. If an input record (e.g., rectangle) overlaps multiple partitions, it is replicated to all overlapping partitions. R-tree and R+-tree are both used for skewed input data where a uniform grid does not work properly. In R-tree, each input record is stored in only one partition. The boundaries of each partition may need to expand to enclose all contained records which means that partitions might overlap. On the other side, in R+-tree, partitions are kept disjoint while records are replicated to all overlapping partitions.
bin/shadoop index <input> <output> shape:<input format>
sindex:<index> blocksize:<size> -overwrite
-
input
andoutput
: Path to input and output files -
shape:input format
: Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype. -
sindex:index
: Type of index to construct, eithergrid
,rtree
orr+tree
. -
blocksize:size
: Block size for generated file. By default, it uses the default block size in the output directory. -
-overwrite
: If provided, the output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.
The rangequery command is used to run a spatial range query for an input file. If the input file is indexed, the index will be accessed to speed up the execution of the query. Otherwise, the whole file will be scanned to retrieve matching records.
bin/shadoop rangequery <input> <output> shape:<input format>
rect:<rectangle> -overwrite
-
input
andoutput
: paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. If output path is not provided, the output will be discarded which is useful for benchmarking if the actual output is not important. -
shape:input format
: Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype. -
rect:rectangle
: Dimensions of the query rectangle in the format x1,y1,x2,y2 where (x1,y1) is the coordinate of the minimum corner of the query ractangle while (x2,y2) is the coordinate of the maximum corner of it. -
-overwrite
: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.
he knn operation is used to run a k nearest neighbor query for an input file. Given a query point Q, this operation returns the k closest records to Q. If the input file is indexed, the index will be accessed to speed up the execution of the query. Otherwise, the whole file will be scanned to retrieve correct answer.
bin/shadoop knn <input> <output> shape:<input format>
point:<x,y> k:<k> -overwrite
-
input
andoutput
: paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. If output path is not provided, the output will be discarded which is useful for benchmarking if the actual output is not important. -
shape:input format
: Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype. -
point:x,y
: Coordinates of the query point. -
k:k
: Query parameter k, i.e., maximum number of records to return. -
-overwrite
: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.
The dj operation performs a spatial join between two files. If both files are indexed, the distributed join algorithm joins every pair of overlapping partitions which is very scalable and efficient for large files. If at least one of the files is not indexed, the algorithm reduces to a simple block nested loop join which run in a distributed environment in SpatialHadoop.
If both files are indexed, the algorithm might automatically choose to repartition of the files to match the partitions used by the other file which, in some cases, might increase the performance of the operation by reducing total number of map tasks. This option can be overridden by command parameters.
bin/shadoop dj <input1> <input2> <output> shape:<input format>
repartition:<yest|no|auto> -overwrite
-
input
andoutput
: paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. If output path is not provided, the output will be discarded which is useful for benchmarking if the actual output is not important. -
shape:input format
: Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype. -
repartition:yes|no|auto
: Whether to repartition one of the files to speedup the execution or not. yes will always repartition the smaller file. no will never repartition any of the files. auto which is the default, will automatically decide whether to repartition or not based on a simple cost model. -
-overwrite
: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.
SJMR is the MapReduce implementation of the Partition Based Spatial Merge Join by Jignesh Patel and David DeWitt. This operation is designed to perform spatial join efficiently for non-indexed files. This algorithm partitions both files according to a uniform grid and joins every pair of matching grid cells. The dimensions of the partitioning grid are automatically determined based on input file sizes such that average partition size is equal to HDFS block size (e.g., 64MB). If one or both of the input files are skewed, the performance of this algorithm may deteriorate.
bin/shadoop sjmr <input1> <input2> <output> shape:<input format>
-overwrite
-
input
andoutput
: paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. If output path is not provided, the output will be discarded which is useful for benchmarking if the actual output is not important. -
shape:input format
: Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype. -
-overwrite
: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.
The mbr operation determines the minimal bounding rectangle (MBR) of an input file. If the file is indexed, the MBR is retrieved directly from the index. Otherwise, the whole file is scanned and the MBR is calculated as a simple aggregate function.
bin/shadoop mbr <input> shape:<input format>
-
input
: path to input file. No output file is provided as the result is directly written to standard output. If the input is a directory, the output is cached by writing a master file in the same directory. -
shape:input format
: Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
Reads an indexed file and writes some simple information about the index.
bin/shadoop readfile <input>
-
input
: path to input file. No output file is provided as the result is directly written to standard output.
Reads a random sample from an input file.
bin/shadoop sample <input> <output> shape:<input format>
outshape:<output format>
(count:<c> | size:<s> | ratio:<r>) seed:<sd>
-
input
andoutput
: paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. If output path is not provided, the output will be written to standard output which is useful for debugging or when the output is very small (a few records). -
shape:input format
: Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype. -
outshape:output format
: If the output is desired to be in a different format than input, this parameter can be provided to do a simple transformation on sampled data. The two possible options are rectangle and point. If the output is rectangle, the minimal bounding rectangle (MBR) of input shape is returned. If the output is point, the center point of the MBR of each sampled record is returned. -
count:c
: read a random sample that contains c records -
size:s
: read a random sample with a maximum of s bytes -
ratio:r
: read a random sample with with number of records of the given ratio compared to input file -
seed:sd
: use the given seed for randomization. This can be useful to replicate some experiments
The operations generate is used to generate a file with spatial data of an arbitrary size. This can be very useful for benchmarking and stress tests.
bin/shadoop generate <output> shape:<output format> mbr:<x1,y1,x2,y2>
blocksize:<B> sindex:<grid> seed:<sd> rectsize:<rs> -overwrite
-
input
: path to output. -
shape:output format
: shapes to generate in output file. For now, the only shapes that are supported are point and rectangle. -
mbr:x1,y1,x2,y2
: bounding rectangle of the generated file. All generated objects must lie within this bounding box. -
sindex:grid
: generate the output file already indexed using a grid index. Notice that you can only specify a grid index. R-tree indexes can only be constructed after the file is generated because it depends on data distribution. -
seed:sd
: seed to use when generating data. This is useful to replicate some experiments using the same exact data. -
rectsize:rs
: maximum edge size for generated rectangles .The width and height of each generated rectangle are picked uniformly random in the range (0, rs). -
-overwrite
: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.
Computes the union of a set of polygons in an input file. This operation only works for input files of type JTSShape. It takes one input file and computes the union of all polygons contained in this file.
bin/shadoop union <input> <output> -overwrite
-
input
andoutput
: paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. -
-overwrite
: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.