Skip to content

Extending spatial data types in SpatialHadoop

Ahmed Eldawy edited this page May 11, 2016 · 6 revisions

Extensible data types

SpatialHadoop ships with several data types including (Point, Rectangle and Polygon). There are different cases where you'll need to extend these data types or implement new spatial data types.

  • Input files are not in the standard format used by SpatialHadoop
  • Each record contains more information than just the shape (e.g., tags or comments)
  • The application uses shapes other than the supported shapes (e.g., rounded rectangle)

This tutorial describes in details how current data types are implemented and how to extend it with more spatial data types.

Built-in data types

SpatialHadoop contains three main spatial data types, namely, Point, Rectangle and Polygon. Each data types stores just the spatial information about the shape without any extra information. All shapes are two-dimensional in the Euclidean space. A point is represented by its two dimensions (X and Y). A rectangle is represented by a corner point (X, Y) and the dimensions (Width x Height). A polygon is represented as a list of two-dimensional points.

The main storage format for spatial data types in SpatialHadoop is the text format. This makes it easier to import/export legacy formats in other applications. The standard format is a CSV format where each record is stored in one line. This format can be changed for custom data types provided that each record is stored in exactly one line. For point, a line contains two fields (X,Y) separated by a comma. For a rectangle, the tuple (X, Y, Width, Height) is stored with a comma as a separator. For polygon, each line contains number of points followed by the coordinates of each point. For example, a triangle with the corner points (0,0), (1,1) and (1,0) can be represented as "3,0,0,1,1,1,0". All coordinates used in the standard data types are long integers (64-bit).

User-defined data types

The first step is to add SpatialHadop as an external library to your project. You can follow that tutorial or use the virtual machine image that is available on the SpatialHadoop web page. The example shown on this page is already available in that virtual machine image. This example is also available on the following GitHub repository. SpatialHadoop example on GitHub.

To define your own data type, you need to define it as a new class that implements the Shape interface. For convenience, you could choose to extend one of the standard data types and built on top of it instead of building a class from scratch. For example, let's say that your files contain records represented as rectangles. Unlike the standard rectangles, each rectangle in your file has an additional ID that precedes the coordinates of the rectangle. You can extend the rectangle class and add an additional ID field.

RectangleID.java

public class RectangleID extends Rectangle {
   private int id;

The new field must also be written when an object of this class is serialized over network. This is required by SpatialHadoop (and Hadoop) when objects are transferred from mappers to reducers. This can be done as follows.

public void write(DataOutput out) throws IOException {
    out.writeInt(id);
    super.write(out);
}

public void readFields(DataInput in) throws IOException {
    id = in.readInt();
    super.readFields(in);
}

You need also to specify the format of the input file that contains objects of this type. This is done by implementing two methods fromText and toText.

The first method takes as input a Text that represents a line read from the input file, and parses it to fill the target object. The second method does the exact opposite of this. It takes a Text object and serializes the information stored in this object to this text. It should not add a terminating new line as this is added by the framework itself. The implementation of these two method will look like this.

public void fromText(Text text) {
  id = TextSerializerHelper.consumeInt(text, ',');
  super.fromText(text);
}

public Text toText(Text text) {
  TextSerializerHelper.serializeInt(id, text, ',');
  return super.toText(text);
}

Finally, you need to override the clone method in the new data type. This allows SpatialHadoop to create new instances of an object as needed by some operations (such as the spatial join operaiton).

public RectangleID clone() {
  RectangleID c = new RectangleID();
  c.id = this.id; // Set the id field
  c.set(this); // Set rectangle boundaries
  return c;
}

Once you're done with this class, you can use it with the existing operations (range query, kNN and spatial join). You need to package your class in a JAR file so that SpatialHadoop can load it. If you use Maven, you can generate the JAR file by issuing the following command:

mvn package

The JAR is typically generated under the target subdirectory in your project home. Let us assume the JAR file is called 'example-0.0.1-SNAPSHOT.jar', the command line for running a range query could be as follows.

shadoop rangequery -libjars target/example-0.0.1-SNAPSHOT.jar rectangle_id.csv rect:0,0,5000,5000 shape:edu.umn.cs.spatialHadoop.example.RectangleID

Note that you should use the fully qualified name of the class (i.e., package + class name).

You can find a sample test file on the following link. https://github.com/aseldawy/spatialhadoop-example/blob/master/src/test/resources/rectangle_id.csv

Clone this wiki locally