
Hadoop FileSystem for Cloudian HyperStore

Takenori Sato edited this page Jan 14, 2016 · 4 revisions

Hadoop FileSystem

Hadoop FileSystem is the underlying filesystem abstraction used by Hadoop/Spark. HDFS is the default Hadoop FileSystem implementation, but many others are available. For example, FTPFileSystem is built for FTP, and NativeAzureFileSystem for Windows Azure Blob Storage.

There are four main functions in Hadoop FileSystem.

  1. Metadata
    • what files/folders are included in a particular folder?
    • when was this file/folder created?
    • owner, permissions
  2. Block
    • which blocks are used for a certain range of a particular file?
    • which node has a particular block?
  3. Read/Write
    • write bytes to a file
    • read bytes from a file
  4. URI
    • protocol (e.g. hdfs://)
    • schema (e.g. hdfs://foo/bar => foo is a folder, and bar is a file or a folder in foo)

Each Hadoop FileSystem implements these functions according to its backend.

Hadoop FileSystem for S3

So, which Hadoop FileSystems are available for S3? As explained here, there are three types of Hadoop FileSystem available. Unless you have a particular reason, you should use the latest one, S3AFileSystem.

Here's how FileSystem for S3 is implemented.

  1. Metadata
    • Bucket Listing
    • HEAD Object
  2. Block
    • Not Supported
  3. Read/Write
    • GET Object byte range
    • PUT Object / Multipart Upload
  4. URI
    • protocol
      • S3AFileSystem => s3a
      • NativeS3FileSystem => s3n
      • S3FileSystem => s3
    • schema
      • e.g. s3a://BUCKET_NAME/OBJECT_KEY
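The URI handling above can be seen with plain java.net.URI, independent of the Hadoop classes: the scheme selects the FileSystem implementation, while the authority and path name the bucket and object key. The bucket and key below are hypothetical examples.

```java
import java.net.URI;

public class S3aUriDemo {
    public static void main(String[] args) {
        // An s3a URI as consumed by S3AFileSystem; bucket/key are made up.
        URI uri = URI.create("s3a://mybucket/path/to/object");
        System.out.println(uri.getScheme());    // s3a  -> selects S3AFileSystem
        System.out.println(uri.getAuthority()); // mybucket
        System.out.println(uri.getPath());      // /path/to/object
    }
}
```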

Security in Hadoop FileSystem for S3

Hadoop FileSystem for S3 authenticates against S3 just as any other S3 client does. For example, S3AFileSystem uses the Amazon SDK, while NativeS3FileSystem uses the JetS3t library.

So, at a minimum, you have to provide an access key and a secret key. Here are the relevant properties for S3AFileSystem.

  • fs.s3a.access.key
  • fs.s3a.secret.key

You can set them in core-site.xml as defaults. For fine-grained security, you can instead pass them when launching your Hadoop/Spark client with -Dproperty=value; that value then overrides the default within the scope of the client.

value can also be taken from an environment variable, set separately for each logged-in user.
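This override order can be sketched in plain Java. The lookup below is an illustration, not Hadoop's actual Configuration logic: a JVM system property (set with -D) wins over an environment variable, which wins over the built-in default. The environment-variable name is a hypothetical example.

```java
public class CredentialLookup {
    // Resolve a credential value: -D system property beats an environment
    // variable, which beats the default from core-site.xml. A sketch of the
    // precedence described above, not Hadoop's real resolution code.
    static String resolve(String property, String envVar, String defaultValue) {
        String v = System.getProperty(property);
        if (v != null) return v;
        v = System.getenv(envVar);
        if (v != null) return v;
        return defaultValue;
    }

    public static void main(String[] args) {
        // Simulate launching the client with -Dfs.s3a.access.key=PER_CLIENT_KEY
        System.setProperty("fs.s3a.access.key", "PER_CLIENT_KEY");
        System.out.println(resolve("fs.s3a.access.key",
                "AWS_ACCESS_KEY_ID", "DEFAULT_KEY"));
    }
}
```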

Hadoop FileSystem for Cloudian HyperStore

Cloudian HyperStore guarantees 100% compatibility with Amazon S3, and Hadoop FileSystem is no exception. You can access your objects in Cloudian HyperStore with any of the Hadoop FileSystems for S3: s3a, s3n, and s3.

Here are example properties in core-site.xml.

<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
  <description>AWS access key ID. Omit for Role-based authentication.</description>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
  <description>AWS secret key. Omit for Role-based authentication.</description>
</property>

<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>true</value>
  <description>Enables or disables SSL connections to S3.</description>
</property>

<property>
  <name>fs.s3a.endpoint</name>
  <value>s3-region1.cloudian.com:4443</value>
  <description>AWS S3 endpoint to connect to. An up-to-date list is
    provided in the AWS Documentation: regions and endpoints. Without this
    property, the standard region (s3.amazonaws.com) is assumed.
  </description>
</property>

And here are the equivalent settings in spark-defaults.conf.

spark.hadoop.fs.s3a.access.key   ACCESS_KEY  
spark.hadoop.fs.s3a.secret.key   SECRET_KEY  
spark.hadoop.fs.s3a.connection.ssl.enabled   true  
spark.hadoop.fs.s3a.endpoint     s3-region1.cloudian.com:4443 
spark.hadoop.fs.hsfs.impl        com.cloudian.hadoop.HyperStoreFileSystem  

Data Locality in Cloudian HyperStore

You might have noticed that Block operations are not supported in S3. This is because S3 lives in the cloud, where no physical locations are exposed to clients: you never know where an object stored in S3 physically resides.

But, as explained in Data Locality, you will need this as your cluster gets bigger.

To enable Data Locality in S3, we wrote our own Hadoop FileSystem for Cloudian HyperStore (hsfs). The main purpose of hsfs is to minimize network transfers by delivering an analytical job to a node that actually stores its data.

To make this possible, Hadoop FileSystem allows a subclass to return a block size for a particular path (file), and a list of nodes that store a particular chunk (the Block operations).

With this information, an analytical job is split into multiple tasks, each of which is delivered to a node where it finds its data locally.
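The splitting step can be sketched as follows. The helper below is hypothetical: in real Hadoop, the block size comes from the FileSystem subclass and the per-block hosts come from FileSystem.getFileBlockLocations, which an InputFormat then turns into splits.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    // One split: a byte range of the file plus the host that stores it.
    static final class Split {
        final long start, length;
        final String host;
        Split(long start, long length, String host) {
            this.start = start; this.length = length; this.host = host;
        }
    }

    // Chop a file of fileLen bytes into blockSize-aligned splits, assigning
    // each to the host reported for that block (hostPerBlock is a stand-in
    // for what getFileBlockLocations would return).
    static List<Split> computeSplits(long fileLen, long blockSize, String[] hostPerBlock) {
        List<Split> splits = new ArrayList<>();
        int block = 0;
        for (long start = 0; start < fileLen; start += blockSize, block++) {
            long len = Math.min(blockSize, fileLen - start);
            splits.add(new Split(start, len, hostPerBlock[block]));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 250 MB file with a 100 MB block size yields three splits:
        // two full blocks and one 50 MB tail.
        List<Split> splits = computeSplits(250L << 20, 100L << 20,
                new String[]{"node1", "node2", "node1"});
        for (Split s : splits)
            System.out.println(s.start + "+" + s.length + " @ " + s.host);
    }
}
```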

In order to make this work in Cloudian HyperStore, there are some rules you need to be aware of.

method  size    data locality
------  ------  -------------
PUT     <=MOCS  YES
PUT     >MOCS   YES
MPU     <MOCS   NO
MPU     =MOCS   YES
MPU     >MOCS   NO

  • MOCS: Max Object Chunk Size
  • MPU: Multipart Upload

As the table shows, all PUT requests are handled correctly because they are chunked by MOCS, while MPU preserves locality only when its part size equals MOCS.
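The table can be captured in a small predicate. This is a sketch of the rule as stated above, not Cloudian code; the method names and sizes are illustrative.

```java
public class LocalityRule {
    // Whether data written with the given method and part/object size
    // (relative to the Max Object Chunk Size, MOCS) keeps Data Locality,
    // per the table above: PUT always does, MPU only when part size == MOCS.
    static boolean hasLocality(String method, long size, long mocs) {
        if (method.equals("PUT")) return true;
        if (method.equals("MPU")) return size == mocs;
        throw new IllegalArgumentException("unknown method: " + method);
    }

    public static void main(String[] args) {
        long mocs = 100L << 20; // e.g. a 100 MB chunk size
        System.out.println(hasLocality("PUT", 250L << 20, mocs)); // true
        System.out.println(hasLocality("MPU", 50L << 20, mocs));  // false
        System.out.println(hasLocality("MPU", mocs, mocs));       // true
    }
}
```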

Also note that Data Locality is not available when a chunk cannot be read directly, which happens in the following cases.

  1. Erasure Coding
  2. Server Side Encryption
