Hadoop FileSystem for Cloudian HyperStore
Hadoop FileSystem is the storage abstraction that Hadoop and Spark read from and write to. HDFS is the default implementation, but many others are available. For example, FTPFileSystem is built for FTP, and NativeAzureFileSystem is built for Windows Azure Blob Storage.
A Hadoop FileSystem has four main functions.
- Metadata
  - Which files and folders does a particular folder contain?
  - When was a file or folder created?
  - Who owns it, and what are its permissions?
- Block
  - Which blocks hold a given byte range of a particular file?
  - Which node stores a particular block?
- Read/Write
  - Write bytes to a file
  - Read bytes from a file
- URI
  - protocol (e.g. hdfs://)
  - schema (e.g. hdfs://foo/bar => foo is a folder, and bar is a file or a folder inside foo)
Each Hadoop FileSystem implementation provides these functions in whatever way suits its backend.
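To make the four functions concrete, here is a minimal in-memory sketch. It is a toy Python analogue, not the Hadoop Java API: the class name `InMemoryFileSystem`, its method names, the `mem` protocol, and the 4-byte block size are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileStatus:
    """Metadata for one file or folder (illustrative fields only)."""
    path: str
    is_dir: bool
    length: int = 0
    owner: str = "hadoop"
    permission: str = "rw-r--r--"


class InMemoryFileSystem:
    """Toy filesystem that keeps every file as bytes in a dict."""

    scheme = "mem"      # hypothetical protocol name
    block_size = 4      # tiny block size so the example is easy to follow

    def __init__(self):
        self._files = {}  # absolute path -> file content

    # --- Read/Write ---
    def write(self, path, data):
        self._files[path] = bytes(data)

    def read(self, path, offset=0, length=None):
        data = self._files[path]
        end = len(data) if length is None else offset + length
        return data[offset:end]

    # --- Metadata ---
    def get_status(self, path):
        if path in self._files:
            return FileStatus(path, False, len(self._files[path]))
        prefix = path.rstrip("/") + "/"
        if any(p.startswith(prefix) for p in self._files):
            return FileStatus(path, True)
        raise FileNotFoundError(path)

    def list_dir(self, path):
        prefix = path.rstrip("/") + "/"
        children = set()
        for p in self._files:
            if p.startswith(prefix):
                # Keep only the first path component below the folder.
                children.add(prefix + p[len(prefix):].split("/")[0])
        return sorted(children)

    # --- Block ---
    def block_ranges(self, path):
        """Return (start, end) byte offsets of each block; end is exclusive."""
        size = len(self._files[path])
        return [(start, min(start + self.block_size, size))
                for start in range(0, size, self.block_size)]

    # --- URI ---
    def uri(self, path):
        return f"{self.scheme}://{path.lstrip('/')}"
```

For example, after `fs.write("/foo/bar", b"0123456789")`, calling `fs.read("/foo/bar", 2, 3)` returns the three bytes starting at offset 2, and `fs.block_ranges("/foo/bar")` splits the ten bytes into blocks of at most four bytes each.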
So, which Hadoop FileSystems are available for S3? There are three. Unless you have a particular reason not to, you should use the latest one, S3AFileSystem.
Here is how the FileSystem functions are implemented on top of S3.
- Metadata
  - Bucket listing
  - HEAD Object
- Block
  - Not supported
- Read/Write
  - GET Object with a byte range
  - PUT Object / Multipart Upload
- URI
  - protocol
    - S3AFileSystem => s3a
    - NativeS3FileSystem => s3n
    - S3FileSystem => s3
  - schema
    - e.g. s3a://BUCKET_NAME/OBJECT_KEY
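As a rough illustration of the URI and Read/Write mappings above, the sketch below splits an S3 FileSystem URI into its parts and builds the HTTP `Range` header value that a byte-range GET would carry. The helper names `parse_s3_uri` and `range_header` are hypothetical; this is not how any of the Hadoop classes is actually written.

```python
from urllib.parse import urlsplit

# Protocol -> FileSystem class, as listed above.
SCHEME_TO_FILESYSTEM = {
    "s3a": "S3AFileSystem",
    "s3n": "NativeS3FileSystem",
    "s3": "S3FileSystem",
}


def parse_s3_uri(uri):
    """Split e.g. s3a://BUCKET_NAME/OBJECT_KEY into (filesystem, bucket, key).

    Note: keys containing '?' or '#' would need extra care, because
    urlsplit treats them as query and fragment separators.
    """
    parts = urlsplit(uri)
    if parts.scheme not in SCHEME_TO_FILESYSTEM:
        raise ValueError(f"not an S3 FileSystem URI: {uri}")
    bucket = parts.netloc
    key = parts.path.lstrip("/")
    return SCHEME_TO_FILESYSTEM[parts.scheme], bucket, key


def range_header(offset, length):
    """Range header value for reading `length` bytes starting at `offset`.

    HTTP byte ranges are inclusive on both ends, hence the -1.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"
```

Reading the first 100 bytes of `s3a://BUCKET_NAME/OBJECT_KEY` would then amount to a GET Object request on bucket `BUCKET_NAME`, key `OBJECT_KEY`, with the range value produced by `range_header(0, 100)`.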
Under the hood, a Hadoop FileSystem for S3 behaves like any other S3 client. For example, S3AFileSystem uses the AWS SDK, while NativeS3FileSystem uses the JetS3t library.
So, at a minimum, you have to provide an access key and a secret key. Here are the relevant properties for S3AFileSystem.
- fs.s3a.access.key
- fs.s3a.secret.key
You can set them in core-site.xml as the credentials of a default user. For finer-grained security, you can pass them when launching your Hadoop/Spark client with -Djava.property=value. The value then overrides the default one within the scope of that client.
The value can be taken from an environment variable, which is set for each logged-in user.
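The two ways of supplying credentials can be sketched as follows. The first helper renders the two properties in the `configuration`/`property`/`name`/`value` layout that core-site.xml uses; the second builds per-client override arguments from environment variables. The helper names and the variable names `S3_ACCESS_KEY` and `S3_SECRET_KEY` are assumptions made for this example, and where exactly the `-D` arguments go on the command line depends on the client you launch.

```python
import os
import xml.etree.ElementTree as ET

ACCESS_KEY_PROP = "fs.s3a.access.key"
SECRET_KEY_PROP = "fs.s3a.secret.key"


def core_site_xml(access_key, secret_key):
    """Render a minimal core-site.xml body holding the default credentials."""
    root = ET.Element("configuration")
    for name, value in ((ACCESS_KEY_PROP, access_key),
                        (SECRET_KEY_PROP, secret_key)):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def override_args(env=os.environ):
    """Build -Dproperty=value arguments from per-user environment variables.

    S3_ACCESS_KEY and S3_SECRET_KEY are hypothetical variable names;
    a missing variable raises KeyError rather than passing an empty value.
    """
    return [
        f"-D{ACCESS_KEY_PROP}={env['S3_ACCESS_KEY']}",
        f"-D{SECRET_KEY_PROP}={env['S3_SECRET_KEY']}",
    ]
```

Keeping the secret in an environment variable means it never has to be written into a shared configuration file, but it is still visible to anything that can read that user's environment or process arguments.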