FileStruct (https://github.com/appcove/FileStruct) is a general purpose file server and file cache for web application servers.
The primary goal is to create a high-performance and sensible local file server for web applications. The secondary goal is to enable FileStruct to be a caching layer between an application and a storage backend (like Amazon S3).
$ mkdir /path/to/database
$ chmod 770 /path/to/database
Note that the database MUST have group write permissions.
The group of /path/to/database will be used throughout the entire database directory. Any user who wishes to write to the database must be in this group. Permissions are checked on startup, so if the user is not a member of this group, a ConfigError will be raised.
$ echo '{"Version":1}' > /path/to/database/FileStruct.json
Why do we require this file? It is a safeguard against accidentally writing into the wrong directory. If this file does not exist, the database client will raise a ConfigError.
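The two setup requirements above (a group-writable directory and a FileStruct.json marker file) can be verified before starting the application. The following is a minimal sketch using only the standard library; check_database_dir is a hypothetical helper, not part of the FileStruct API, and it does not reproduce the group-membership check that the client performs on startup.

```python
import os
import stat

def check_database_dir(path):
    """Return a list of problems found with a FileStruct-style database directory."""
    if not os.path.isdir(path):
        return [f"{path} is not a directory"]
    problems = []
    # The directory must be writable by its group.
    if not os.stat(path).st_mode & stat.S_IWGRP:
        problems.append("directory is not group-writable")
    # The marker file guards against writing into the wrong directory.
    if not os.path.isfile(os.path.join(path, "FileStruct.json")):
        problems.append("FileStruct.json is missing")
    return problems
```

An empty list means both preconditions hold; the real client may apply further checks.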
If you are running code under Apache, it will by default run as the apache user. You may need to add the apache user to the database's group so that it can access the database. Assuming the database is owned by the fileserver group, the following command adds apache to the fileserver group:
# usermod -a -G fileserver apache
>>> import FileStruct
>>> client = FileStruct.Client('/path/to/database')
>>> client.PutData(b'test')
'a94a8fe5ccb19ba61c4c0873d391e987982fbbd3'
>>> client['a94a8fe5ccb19ba61c4c0873d391e987982fbbd3'].GetData()
b'test'
Assuming you are user jason and you created /path/to/database with the group fileserver, the above will result in:
$ ls -al /path/to/database
drwxrwxr-x. 2 jason fileserver 4096 Feb 22 17:13 Data
drwxrwxr-x. 2 jason fileserver 4096 Feb 22 17:13 Error
-rw-r--r--. 1 fileserver fileserver 88 Feb 22 16:55 FileStruct.json
drwxrwxr-x. 2 jason fileserver 4096 Feb 22 17:13 Temp
drwxrwxr-x. 2 jason fileserver 4096 Feb 22 17:13 Trash
You are now ready to use FileStruct!
The point is not to read the file with application code, but to offload the task to the HTTP frontend, which is highly optimized for serving files and does it much more efficiently.
For the Nginx HTTP daemon, with the aforementioned paths and /FileStruct as the internal URI to serve file contents from (direct client requests to it will result in 404), the following configuration can be used:
location /FileStruct/
{
    internal; # MUST be used, otherwise all files are public (but with obfuscated URIs)
    alias /path/to/database/Data/; # Trailing slash is important
}
Then update the application to send the X-Accel-Redirect header to the HTTP daemon instead of serving the file contents directly (a simple example with the bottle framework):
import FileStruct
from bottle import route, HTTPResponse
client = FileStruct.Client(
    Path = '/path/to/database',
    InternalLocation = '/FileStruct', # Specifies configured Nginx URI
)
@route('/:filename')
def download(filename):
    # file_ids is an application-defined mapping of file names to hashes
    file_id = file_ids[filename]
    return HTTPResponse(headers={'X-Accel-Redirect': client[file_id].InternalURI})
Note that some daemons (e.g. lighttpd) use the X-Sendfile header for such internal redirects instead.
If the HTTP frontend has no support for internal redirects at all, client redirects can still be used for efficiency, but they require an additional HTTP request round trip and must not be used for potentially private files, because the application has no control over access to files requested by InternalURI.
FileStruct is designed to work with files represented by the SHA-1 hash of their contents. This means that all files in FileStruct are immutable.
FileStruct is designed as a local repository of file data accessible (read/write) by an application or web application. All operations are local I/O operations and therefore very fast.
Where possible, streaming hash functions are used to prevent iterating over a file twice.
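As an illustration of the single-pass idea, the sketch below hashes data while copying it, so the source is read only once. It uses the standard hashlib module; copy_and_hash is a hypothetical name and is not taken from the FileStruct source.

```python
import hashlib
import io

def copy_and_hash(src, dst, chunk_size=64 * 1024):
    """Copy src to dst in chunks and return the SHA-1 hex digest of the data."""
    digest = hashlib.sha1()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)  # hash the chunk...
        dst.write(chunk)      # ...and write it in the same pass
    return digest.hexdigest()

out = io.BytesIO()
print(copy_and_hash(io.BytesIO(b'test'), out))
# a94a8fe5ccb19ba61c4c0873d391e987982fbbd3
```

The digest matches the value returned by client.PutData(b'test') in the earlier example.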
FileStruct is designed so that Nginx can serve files directly from its Data directory using an X-Accel-Redirect header. For more information on this Nginx feature, see http://wiki.nginx.org/XSendfile
Assuming that nginx runs under the nginx user and the file database is owned by the fileserver group, nginx needs to be in the fileserver group to serve files:
# usermod -a -G fileserver nginx
FileStruct is designed to be as secure as your hosting configuration. Where possible, a dedicated user should be allocated to read/write to FileStruct, and the database directory restricted to this user.
FileStruct is designed to be incredibly simple to use.
FileStruct is designed to simplify common operations on files, especially uploaded files. Image resizing for thumbnails is supported.
FileStruct is designed to simplify the use of temporary files in an application. The API supports creating a temporary directory, placing files in it, ingesting files into FileStruct, and deleting the directory when completed (or retaining it in the event of an error).
FileStruct is designed to retain files until garbage collection is performed. Garbage collection consists of telling FileStruct what files you are interested in keeping, and having it move the remaining files to the trash.
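The garbage collection idea can be sketched as follows. This is an illustration built on the directory layout described in this document, not FileStruct's own routine: collect_garbage is a hypothetical function, and the timestamp-only name of the Trash subdirectory is a simplification.

```python
import os
import re
import time

HASH_RE = re.compile(r'[0-9a-f]{40}')

def collect_garbage(database, keep):
    """Move every hash file under Data that is not in `keep` into a Trash subdirectory.

    Returns the sorted list of hashes that were moved.
    """
    data_dir = os.path.join(database, 'Data')
    trash_dir = os.path.join(database, 'Trash', time.strftime('%Y%m%d%H%M%S'))
    moved = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if HASH_RE.fullmatch(name) and name not in keep:
                os.makedirs(trash_dir, exist_ok=True)
                # A rename within one filesystem moves the file without copying it.
                os.rename(os.path.join(root, name), os.path.join(trash_dir, name))
                moved.append(name)
    return sorted(moved)
```

The keep argument is the set of hashes the application still references; if several applications share one database, it has to cover all of them.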
FileStruct is designed to work seamlessly with rsync for backups and restores.
At the point a file is inserted or removed from FileStruct, it is a filesystem move operation. This means that under no circumstances will a file exist in FileStruct that has contents that do not match the name of the file.
FileStruct is not designed to store metadata. It is designed to store file content. There may be several "files" which refer to the same content: empty.log, empty.txt, and empty.ini may all refer to the empty file Data/da/39/da39a3ee5e6b4b0d3255bfef95601890afd80709. This file will be retained as long as any aspect of the application still uses it.
Because file content is stored in files with the hash of the content, automatic file-level de-duplication occurs. When a file is pushed to FileStruct that already exists, there is no need to write it again.
This carries the distinct benefit of being able to use the same FileStruct database across multiple projects if desired, because the content of file Data/da/39/da39a3ee5e6b4b0d3255bfef95601890afd80709 is always the same, regardless of the application that placed it there.
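The combination of content hashing, move-into-place, and de-duplication can be sketched like this. The function names hash_path and ingest are hypothetical and the code is not taken from FileStruct; it only assumes the Data/xx/yy/hash layout shown in this document.

```python
import hashlib
import os

def hash_path(data_dir, sha1_hex):
    """Map a 40-character SHA-1 hex digest to Data/xx/yy/<hash>."""
    return os.path.join(data_dir, sha1_hex[0:2], sha1_hex[2:4], sha1_hex)

def ingest(temp_path, data_dir):
    """Hash the file at temp_path and move it into data_dir; return the hash."""
    digest = hashlib.sha1()
    with open(temp_path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            digest.update(chunk)
    sha1_hex = digest.hexdigest()
    target = hash_path(data_dir, sha1_hex)
    if os.path.exists(target):
        # Same content is already stored, so nothing needs to be written.
        os.remove(temp_path)
    else:
        os.makedirs(os.path.dirname(target), exist_ok=True)
        # The file appears under its final name only once it is complete.
        os.rename(temp_path, target)
    return sha1_hex
```

Ingesting empty.log and empty.txt (both empty) this way yields the same hash, da39a3ee5e6b4b0d3255bfef95601890afd80709, and a single stored file. The sketch assumes the temporary file and the Data directory live on the same filesystem, which os.rename requires.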
Note: In the event that multiple instances or applications use the same database, the garbage collection routine MUST take all references to a given hash into account, across all applications that use the database. Otherwise, it would be easy to delete data that should be retained.
The database should be placed in a secure directory that only the owner of the application can read and write to.
SECURITY NOTE: mod_wsgi runs by default as the apache user. It can be configured to run as a different user. We recommend a dedicated user to run the application and access FileStruct files.
/path/to/app/database
    FileStruct.json    (contains a JSON value like {"Version": 1, "User": "MyApp", "Group": "MyApp"})
    Data
        {00-ff}
            {00-ff}
                [0-9a-f]{40}
        da
            39
                da39a3ee5e6b4b0d3255bfef95601890afd80709
                da3968daecd823babbb58edb1c8e14d7106e83bb
        ...
    Error
        20130220151717-86718750-24386270
            Python-Exception.txt
            file1.whatever
            yourfile.txt
        ...
    Temp
        20130220151713-62109375-21427441
            upload.jpg
            resize.jpg
        ...
    Trash
        20130220164717-46718750-24343534
            da39a3ee5e6b4b0d3255bfef95601890afd80709
            77de68daecd823babbb58edb1c8e14d7106e83bb
            f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59
            35139ef894b28b73bea022755166a23933c7d9cb
        ...
In order for the FileStruct Client to operate, the FileStruct.json file must be present and readable. If any of the above top-level directories are missing, they will be automatically created by FileStruct.
Each file placed in the database/Data directory will have write permissions removed. This helps prevent accidental modification of immutable data in the database.
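Removing write permissions from a stored file can be done with os.chmod. The helper below is a sketch of the idea, with make_read_only as a hypothetical name; which permission bits FileStruct itself clears is not stated here, so clearing all three write bits is an assumption.

```python
import os
import stat

def make_read_only(path):
    """Clear the user, group, and other write bits, leaving other bits unchanged."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

A file with mode 0664 ends up as 0444. The owner can still change the mode back, so this guards against accidents rather than against deliberate modification.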
Each time a FileStruct.Client object is created, the FileStruct.json file is loaded. The contents of this file are a simple JSON string.
Note: Lines beginning with # are ignored.
{
"Version": 1,
"User": "MyApp",
"Group": "MyApp"
}
For future adjustments to the database format. Currently must be 1.
The user that "owns" the database. Can be an integer UID or string username. "User": 500 and "User": "MyApp" are both valid.
The primary group that "owns" the database. Can be an integer GID or string group name. "Group": 500 and "Group": "MyApp" are both valid.
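The configuration rules above (lines beginning with # are ignored, Version must be 1, User and Group may be integers or strings) can be sketched as a small loader. load_config is a hypothetical function that illustrates the format; it is not the parser used by FileStruct, and raising ValueError rather than ConfigError is a simplification.

```python
import json

def load_config(text):
    """Parse FileStruct.json-style text, skipping lines that begin with '#'."""
    lines = [line for line in text.splitlines() if not line.startswith('#')]
    config = json.loads('\n'.join(lines))
    if not isinstance(config, dict):
        raise ValueError('configuration must be a JSON object')
    if config.get('Version') != 1:
        raise ValueError('Version must be 1')
    for key in ('User', 'Group'):
        # Each may be a numeric ID or a name.
        if key in config and not isinstance(config[key], (int, str)):
            raise ValueError(f'{key} must be an integer or a string')
    return config
```

For example, load_config('# owner settings\n{"Version": 1, "User": 500, "Group": "MyApp"}') returns the three keys as a dictionary.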
Import FileStruct and create an instance of the Client class. This operation will open FileStruct.json, verify its contents, and check for the existence of several directories. Therefore, it is best to create a single instance and reuse it.
FileStruct.Client instance methods are THREAD SAFE.
import FileStruct
client = FileStruct.Client(
    Path = '/home/myapp/filestruct',
    InternalLocation = '/FileStruct',
)
Returns True if the specified hash exists in the database. Otherwise returns False. Improperly formed hashes do not raise an error in this method.
Returns a FileStruct.HashFile object or raises a KeyError. See Working with Files for more information.
Fully qualified filesystem path to the database.
Full path to the ImageMagick convert binary. Defaults to /usr/bin/convert.
Returns a FileStruct.TempDir object (context manager) which will create a temporary directory and (typically) remove it when finished. See Working with Files for more information.
with client.TempDir() as TempDir:
    # Close the file before saving so that all data is flushed to disk
    with open(TempDir['data.dat'].Path, 'wb') as f:
        f.write(mydata)
    hash = TempDir['data.dat'].Save()
Reads all data from stream, which must be an object with a .read() interface, returning bytes. Does not attempt to rewind first, so make sure the stream is "ready to read". Places the file in the database and returns the hash.
Takes a bytes object and saves it to the database. Returns the hash.
Takes the path to a file. Reads the file into the database. Does not modify the original file. Returns the hash.
Returns True if the specified hash exists in the database. Otherwise returns False. Improperly formed hashes do not raise an error in this method.
Future Note: in the event that FileStruct has a remote back-end, like Amazon S3, this could be a resource-intensive operation.
Returns a FileStruct.HashFile object which wraps a file in the database. If the file does not exist, a KeyError is raised.
Returns the full path to this temporary file (regardless of existence)
The 40 character hash.
The full filesystem path to the hash file in the database. This is for READ ONLY purposes. Because the process calling this code has authority to write to this file, the database could be corrupted if this path is written to in any way.
Opens the hash file in the database for reading (bytes). Because this is a pass-through to open(), it can be used as a context manager (with statement).
Reads the entire file into memory as a bytes object.
Warning: do not use this with large files.
Returns an internal URI suitable for passing back to a front-end webserver, such as nginx. Joins the client.InternalLocation with the rest of the database/Data/... path to produce a URL that can be used with X-Accel-Redirect.
headers.add_header('Content-type', 'image/jpeg')
headers.add_header('X-Accel-Redirect', client[hash].InternalURI)
Example nginx configuration snippet:
location ^~ /FileStruct/
{
    internal; # MUST be used, otherwise all files are public (but with obfuscated URIs)
    alias /path/to/database/Data/; # Trailing slash is important
}
Example return:
>>> client = FileStruct.Client(Path, InternalLocation='/FileStruct')
>>> client['da39a3ee5e6b4b0d3255bfef95601890afd80709'].InternalURI
'/FileStruct/da/39/da39a3ee5e6b4b0d3255bfef95601890afd80709'
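The join described above can be reproduced with a few lines of string handling. This is a sketch of the mapping only; internal_uri is a hypothetical standalone function, whereas in FileStruct the value is exposed as the InternalURI property.

```python
def internal_uri(internal_location, sha1_hex):
    """Join an internal location prefix with the xx/yy/<hash> path of a file."""
    parts = [internal_location.rstrip('/'), sha1_hex[0:2], sha1_hex[2:4], sha1_hex]
    return '/'.join(parts)

print(internal_uri('/FileStruct', 'da39a3ee5e6b4b0d3255bfef95601890afd80709'))
# /FileStruct/da/39/da39a3ee5e6b4b0d3255bfef95601890afd80709
```

With the alias directive pointing at /path/to/database/Data/, nginx resolves this URI to the file Data/da/39/da39a3ee5e6b4b0d3255bfef95601890afd80709.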
More info on XSendFile here: http://wiki.nginx.org/XSendfile
Return a context manager which will create a temporary directory and (typically) remove it when the context manager exits. For example:
with client.TempDir() as TempDir:
    # Close the file before resizing so that all data is flushed to disk
    with open(TempDir.FilePath('upload.jpg'), 'wb') as f:
        f.write(mydata)
    TempDir.ResizeImage('upload.jpg', 'resize.jpg', '100x100')
    hash1 = TempDir.Save('upload.jpg')
    hash2 = TempDir.Save('resize.jpg')
When the context manager is entered, the directory is:
- Created in database/Temp/...
- Named YYYYMMDDhhmmss.fraction.randomnn
- For example: database/Temp/20130220154544-39453125-17036182
When the context manager is exited, the directory is:
- Removed (default action)
- Moved to database/Error (in the event TempDir.Retain == True)
- Moved to database/Error with Python-Exception.txt written (in the event of an exception)
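The enter and exit behavior listed above can be sketched as a small context manager. TempDirSketch is a hypothetical class written for illustration and is not the FileStruct.TempDir implementation; in particular, it names the directory with tempfile.mkdtemp instead of the timestamp-based scheme.

```python
import os
import shutil
import tempfile
import traceback

class TempDirSketch:
    """Temporary directory that is removed on exit, or moved to Error."""

    def __init__(self, database):
        self.database = database
        self.Retain = False
        self.Path = None

    def __enter__(self):
        temp_root = os.path.join(self.database, 'Temp')
        os.makedirs(temp_root, exist_ok=True)
        self.Path = tempfile.mkdtemp(dir=temp_root)
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None and not self.Retain:
            shutil.rmtree(self.Path)  # default action: remove
            return False
        # Retain requested or an exception occurred: keep the files for inspection.
        error_root = os.path.join(self.database, 'Error')
        os.makedirs(error_root, exist_ok=True)
        dest = os.path.join(error_root, os.path.basename(self.Path))
        shutil.move(self.Path, dest)
        if exc_type is not None:
            with open(os.path.join(dest, 'Python-Exception.txt'), 'w') as f:
                f.write(''.join(traceback.format_exception(exc_type, exc, tb)))
        self.Path = dest
        return False  # never swallow the exception
```

Returning False from __exit__ lets the original exception propagate after the directory has been preserved.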
A reference to the FileStruct.Client object that created this FileStruct.TempDir object.
The full path of the temporary directory.
Defaults to False. Set to True to cause the temporary directory to be moved to the database/Error directory when the context manager exits (e.g. end of with statement).
Returns a TempFile object with the name specified in filename.
filename is restricted to the following: [a-zA-Z0-9_.+-]{1,255}
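The filename restriction can be checked with a full-match regular expression. The sketch below applies the pattern exactly as documented; is_valid_filename is a hypothetical helper, and whether FileStruct performs additional checks is not covered here.

```python
import re

FILENAME_RE = re.compile(r'[a-zA-Z0-9_.+-]{1,255}')

def is_valid_filename(name):
    """Return True if the whole name consists of 1 to 255 allowed characters."""
    return FILENAME_RE.fullmatch(name) is not None
```

Because the slash is not an allowed character, names such as ../etc/passwd are rejected. Note that the pattern as written still matches . and .., so an application reusing it on its own may want to exclude those two names explicitly.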
Create a symbolic link in the temporary directory to the specified hash file in the database. This is useful for obtaining access to files for subsequent operations, like an image resize.
Opens the temporary file for reading (bytes). Because this is a pass-through to open(), it can be used as a context manager (with statement).
Reads the entire temporary file into memory as a bytes object.
Warning: do not use this with large files.
Opens filename in the temporary directory for writing, and writes the entire contents of stream to it.
Opens filename in the temporary directory for writing and writes the entire contents of data to it. data must be bytes.
Opens filename in the temporary directory for writing and writes the entire contents of file to it.
Calculates the hash of this file and then moves it into the database. Returns the 40 character hash. This moves the file, so it will no longer exist in the temporary directory.
This will take a file suitable for input to ImageMagick convert and both resize and normalize it to the specified pixel dimensions. Smaller images will be enlarged, and a region of pixel_width by pixel_height will be taken from the center of the image. This is most useful for profile pictures, as illustrated in the following code sample:
with App.FS.TempDir() as TD:
    TD['upload'].PutStream(self.File['Profile_Pic'].Stream)
    TD.convert_normalize('upload', 'large.jpg', 512, 512)
    TD.convert_normalize('large.jpg', 'small.jpg', 128, 128)
    hash1 = TD['large.jpg'].Ingest()
    hash2 = TD['small.jpg'].Ingest()
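For readers curious how a resize-then-center-crop can be expressed with ImageMagick, the sketch below builds one possible convert command line. The exact options FileStruct passes are not shown in this document, so this is an assumption; normalize_command is a hypothetical helper that only constructs the argument list and does not run anything.

```python
def normalize_command(convert_binary, source, target, pixel_width, pixel_height):
    """Build a convert argument list that fills, then center-crops to an exact size."""
    size = f'{pixel_width}x{pixel_height}'
    return [
        convert_binary, source,
        '-resize', size + '^',   # scale (up or down) until the image covers the box
        '-gravity', 'center',    # anchor the crop at the center
        '-extent', size,         # cut the canvas to exactly width x height
        target,
    ]
```

The resulting list could be handed to subprocess.run(..., check=True) with the path of the convert binary, which defaults to /usr/bin/convert in the client configuration.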
This is similar to convert_normalize except that it takes an ImageMagick size specification as the third argument. See the ImageMagick documentation for the geometry syntax.
For example:
'64x64>' will only shrink an image if it is larger.
'100x' will resize the image to 100px wide.