Backend Stores, Object Cache and Disk Cache
Every node of a Sheepdog cluster has a backend store that provides weighted storage for the cluster to store objects (e.g., VDI and data objects), whose locations are determined by a consistent hashing algorithm.
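As a rough illustration of how placement works (a minimal C sketch with a made-up hash function and node list, not Sheepdog's actual code), locating an object on the hash ring could look like this:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical node descriptor: each node is hashed onto the ring. */
struct node {
    uint64_t ring_pos;  /* position on the hash ring */
    const char *addr;
};

/* Toy 64-bit FNV-1a hash; Sheepdog uses its own hash internally. */
static uint64_t hash64(uint64_t x)
{
    uint64_t h = 14695981039346656037ULL;
    for (int i = 0; i < 8; i++) {
        h ^= (x >> (i * 8)) & 0xff;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Walk clockwise: the first node at or after the object's hash position
 * stores the object.  nodes[] must be sorted by ring_pos. */
static const struct node *locate(uint64_t oid, const struct node *nodes, int n)
{
    uint64_t pos = hash64(oid);

    for (int i = 0; i < n; i++)
        if (nodes[i].ring_pos >= pos)
            return &nodes[i];
    return &nodes[0];   /* wrap around the ring */
}

int main(void)
{
    struct node nodes[] = {
        { 1ULL << 61, "10.0.0.1" },
        { 1ULL << 62, "10.0.0.2" },
        { 3ULL << 62, "10.0.0.3" },
    };
    uint64_t oid = 0x8000000000000123ULL;   /* example data object id */

    printf("object %llx -> node %s\n", (unsigned long long)oid,
           locate(oid, nodes, 3)->addr);
    return 0;
}

Weighting is omitted from this sketch; in a weighted ring a node typically appears at multiple positions in proportion to its capacity.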
Object cache caches data and VDI objects on the local node that runs a sheep daemon. It sits at a higher level than the backend store. This extra cache layer translates gateway requests (from the VM) into local requests, greatly reducing network traffic and improving IO performance, at the expense of temporary inconsistency between objects in the object cache and the backend store. Dirty objects are flushed to cluster storage by 'sync' requests from the guest OS.
Object cache supports a cache quota. As explained below, it can be specified with a command line option.
In previous versions of Sheepdog, writeback or writethrough behavior of the object cache could be specified with a command line option. In current Sheepdog, writeback or writethrough is determined by the request from QEMU.
If you run QEMU without a local sheep daemon, be aware that objects won't be cached on the local node; instead, they will be cached on the node QEMU is remotely connected to.
Let's put it all together from the perspective of a request from the VM (with object cache enabled):
IO req (from VM) <--> object cache on local node <--> gateway req <--> targeted sheep backend store
Sheepdog has supported two kinds of backend store: one is Simple store (since removed) and the other is called Farm (currently the default).
Simple store, as the name suggests, simply translates requests into system calls that store data in files backed by local file systems.
Farm is a backend store with a much larger code base, intended to provide advanced features such as cluster-wide snapshot, faster recovery, better stale object handling, data de-duplication (not implemented yet) and so on, while keeping IO performance comparable to Simple store.
The Sheepdog users at Taobao.com have used Farm as their default store since its inception and continue to tune its performance and add features. You can also request any feature you think appropriate on the mailing list.
Generally speaking, we support two modes: writeback, which caches write updates for a period of time and flushes dirty data on a 'sync' request from the guest OS, routed by QEMU; and writethrough, in which we don't need to worry about consistency between cache and backend. In writethrough mode, in fact, QEMU won't issue 'sync' requests at all.
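The difference between the two modes can be sketched in C (illustrative function names only, not Sheepdog's real API):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum cache_mode { WRITETHROUGH, WRITEBACK };

static bool dirty;

static void backend_write(uint64_t oid)
{
    printf("gateway request: store object %llx on the targeted sheep\n",
           (unsigned long long)oid);
}

static void vm_write(uint64_t oid, enum cache_mode mode)
{
    printf("cache object %llx in the local object cache\n",
           (unsigned long long)oid);
    if (mode == WRITETHROUGH)
        backend_write(oid);     /* cache and backend stay consistent */
    else
        dirty = true;           /* flushed later, on guest 'sync' */
}

static void vm_sync(uint64_t oid)
{
    /* QEMU routes the guest's 'sync' here; writethrough never needs it */
    if (dirty) {
        backend_write(oid);
        dirty = false;
    }
}

int main(void)
{
    vm_write(0x123, WRITEBACK);
    vm_sync(0x123);
    vm_write(0x456, WRITETHROUGH);  /* durable immediately, no sync needed */
    return 0;
}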
There are currently two caches supported by Sheepdog: one is the object cache and the other is the disk cache. They operate at different layers: the object cache can be thought of as a local cache, while the disk cache controls how we write to each host's disks. If the disk cache is enabled, sheep daemons open() their objects without the O_DSYNC flag on the backend, which means we take advantage of the write cache built into the hard drive.
Disk cache improves the performance of Sheepdog, but you must be careful when you use it.
If the disk cache is enabled, completed write requests to sheep daemons do not guarantee that data has reached persistent storage. So an explicit disk cache flush is required when guest OSes issue sync requests to their QEMU block devices. These sync requests cause performance degradation, which is especially serious in an environment containing many VMs.
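To make this concrete, here is a minimal C sketch of the idea (an illustration with hypothetical function names, not Sheepdog's actual store code):

#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

static int open_object(const char *path, bool disk_cache_enabled)
{
    int flags = O_RDWR | O_CREAT;

    if (!disk_cache_enabled)
        flags |= O_DSYNC;   /* every write reaches stable storage */

    return open(path, flags, 0600);
}

static int flush_object(int fd, bool disk_cache_enabled)
{
    /* With the disk cache on, a completed write() is not yet durable;
     * the guest's sync request must become a real flush. */
    if (disk_cache_enabled)
        return fdatasync(fd);
    return 0;               /* O_DSYNC already made writes durable */
}

int main(void)
{
    int fd = open_object("./obj_0000000000000123", true);

    if (fd < 0)
        return 1;
    write(fd, "data", 4);   /* may sit in the drive's write cache */
    flush_object(fd, true); /* guest 'sync' -> explicit flush */
    close(fd);
    return 0;
}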
This performance degradation is workload specific, so you should evaluate the effect of the disk cache carefully. Object cache supports both writeback and writethrough semantics. In writeback mode you'll enjoy near-native local image performance, and in writethrough mode you still get a performance boost for read requests (you can think of it as a read-only cache).
We can control which mode the object/disk cache uses by manipulating the 'cache' option on the QEMU command line. For example,
$ qemu-system-x86_64 --enable-kvm -m 1024 -drive file=sheepdog:test,cache=writeback
enables writeback mode for the virtual disk image (VDI) named "test", and
$ qemu-system-x86_64 --enable-kvm -m 1024 -drive file=sheepdog:test
or
$ qemu-system-x86_64 --enable-kvm -m 1024 -drive file=sheepdog:test,cache=writethrough
enables writethrough mode.
Both object cache and disk cache are disabled by default in Sheepdog. '-w' is used for enabling the object cache on the local node and the disk cache's writeback semantics in backend stores. Examples of '-w':
-w disk                      # enable writeback cache semantics of disks
-w disk,object:size=50       # enable writeback cache semantics of disks, and enable object cache with 50MB memory
-w object:size=50            # enable object cache with 50MB memory
-w object:size=50:directio   # enable object cache with 50MB memory, with O_DIRECT for cached objects
To modify the max object cache size:
$ collie node cache 300
NOTE: 'max cache size' is a hint to Sheepdog. When the object cache size reaches the specified 'max cache size', Sheepdog begins reclaiming, which tries to shrink the cache size to a lower watermark. So in some corner cases the object cache may exceed the specified max size, when the rate of reclaiming is lower than the rate at which new objects are created by in-flight IO requests.
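The reclaim behavior can be pictured with a small sketch (hypothetical function names; the 80% low watermark and the megabyte unit for 'collie node cache 300' are assumptions, not Sheepdog's actual constants):

#include <stdint.h>
#include <stdio.h>

#define LOW_WATERMARK_PERCENT 80    /* assumed, not Sheepdog's constant */

static uint64_t cache_size;         /* current bytes in the object cache */

static uint64_t evict_coldest_object(void)
{
    /* drop the (clean) least-recently-used object; return bytes freed */
    return 4 * 1024 * 1024;         /* e.g. a 4MB data object */
}

static void reclaim(uint64_t max_size)
{
    uint64_t target = max_size * LOW_WATERMARK_PERCENT / 100;

    /* triggered once cache_size reaches max_size; shrink toward target */
    while (cache_size > target)
        cache_size -= evict_coldest_object();
}

int main(void)
{
    /* assuming 'collie node cache 300' means 300MB */
    uint64_t max = 300ULL * 1024 * 1024;

    cache_size = max;               /* cache just hit the limit */
    reclaim(max);
    printf("cache shrunk to %llu MB\n",
           (unsigned long long)(cache_size / (1024 * 1024)));
    return 0;
}

Because reclaiming runs alongside in-flight IO, the cache can sit above 'max' whenever eviction falls behind new writes, which is exactly the corner case the note describes.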
There are some more options for finer control over how the object cache performs reads and writes internally.
By default, the object cache layer tries to utilize the page cache (memory cache) as much as possible, so if you want a more durable cache, you can specify the '-w object:directio' option. This means we don't use the kernel's page cache to further cache data before it reaches disk, and thus that data can survive a host OS crash.
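At the file level, ':directio' boils down to opening cache files with O_DIRECT, which bypasses the kernel page cache and requires aligned IO. A minimal illustration (not Sheepdog's code; the path is made up):

#define _GNU_SOURCE         /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int fd = open("./cached_object", O_RDWR | O_CREAT | O_DIRECT, 0600);

    if (fd < 0)
        return 1;
    if (posix_memalign(&buf, 4096, 4096))   /* O_DIRECT needs aligned buffers */
        return 1;
    memset(buf, 0, 4096);

    write(fd, buf, 4096);   /* bypasses the page cache, goes to disk */
    free(buf);
    close(fd);
    return 0;
}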
Since qemu-img uses 'writeback' or 'unsafe' mode as its default option, with object cache added to Sheepdog we should explicitly pass a cache control option to stop it from doing anything wrong.
To convert an image to VDI 'test':
$ qemu-img convert -t writethrough linux-0.2.img sheepdog:test
To snapshot an image named 'test' tagged as 'snap':
$ collie vdi snapshot -s snap test
Snapshot images can be used as base images for cloned VMs. To clone the snapshot 'snap' of image 'test' as 'cloned_vm':
$ collie vdi clone -s snap test cloned_vm
Cloned VMs are implemented with copy-on-write semantics in the Sheepdog cluster. This means the cloning operation is very fast and storage-wise cheap: cloned VMs share as many data objects as possible with the base image. Sheepdog also supports tree-structured cloning: you can snapshot a cloned VM, use it as a new base, and so on.
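A rough sketch of the copy-on-write idea (hypothetical structures, not Sheepdog's actual VDI code): a clone owns no data objects at first and falls back to its base image on reads; the first write to a shared object gives the clone a private copy.

#include <stdbool.h>
#include <stdio.h>

#define NR_OBJS 4

struct vdi {
    struct vdi *base;       /* snapshot this VDI was cloned from */
    bool own[NR_OBJS];      /* which data objects it owns itself */
};

static const struct vdi *find_owner(const struct vdi *v, int idx)
{
    /* walk up the clone tree until some ancestor owns the object */
    while (v && !v->own[idx])
        v = v->base;
    return v;
}

static void cow_write(struct vdi *v, int idx)
{
    /* first write to a shared object: copy it, then own it privately */
    if (!v->own[idx])
        v->own[idx] = true;
}

int main(void)
{
    struct vdi snap  = { NULL, { true, true, true, true } };
    struct vdi clone = { &snap, { false } };    /* shares everything */

    printf("read obj 1 from %s\n",
           find_owner(&clone, 1) == &snap ? "base image" : "clone");
    cow_write(&clone, 1);                       /* breaks sharing for obj 1 */
    printf("read obj 1 from %s\n",
           find_owner(&clone, 1) == &snap ? "base image" : "clone");
    return 0;
}

The tree-structured cloning mentioned above falls out of the same walk: each snapshot of a clone just becomes another 'base' pointer in the chain.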