Commit cafe5635 authored by Kent Overstreet

bcache: A block layer cache

Does writethrough and writeback caching, handles unclean shutdown, and
has a bunch of other nifty features motivated by real world usage.

See the wiki at http://bcache.evilpiepirate.org for more.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
parent ea6749c7
What: /sys/block/<disk>/bcache/unregister
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
A write to this file causes the backing device or cache to be
unregistered. If a backing device had dirty data in the cache,
writeback mode is automatically disabled and all dirty data is
flushed before the device is unregistered. Caches unregister
all associated backing devices before unregistering themselves.
What: /sys/block/<disk>/bcache/clear_stats
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
Writing to this file resets all the statistics for the device.
What: /sys/block/<disk>/bcache/cache
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a backing device that has a cache, a symlink to
the bcache/ dir of that cache.
What: /sys/block/<disk>/bcache/cache_hits
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: integer number of full cache hits,
counted per bio. A partial cache hit counts as a miss.
What: /sys/block/<disk>/bcache/cache_misses
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: integer number of cache misses.
What: /sys/block/<disk>/bcache/cache_hit_ratio
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: cache hits as a percentage.
What: /sys/block/<disk>/bcache/sequential_cutoff
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: Threshold past which sequential IO will
skip the cache. Read and written as bytes in human readable
units (e.g. echo 10M > sequential_cutoff).
What: /sys/block/<disk>/bcache/bypassed
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
Sum of all reads and writes that have bypassed the cache (due
to the sequential cutoff). Expressed as bytes in human
readable units.
What: /sys/block/<disk>/bcache/writeback
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: When on, writeback caching is enabled and
writes will be buffered in the cache. When off, caching is in
writethrough mode; reads and writes will be added to the
cache but no write buffering will take place.
What: /sys/block/<disk>/bcache/writeback_running
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: when off, dirty data will not be written
from the cache to the backing device. The cache will still be
used to buffer writes until it is mostly full, at which point
writes transparently revert to writethrough mode. Intended only
for benchmarking/testing.
What: /sys/block/<disk>/bcache/writeback_delay
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: In writeback mode, when dirty data is
written to the cache and the cache held no dirty data for that
backing device, writeback from cache to backing device starts
after this delay, expressed as an integer number of seconds.
What: /sys/block/<disk>/bcache/writeback_percent
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: If nonzero, writeback from cache to
backing device only takes place when more than this percentage
of the cache is used, allowing more write coalescing to take
place and reducing total number of writes sent to the backing
device. Integer between 0 and 40.
What: /sys/block/<disk>/bcache/synchronous
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a cache, a boolean that allows synchronous mode to be
switched on and off. In synchronous mode all writes are ordered
such that the cache can reliably recover from unclean shutdown;
if disabled bcache will not generally wait for writes to
complete but if the cache is not shut down cleanly all data
will be discarded from the cache. Should not be turned off with
writeback caching enabled.
What: /sys/block/<disk>/bcache/discard
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a cache, a boolean allowing discard/TRIM to be turned off
or back on if the device supports it.
What: /sys/block/<disk>/bcache/bucket_size
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a cache, bucket size in human readable units, as set at
cache creation time; should match the erase block size of the
SSD for optimal performance.
What: /sys/block/<disk>/bcache/nbuckets
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a cache, the number of usable buckets.
What: /sys/block/<disk>/bcache/tree_depth
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a cache, height of the btree excluding leaf nodes (i.e. a
one node tree will have a depth of 0).
What: /sys/block/<disk>/bcache/btree_cache_size
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
Number of btree buckets/nodes that are currently cached in
memory; cache dynamically grows and shrinks in response to
memory pressure from the rest of the system.
What: /sys/block/<disk>/bcache/written
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a cache, total amount of data in human readable units
written to the cache, excluding all metadata.
What: /sys/block/<disk>/bcache/btree_written
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a cache, sum of all btree writes in human readable units.
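
The cache_hits, cache_misses and cache_hit_ratio entries above can be read together. The following is a minimal user-space sketch of that accounting, not the driver's code: struct hit_stats, account_bio() and hit_ratio_percent() are hypothetical names, and the truncating integer division is an assumption, since the text only says "as a percentage".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-device counters; the field names mirror the sysfs files. */
struct hit_stats {
	uint64_t cache_hits;
	uint64_t cache_misses;
};

/*
 * Account one bio. Only a request served entirely from the cache counts
 * as a hit; a partial hit is recorded as a miss.
 */
static void account_bio(struct hit_stats *s, uint64_t sectors_total,
			uint64_t sectors_from_cache)
{
	assert(sectors_from_cache <= sectors_total);

	if (sectors_total && sectors_from_cache == sectors_total)
		s->cache_hits++;
	else
		s->cache_misses++;
}

/* Hits as an integer percentage; 0 when nothing has been counted yet. */
static unsigned hit_ratio_percent(const struct hit_stats *s)
{
	uint64_t total = s->cache_hits + s->cache_misses;

	/* Assumption: the ratio is truncated, not rounded. */
	return total ? (unsigned)(s->cache_hits * 100 / total) : 0;
}
```

Because partial hits are counted as misses, a workload whose requests straddle cached and uncached ranges will report a lower ratio than the fraction of bytes actually served from the cache.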
@@ -1616,6 +1616,13 @@ W: http://www.baycom.org/~tom/ham/ham.html
S: Maintained
F: drivers/net/hamradio/baycom*
BCACHE (BLOCK LAYER CACHE)
M: Kent Overstreet <koverstreet@google.com>
L: linux-bcache@vger.kernel.org
W: http://bcache.evilpiepirate.org
S: Maintained
F: drivers/md/bcache/
BEFS FILE SYSTEM
S: Orphan
F: Documentation/filesystems/befs.txt
@@ -174,6 +174,8 @@ config MD_FAULTY
If unsure, say N.
source "drivers/md/bcache/Kconfig"
config BLK_DEV_DM
tristate "Device mapper support"
---help---
@@ -29,6 +29,7 @@ obj-$(CONFIG_MD_RAID10) += raid10.o
obj-$(CONFIG_MD_RAID456) += raid456.o
obj-$(CONFIG_MD_MULTIPATH) += multipath.o
obj-$(CONFIG_MD_FAULTY) += faulty.o
obj-$(CONFIG_BCACHE) += bcache/
obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
config BCACHE
tristate "Block device as cache"
select CLOSURES
---help---
Allows a block device to be used as cache for other devices; uses
a btree for indexing and the layout is optimized for SSDs.
See Documentation/bcache.txt for details.
config BCACHE_DEBUG
bool "Bcache debugging"
depends on BCACHE
---help---
Don't select this option unless you're a developer.
Enables extra debugging tools (primarily a fuzz tester).
config BCACHE_EDEBUG
bool "Extended runtime checks"
depends on BCACHE
---help---
Don't select this option unless you're a developer.
Enables extra runtime checks which significantly affect performance.
config BCACHE_CLOSURES_DEBUG
bool "Debug closures"
depends on BCACHE
select DEBUG_FS
---help---
Keeps all active closures in a linked list and provides a debugfs
interface to list them, which makes it possible to see asynchronous
operations that get stuck.
# cgroup code needs to be updated:
#
#config CGROUP_BCACHE
# bool "Cgroup controls for bcache"
# depends on BCACHE && BLK_CGROUP
# ---help---
# TODO
obj-$(CONFIG_BCACHE) += bcache.o
bcache-y := alloc.o btree.o bset.o io.o journal.o writeback.o\
movinggc.o request.o super.o sysfs.o debug.o util.o trace.o stats.o closure.o
CFLAGS_request.o += -Iblock
/*
* Asynchronous refcounty things
*
* Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com>
* Copyright 2012 Google, Inc.
*/
#include <linux/debugfs.h>
#include <linux/module.h>
#include <linux/seq_file.h>
#include "closure.h"
void closure_queue(struct closure *cl)
{
struct workqueue_struct *wq = cl->wq;
if (wq) {
INIT_WORK(&cl->work, cl->work.func);
BUG_ON(!queue_work(wq, &cl->work));
} else
cl->fn(cl);
}
EXPORT_SYMBOL_GPL(closure_queue);
#define CL_FIELD(type, field) \
case TYPE_ ## type: \
return &container_of(cl, struct type, cl)->field
static struct closure_waitlist *closure_waitlist(struct closure *cl)
{
switch (cl->type) {
CL_FIELD(closure_with_waitlist, wait);
CL_FIELD(closure_with_waitlist_and_timer, wait);
default:
return NULL;
}
}
static struct timer_list *closure_timer(struct closure *cl)
{
switch (cl->type) {
CL_FIELD(closure_with_timer, timer);
CL_FIELD(closure_with_waitlist_and_timer, timer);
default:
return NULL;
}
}
static inline void closure_put_after_sub(struct closure *cl, int flags)
{
int r = flags & CLOSURE_REMAINING_MASK;
BUG_ON(flags & CLOSURE_GUARD_MASK);
BUG_ON(!r && (flags & ~(CLOSURE_DESTRUCTOR|CLOSURE_BLOCKING)));
/* Must deliver precisely one wakeup */
if (r == 1 && (flags & CLOSURE_SLEEPING))
wake_up_process(cl->task);
if (!r) {
if (cl->fn && !(flags & CLOSURE_DESTRUCTOR)) {
/* CLOSURE_BLOCKING might be set - clear it */
atomic_set(&cl->remaining,
CLOSURE_REMAINING_INITIALIZER);
closure_queue(cl);
} else {
struct closure *parent = cl->parent;
struct closure_waitlist *wait = closure_waitlist(cl);
closure_debug_destroy(cl);
atomic_set(&cl->remaining, -1);
if (wait)
closure_wake_up(wait);
if (cl->fn)
cl->fn(cl);
if (parent)
closure_put(parent);
}
}
}
/* For clearing flags with the same atomic op as a put */
void closure_sub(struct closure *cl, int v)
{
closure_put_after_sub(cl, atomic_sub_return(v, &cl->remaining));
}
EXPORT_SYMBOL_GPL(closure_sub);
void closure_put(struct closure *cl)
{
closure_put_after_sub(cl, atomic_dec_return(&cl->remaining));
}
EXPORT_SYMBOL_GPL(closure_put);
static void set_waiting(struct closure *cl, unsigned long f)
{
#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
cl->waiting_on = f;
#endif
}
void __closure_wake_up(struct closure_waitlist *wait_list)
{
struct llist_node *list;
struct closure *cl;
struct llist_node *reverse = NULL;
list = llist_del_all(&wait_list->list);
/* We first reverse the list to preserve FIFO ordering and fairness */
while (list) {
struct llist_node *t = list;
list = llist_next(list);
t->next = reverse;
reverse = t;
}
/* Then do the wakeups */
while (reverse) {
cl = container_of(reverse, struct closure, list);
reverse = llist_next(reverse);
set_waiting(cl, 0);
closure_sub(cl, CLOSURE_WAITING + 1);
}
}
EXPORT_SYMBOL_GPL(__closure_wake_up);
bool closure_wait(struct closure_waitlist *list, struct closure *cl)
{
if (atomic_read(&cl->remaining) & CLOSURE_WAITING)
return false;
set_waiting(cl, _RET_IP_);
atomic_add(CLOSURE_WAITING + 1, &cl->remaining);
llist_add(&cl->list, &list->list);
return true;
}
EXPORT_SYMBOL_GPL(closure_wait);
/**
 * closure_sync() - sleep until a closure has nothing left to wait on
*
* Sleeps until the refcount hits 1 - the thread that's running the closure owns
* the last refcount.
*/
void closure_sync(struct closure *cl)
{
while (1) {
__closure_start_sleep(cl);
closure_set_ret_ip(cl);
if ((atomic_read(&cl->remaining) &
CLOSURE_REMAINING_MASK) == 1)
break;
schedule();
}
__closure_end_sleep(cl);
}
EXPORT_SYMBOL_GPL(closure_sync);
/**
* closure_trylock() - try to acquire the closure, without waiting
 * @cl: closure to lock
 * @parent: parent closure to take a ref on, may be NULL
 *
 * Returns true if the closure was successfully locked.
*/
bool closure_trylock(struct closure *cl, struct closure *parent)
{
if (atomic_cmpxchg(&cl->remaining, -1,
CLOSURE_REMAINING_INITIALIZER) != -1)
return false;
closure_set_ret_ip(cl);
smp_mb();
cl->parent = parent;
if (parent)
closure_get(parent);
closure_debug_create(cl);
return true;
}
EXPORT_SYMBOL_GPL(closure_trylock);
void __closure_lock(struct closure *cl, struct closure *parent,
struct closure_waitlist *wait_list)
{
struct closure wait;
closure_init_stack(&wait);
while (1) {
if (closure_trylock(cl, parent))
return;
closure_wait_event_sync(wait_list, &wait,
atomic_read(&cl->remaining) == -1);
}
}
EXPORT_SYMBOL_GPL(__closure_lock);
static void closure_delay_timer_fn(unsigned long data)
{
struct closure *cl = (struct closure *) data;
closure_sub(cl, CLOSURE_TIMER + 1);
}
void do_closure_timer_init(struct closure *cl)
{
struct timer_list *timer = closure_timer(cl);
init_timer(timer);
timer->data = (unsigned long) cl;
timer->function = closure_delay_timer_fn;
}
EXPORT_SYMBOL_GPL(do_closure_timer_init);
bool __closure_delay(struct closure *cl, unsigned long delay,
struct timer_list *timer)
{
if (atomic_read(&cl->remaining) & CLOSURE_TIMER)
return false;
BUG_ON(timer_pending(timer));
timer->expires = jiffies + delay;
atomic_add(CLOSURE_TIMER + 1, &cl->remaining);
add_timer(timer);
return true;
}
EXPORT_SYMBOL_GPL(__closure_delay);
void __closure_flush(struct closure *cl, struct timer_list *timer)
{
if (del_timer(timer))
closure_sub(cl, CLOSURE_TIMER + 1);
}
EXPORT_SYMBOL_GPL(__closure_flush);
void __closure_flush_sync(struct closure *cl, struct timer_list *timer)
{
if (del_timer_sync(timer))
closure_sub(cl, CLOSURE_TIMER + 1);
}
EXPORT_SYMBOL_GPL(__closure_flush_sync);
#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
static LIST_HEAD(closure_list);
static DEFINE_SPINLOCK(closure_list_lock);
void closure_debug_create(struct closure *cl)
{
unsigned long flags;
BUG_ON(cl->magic == CLOSURE_MAGIC_ALIVE);
cl->magic = CLOSURE_MAGIC_ALIVE;
spin_lock_irqsave(&closure_list_lock, flags);
list_add(&cl->all, &closure_list);
spin_unlock_irqrestore(&closure_list_lock, flags);
}
EXPORT_SYMBOL_GPL(closure_debug_create);
void closure_debug_destroy(struct closure *cl)
{
unsigned long flags;
BUG_ON(cl->magic != CLOSURE_MAGIC_ALIVE);
cl->magic = CLOSURE_MAGIC_DEAD;
spin_lock_irqsave(&closure_list_lock, flags);
list_del(&cl->all);
spin_unlock_irqrestore(&closure_list_lock, flags);
}
EXPORT_SYMBOL_GPL(closure_debug_destroy);
static struct dentry *debug;
#define work_data_bits(work) ((unsigned long *)(&(work)->data))
static int debug_seq_show(struct seq_file *f, void *data)
{
struct closure *cl;
spin_lock_irq(&closure_list_lock);
list_for_each_entry(cl, &closure_list, all) {
int r = atomic_read(&cl->remaining);
seq_printf(f, "%p: %pF -> %pf p %p r %i ",
cl, (void *) cl->ip, cl->fn, cl->parent,
r & CLOSURE_REMAINING_MASK);
seq_printf(f, "%s%s%s%s%s%s\n",
test_bit(WORK_STRUCT_PENDING,
work_data_bits(&cl->work)) ? "Q" : "",
r & CLOSURE_RUNNING ? "R" : "",
r & CLOSURE_BLOCKING ? "B" : "",
r & CLOSURE_STACK ? "S" : "",
r & CLOSURE_SLEEPING ? "Sl" : "",
r & CLOSURE_TIMER ? "T" : "");
if (r & CLOSURE_WAITING)
seq_printf(f, " W %pF\n",
(void *) cl->waiting_on);
seq_printf(f, "\n");
}
spin_unlock_irq(&closure_list_lock);
return 0;
}
static int debug_seq_open(struct inode *inode, struct file *file)
{
return single_open(file, debug_seq_show, NULL);
}
static const struct file_operations debug_ops = {
.owner = THIS_MODULE,
.open = debug_seq_open,
.read = seq_read,
.release = single_release
};
int __init closure_debug_init(void)
{
debug = debugfs_create_file("closures", 0400, NULL, NULL, &debug_ops);
return 0;
}
module_init(closure_debug_init);
#endif
MODULE_AUTHOR("Kent Overstreet <koverstreet@google.com>");
MODULE_LICENSE("GPL");
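
closure_sub() above is described as "clearing flags with the same atomic op as a put", and closure_wait() / __closure_wake_up() use it by adding and subtracting CLOSURE_WAITING + 1. The following is a minimal user-space sketch of that idea using C11 atomics rather than the kernel's atomic_t. The DEMO_* bit positions and the demo_* names are hypothetical; the real layout lives in closure.h, which is not shown here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical layout: low bits hold the refcount, one high bit is a flag. */
#define DEMO_REMAINING_MASK	((1 << 20) - 1)
#define DEMO_WAITING		(1 << 21)

struct demo_closure {
	atomic_int remaining;
};

/* Set the flag and take a ref with a single atomic add. */
static bool demo_wait(struct demo_closure *cl)
{
	if (atomic_load(&cl->remaining) & DEMO_WAITING)
		return false;	/* already on a wait list */

	atomic_fetch_add(&cl->remaining, DEMO_WAITING + 1);
	return true;
}

/*
 * Clear the flag and drop the ref with a single atomic subtract, so no
 * other thread can observe one change without the other. Returns the new
 * value of the counter.
 */
static int demo_wake(struct demo_closure *cl)
{
	int old = atomic_fetch_sub(&cl->remaining, DEMO_WAITING + 1);

	return old - (DEMO_WAITING + 1);
}
```

Packing the flag and the count into one word is what lets the put path decide what to do from a single returned value, as closure_put_after_sub() does with its flags argument.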
#ifndef _BCACHE_DEBUG_H
#define _BCACHE_DEBUG_H
/* Btree/bkey debug printing */
#define KEYHACK_SIZE 80
struct keyprint_hack {
char s[KEYHACK_SIZE];
};
struct keyprint_hack bch_pkey(const struct bkey *k);
struct keyprint_hack bch_pbtree(const struct btree *b);
#define pkey(k) (&bch_pkey(k).s[0])
#define pbtree(b) (&bch_pbtree(b).s[0])
#ifdef CONFIG_BCACHE_EDEBUG
unsigned bch_count_data(struct btree *);
void bch_check_key_order_msg(struct btree *, struct bset *, const char *, ...);
void bch_check_keys(struct btree *, const char *, ...);
#define bch_check_key_order(b, i) \
bch_check_key_order_msg(b, i, "keys out of order")
#define EBUG_ON(cond) BUG_ON(cond)
#else /* EDEBUG */
#define bch_count_data(b) 0
#define bch_check_key_order(b, i) do {} while (0)
#define bch_check_key_order_msg(b, i, ...) do {} while (0)
#define bch_check_keys(b, ...) do {} while (0)
#define EBUG_ON(cond) do {} while (0)
#endif
#ifdef CONFIG_BCACHE_DEBUG
void bch_btree_verify(struct btree *, struct bset *);
void bch_data_verify(struct search *);
#else /* DEBUG */
static inline void bch_btree_verify(struct btree *b, struct bset *i) {}
static inline void bch_data_verify(struct search *s) {}
#endif
#ifdef CONFIG_DEBUG_FS
void bch_debug_init_cache_set(struct cache_set *);
#else
static inline void bch_debug_init_cache_set(struct cache_set *c) {}
#endif
#endif
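
The keyprint_hack struct and the pkey()/pbtree() macros above return a formatted string by value so it can be passed straight to a printf-style call without a caller-supplied buffer. The following is a minimal user-space sketch of the same pattern. The demo_* names and the key layout are hypothetical, and the sketch relies on C11 temporary lifetime (or the equivalent GCC behaviour): the returned struct only lives until the end of the enclosing full expression.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_SIZE 80

/* Wrapping the array in a struct lets a function return it by value. */
struct demo_print_hack {
	char s[DEMO_SIZE];
};

/* Hypothetical key type, for illustration only. */
struct demo_key {
	unsigned long long inode;
	unsigned long long offset;
};

static struct demo_print_hack demo_pkey(const struct demo_key *k)
{
	struct demo_print_hack r;

	snprintf(r.s, sizeof(r.s), "%llu:%llu", k->inode, k->offset);
	return r;
}

/*
 * Pointer into the temporary returned by demo_pkey(). It is valid only
 * within the full expression that contains the macro, so use it as a
 * direct argument and never store it.
 */
#define demo_pk(k) (&demo_pkey(k).s[0])
```

A typical use would be printf("key %s\n", demo_pk(&k)); saving the pointer in a variable and using it in a later statement would be a bug.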
#ifndef _BCACHE_JOURNAL_H
#define _BCACHE_JOURNAL_H
/*
* THE JOURNAL:
*
* The journal is treated as a circular buffer of buckets - a journal entry
 * never spans two buckets. This means we will be able to resize the journal at
 * runtime (not implemented yet); it will also be needed for bcache on raw flash
 * support.
*
* Journal entries contain a list of keys, ordered by the time they were
* inserted; thus journal replay just has to reinsert the keys.
*
* We also keep some things in the journal header that are logically part of the
* superblock - all the things that are frequently updated. This is for future
* bcache on raw flash support; the superblock (which will become another
* journal) can't be moved or wear leveled, so it contains just enough
* information to find the main journal, and the superblock only has to be
* rewritten when we want to move/wear level the main journal.
*
* Currently, we don't journal BTREE_REPLACE operations - this will hopefully be
* fixed eventually. This isn't a bug - BTREE_REPLACE is used for insertions
* from cache misses, which don't have to be journaled, and for writeback and
* moving gc we work around it by flushing the btree to disk before updating the
* gc information. But it is a potential issue with incremental garbage
* collection, and it's fragile.
*
* OPEN JOURNAL ENTRIES:
*
* Each journal entry contains, in the header, the sequence number of the last
* journal entry still open - i.e. that has keys that haven't been flushed to
* disk in the btree.
*
* We track this by maintaining a refcount for every open journal entry, in a
* fifo; each entry in the fifo corresponds to a particular journal
* entry/sequence number. When the refcount at the tail of the fifo goes to
* zero, we pop it off - thus, the size of the fifo tells us the number of open
 * journal entries.
*
* We take a refcount on a journal entry when we add some keys to a journal
* entry that we're going to insert (held by struct btree_op), and then when we
* insert those keys into the btree the btree write we're setting up takes a
* copy of that refcount (held by struct btree_write). That refcount is dropped
* when the btree write completes.
*
* A struct btree_write can only hold a refcount on a single journal entry, but
* might contain keys for many journal entries - we handle this by making sure
* it always has a refcount on the _oldest_ journal entry of all the journal
* entries it has keys for.
*
* JOURNAL RECLAIM:
*
* As mentioned previously, our fifo of refcounts tells us the number of open
* journal entries; from that and the current journal sequence number we compute
* last_seq - the oldest journal entry we still need. We write last_seq in each
* journal entry, and we also have to keep track of where it exists on disk so
* we don't overwrite it when we loop around the journal.
*
* To do that we track, for each journal bucket, the sequence number of the
* newest journal entry it contains - if we don't need that journal entry we
* don't need anything in that bucket anymore. From that we track the last
* journal bucket we still need; all this is tracked in struct journal_device
* and updated by journal_reclaim().
*
* JOURNAL FILLING UP:
*
* There are two ways the journal could fill up; either we could run out of
* space to write to, or we could have too many open journal entries and run out
* of room in the fifo of refcounts. Since those refcounts are decremented
* without any locking we can't safely resize that fifo, so we handle it the
* same way.
*
* If the journal fills up, we start flushing dirty btree nodes until we can
* allocate space for a journal write again - preferentially flushing btree
* nodes that are pinning the oldest journal entries first.
*/
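
The comment above describes a fifo of refcounts, one per open journal entry, from which last_seq is derived. The following is a minimal single-threaded user-space sketch of that bookkeeping, not the driver's implementation. All demo_* names and the fifo capacity are hypothetical, and the formula seq - used + 1 is an assumption that follows from modelling the newest entry as the head of the fifo.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PIN_SLOTS 8	/* hypothetical fifo capacity */

struct demo_journal {
	uint64_t seq;			/* newest entry's sequence number */
	unsigned pin[DEMO_PIN_SLOTS];	/* refcounts, one per open entry */
	unsigned tail;			/* slot of the oldest open entry */
	unsigned used;			/* number of open entries */
};

/* Oldest journal entry still needed. */
static uint64_t demo_last_seq(const struct demo_journal *j)
{
	return j->seq - j->used + 1;
}

/* Open a new entry holding one ref; fails with -1 when the fifo is full. */
static int demo_open_entry(struct demo_journal *j)
{
	if (j->used == DEMO_PIN_SLOTS)
		return -1;

	j->pin[(j->tail + j->used) % DEMO_PIN_SLOTS] = 1;
	j->used++;
	j->seq++;
	return 0;
}

/* Drop a ref on entry seq, then pop fully released entries off the tail. */
static void demo_put_entry(struct demo_journal *j, uint64_t seq)
{
	uint64_t last = demo_last_seq(j);
	unsigned idx;

	assert(j->used && seq >= last && seq <= j->seq);
	idx = (j->tail + (unsigned)(seq - last)) % DEMO_PIN_SLOTS;
	assert(j->pin[idx] > 0);
	j->pin[idx]--;

	/* Entries can only be reclaimed in order, starting from the oldest. */
	while (j->used && !j->pin[j->tail]) {
		j->tail = (j->tail + 1) % DEMO_PIN_SLOTS;
		j->used--;
	}
}
```

In this model, releasing a newer entry does not advance last_seq while an older entry is still pinned, which is why the comment describes flushing the btree nodes that pin the oldest journal entries first.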