umem  1.0.1

Portable libumem.
=================
This is a port of Solaris libumem to non-Solaris systems.
The port was made while integrating libumem with our Ecelerity MTA product, so
your initial experience will not be 100% out-of-the-box, because there is no
standalone configure script for the library at this time. (patches welcome!)
In addition, since our deployment is threaded, we force the library into
threaded mode.
While the library is itself stable (it's the memory allocator used by the
Solaris OS), the port may have a few rough edges. We're shipping umem with
Linux and Windows versions of our product as we have found it to be stable.
We will continue to update this project as and when we make improvements, and
welcome third-party patches that improve the usability for everyone.
Wez Furlong,
Message Systems, Inc.
wez (at) messagesystems (dot) com

Nuances
-------

There is a nuance in the behaviour of the umem port compared with umem on Solaris.

On Linux umem will not return memory back to the OS until umem fails to allocate a chunk. On failure, umem_reap() will be called automatically, to return memory to the OS. If your code is going to be running for a long time on Linux and mixes calls to different memory allocators (e.g.: malloc()) and umem, your code will need to call umem_reap() periodically.

This doesn't happen on Solaris, because malloc is replaced with umem calls, meaning that umem_reap() is called automatically.

References
----------

http://docs.sun.com/app/docs/doc/816-5173/6mbb8advq?a=view

http://access1.sun.com/techarticles/libumem.html

Overview
--------

based on usr/src/uts/common/os/kmem.c r1.64 from 2001/12/18
The slab allocator, as described in the following two papers:
Jeff Bonwick,
The Slab Allocator: An Object-Caching Kernel Memory Allocator.
Proceedings of the Summer 1994 Usenix Conference.
Available as /shared/sac/PSARC/1994/028/materials/kmem.pdf.
Jeff Bonwick and Jonathan Adams,
Magazines and vmem: Extending the Slab Allocator to Many CPUs and
Arbitrary Resources.
Proceedings of the 2001 Usenix Conference.
Available as /shared/sac/PSARC/2000/550/materials/vmem.pdf.
1. Overview
-----------
umem is very close to kmem in implementation. There are four major
areas of divergence:
* Initialization
* CPU handling
* umem_update()
* KM_SLEEP vs. UMEM_NOFAIL
2. Initialization
-----------------
kmem is initialized early on in boot, and knows that no one will call
into it before it is ready. umem does not have these luxuries. Instead,
initialization is divided into two phases:
* library initialization, and
* first use
umem's full initialization happens at the time of the first allocation
request (via malloc() and friends, umem_alloc(), or umem_zalloc()),
or the first call to umem_cache_create().
umem_free(), and umem_cache_alloc() do not require special handling,
since the only way to get valid arguments for them is to successfully
call a function from the first group.
2.1. Library Initialization: umem_startup()
-------------------------------------------
umem_startup() is libumem.so's .init section. It calls pthread_atfork()
to install the handlers necessary for umem's Fork1-Safety. Because of
race condition issues, all other pre-umem_init() initialization is done
statically (i.e. by the dynamic linker).
For standalone use, umem_startup() returns everything to its initial
state.
2.2. First use: umem_init()
------------------------------
The first time any memory allocation function is used, we have to
create the backing caches and vmem arenas which are needed for it.
umem_init() is the central point for that task. When it completes,
umem_ready is either UMEM_READY (all set) or UMEM_READY_INIT_FAILED (unable
to initialize, probably due to lack of memory).
There are four different paths from which umem_init() is called:
* from umem_alloc() or umem_zalloc(), with 0 < size < UMEM_MAXBUF,
* from umem_alloc() or umem_zalloc(), with size > UMEM_MAXBUF,
* from umem_cache_create(), and
* from memalign(), with align > UMEM_ALIGN.
The last three just check if umem is initialized, and call umem_init()
if it is not. For performance reasons, the first case is more complicated.
2.2.1. umem_alloc()/umem_zalloc(), with 0 < size < UMEM_MAXBUF
-----------------------------------------------------------------
In this case, umem_cache_alloc(&umem_null_cache, ...) is called.
There is special-case code which causes any allocation on
&umem_null_cache to fail by returning (NULL), regardless of the
flags argument.
So umem_cache_alloc() returns NULL, and umem_alloc()/umem_zalloc() call
umem_alloc_retry(). umem_alloc_retry() sees that the allocation
was against &umem_null_cache, and calls umem_init().
If initialization is successful, umem_alloc_retry() returns 1, which
causes umem_alloc()/umem_zalloc() to start over, which causes it to load
the (now valid) cache pointer from umem_alloc_table.
2.2.2. Dealing with race conditions
-----------------------------------
There are a couple race conditions resulting from the initialization
code that we have to guard against:
* In umem_cache_create(), there is a special UMC_INTERNAL cflag
that is passed for caches created during initialization. It
is illegal for a user to try to create a UMC_INTERNAL cache.
This allows initialization to proceed, but any other
umem_cache_create()s will block by calling umem_init().
* Since umem_null_cache has a 1-element cache_cpu, its cache_cpu_mask
is always zero. umem_cache_alloc uses cp->cache_cpu_mask to
mask the cpu number. This prevents a race between grabbing a
cache pointer out of umem_alloc_table and growing the cpu array.
3. CPU handling
---------------
kmem uses the CPU's sequence number to determine which "cpu cache" to
use for an allocation. Currently, there is no way to get the sequence
number in userspace.
umem keeps track of cpu information in umem_cpus, an array of umem_max_ncpus
umem_cpu_t structures. CURCPU() is a "hint" function, which we then mask
with either umem_cpu_mask or cp->cache_cpu_mask to find the actual "cpu" id.
The mechanics of this are all in the CPU(mask) macro.
Currently, umem uses _lwp_self() as its hint.
4. The update thread
--------------------
kmem uses a task queue, kmem_taskq, to do periodic maintenance on
every kmem cache. vmem has a periodic timeout for hash table resizing.
The kmem_taskq also provides a separate context for kmem_cache_reap()'s
to be done in, avoiding issues of the context of kmem_reap() callers.
Instead, umem has the concept of "updates", which are asynchronous requests
for work attached to single caches. All caches with pending work are
on a doubly linked list rooted at the umem_null_cache. All update state
is protected by the umem_update_lock mutex, and the umem_update_cv is used
for notification between threads.
4.1. Cache states with regards to updates
-----------------------------------------
A given cache is in one of three states:

Inactive        cache_uflags is zero, cache_u{next,prev} are NULL

Work Requested  cache_uflags is non-zero (but UMU_ACTIVE is not set),
                cache_u{next,prev} link the cache onto the global
                update list

Active          cache_uflags has UMU_ACTIVE set, cache_u{next,prev}
                are NULL, and either umem_update_thr or
                umem_st_update_thr are actively doing work on the
                cache.
An update can be added to any cache in any state -- if the cache is
Inactive, it transitions to being Work Requested. If the cache is
Active, the worker will notice the new update and act on it before
transitioning the cache to the Inactive state.
If a cache is in the Active state, UMU_NOTIFY can be set, which asks
the worker to broadcast the umem_update_cv when it has finished.
4.2. Update interface
---------------------
umem_add_update() adds an update to a particular cache.
umem_updateall() adds an update to all caches.
umem_remove_updates() returns a cache to the Inactive state.
umem_process_updates() processes all caches in the Work Requested state.
4.3. Reaping
------------
When umem_reap() is called (at the time of heap growth), it schedules
UMU_REAP updates on every cache. It then checks whether the update
thread exists (umem_update_thr != 0). If so, it broadcasts
the umem_update_cv to wake the update thread up, and returns.
If the update thread does not exist (umem_update_thr == 0), and the
program currently has multiple threads, umem_reap() attempts to create
a new update thread.
If the process is not multithreaded, or the creation fails, umem_reap()
calls umem_st_update() to do an inline update.
4.4. The update thread
----------------------
The update thread spends most of its time in cond_timedwait() on the
umem_update_cv. It wakes up under two conditions:
* The timedwait times out, in which case it needs to run a global
update, or
* someone cond_broadcast(3THR)s the umem_update_cv, in which case
it needs to check if there are any caches in the Work Requested
state.
When it is time for another global update, umem calls umem_cache_update()
on every cache, then calls vmem_update(), which tunes the vmem structures.
umem_cache_update() can request further work using umem_add_update().
After any work from the global update completes, the update timer is
reset to umem_reap_interval seconds in the future. This makes the
updates self-throttling.
Reaps are similarly self-throttling. After a UMU_REAP update has
been scheduled on all caches, umem_reap() sets a flag and wakes up the
update thread. The update thread notices the flag, and resets the
reap state.
4.5. Inline updates
-------------------
If the update thread is not running, umem_st_update() is used instead. It
immediately does a global update (as above), then calls
umem_process_updates() to process both the reaps that umem_reap() added and
any work generated by the global update. Afterwards, it resets the reap
state.
While the umem_st_update() is running, umem_st_update_thr holds the thread
id of the thread performing the update.
4.6. Updates and fork1()
------------------------
umem has fork1() pre- and post-handlers which lock up (and release) every
mutex in every cache. They also lock up the umem_update_lock. Since
fork1() only copies over a single lwp, other threads (including the update
thread) could have been actively using a cache in the parent. This
can lead to inconsistencies in the child process.
Because we locked all of the mutexes, the only possible inconsistencies are:
* a umem_cache_alloc() could leak its buffer.
* a caller of umem_depot_alloc() could leak a magazine, and all the
buffers contained in it.
* a cache could be in the Active update state. In the child, there
would be no thread actually working on it.
* a umem_hash_rescale() could leak the new hash table.
* a umem_magazine_resize() could be in progress.
* a umem_reap() could be in progress.
The memory leaks we can't do anything about. umem_release_child() resets
the update state and moves any caches in the Active state to the Work Requested
state. This might cause some updates to be re-run, but UMU_REAP and
UMU_HASH_RESCALE are effectively idempotent, and the worst that can
happen from umem_magazine_resize() is resizing the magazine twice in close
succession.
Much of the cleanup in umem_release_child() is skipped if
umem_st_update_thr == thr_self(). This is so that applications which call
fork1() from a cache callback do not break. Needless to say, any such
application is tremendously broken.
5. KM_SLEEP vs. UMEM_NOFAIL
---------------------------
Allocations against kmem and vmem have two basic modes: SLEEP and
NOSLEEP. A sleeping allocation will go to sleep (waiting for
more memory) instead of failing (returning NULL).
SLEEP allocations presume an extremely multithreaded model, with
a lot of allocation and deallocation activity. umem cannot presume
that its clients have any particular type of behavior. Instead,
it provides two types of allocations:
* UMEM_DEFAULT, equivalent to KM_NOSLEEP (i.e. return NULL on
failure)
* UMEM_NOFAIL, which, on failure, calls an optional callback
(registered with umem_nofail_callback()).
The callback is invoked with no locks held, and can do an arbitrary
amount of work. It then has a choice between:
* Returning UMEM_CALLBACK_RETRY, which will cause the allocation
to be restarted.
* Returning UMEM_CALLBACK_EXIT(status), which will cause exit(2)
to be invoked with status. If multiple threads attempt to do
this simultaneously, only one will call exit(2).
* Doing some kind of non-local exit (thr_exit(3thr), longjmp(3C),
etc.)
The default callback returns UMEM_CALLBACK_EXIT(255).
To have these callbacks without risk of state corruption (in the case of
a non-local exit), we have to ensure that the callbacks get invoked
close to the original allocation, with no inconsistent state or held
locks. The following steps are taken:
* All invocations of vmem are VM_NOSLEEP.
* All constructor callbacks (which can themselves do allocations)
are passed UMEM_DEFAULT as their required allocation argument. This
way, the constructor will fail, allowing the highest-level allocation
to invoke the nofail callback.
If a constructor callback _does_ do a UMEM_NOFAIL allocation, and
the nofail callback does a non-local exit, we will leak the
partially-constructed buffer.