LRU Implementation in Production Code

LRU implementation in production code

Recently I implemented an LRU cache using a linked list spread over a hash map.

/// Typedef for URL/Entry pair
typedef std::pair< std::string, Entry > EntryPair;

/// Typedef for Cache list
typedef std::list< EntryPair > CacheList;

/// Typedef for URL-indexed map into the CacheList
typedef boost::unordered_map< std::string, CacheList::iterator > CacheMap;

/// Cache LRU list
CacheList mCacheList;

/// Cache map into the list
CacheMap mCacheMap;

It has the advantage of being O(1) for all important operations.

The insertion algorithm:

// create new entry
Entry iEntry( ... );

// push it to the front
mCacheList.push_front( std::make_pair( aURL, iEntry ) );

// add it to the cache map
mCacheMap[ aURL ] = mCacheList.begin();

// increase count of entries
mEntries++;

// check if it's time to remove the last element
if ( mEntries > mMaxEntries )
{
// erase the last cache list element from the map
mCacheMap.erase( mCacheList.back().first );

// erase it from the list
mCacheList.pop_back();

// decrease count
mEntries--;
}
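
The access path is not shown above; a sketch of it (my addition, not part of the original code, reusing the same member names) that keeps the list in LRU order could look like this:

// look up aURL; on a hit, move its node to the front of the LRU list
CacheMap::iterator it = mCacheMap.find( aURL );
if ( it != mCacheMap.end() )
{
    // splice() relinks the node in O(1) and does not invalidate the iterator,
    // so the entry stored in mCacheMap stays valid
    mCacheList.splice( mCacheList.begin(), mCacheList, it->second );

    // the hit entry is now at the front of the list
    Entry& iEntry = mCacheList.front().second;
    // ... use iEntry ...
}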

LRU cache design

A linked list + hash table of pointers to the linked list nodes is the usual way to implement LRU caches. This gives O(1) operations (assuming a decent hash). The advantage of everything being O(1) is that you can do a multithreaded version by just locking the whole structure; you don't have to worry about granular locking, etc.

Briefly, the way it works:

On an access of a value, you move the corresponding node in the linked list to the head.

When you need to remove a value from the cache, you remove from the tail end.

When you add a value to cache, you just place it at the head of the linked list.

Thanks to doublep, here is a site with a C++ implementation: Miscellaneous Container Templates.
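
For reference, a minimal self-contained sketch of the list + hash-map scheme described above (illustrative names, no thread safety or error handling) might look like this:

#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

template <typename Key, typename Value>
class LruCache
{
public:
    explicit LruCache( std::size_t capacity ) : mCapacity( capacity ) {}

    /// Insert or update a value and mark it most recently used
    void put( const Key& key, const Value& value )
    {
        typename Map::iterator hit = mMap.find( key );
        if ( hit != mMap.end() )                            // existing key: update and bump
        {
            hit->second->second = value;
            mList.splice( mList.begin(), mList, hit->second );
            return;
        }
        mList.push_front( std::make_pair( key, value ) );   // new entry goes to the head
        mMap[ key ] = mList.begin();
        if ( mMap.size() > mCapacity )                      // evict from the tail
        {
            mMap.erase( mList.back().first );
            mList.pop_back();
        }
    }

    /// Look up a value; a hit moves the entry to the head, a miss returns null
    const Value* get( const Key& key )
    {
        typename Map::iterator hit = mMap.find( key );
        if ( hit == mMap.end() )
            return nullptr;
        mList.splice( mList.begin(), mList, hit->second );
        return &hit->second->second;
    }

private:
    typedef std::list< std::pair< Key, Value > > List;
    typedef std::unordered_map< Key, typename List::iterator > Map;

    std::size_t mCapacity;
    List        mList;   // front = most recently used, back = least recently used
    Map         mMap;    // key -> node in mList
};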

Least Recently Used cache using C++

One major issue with LRU caches is that there are few "const" operations; most will change the underlying representation (if only because they bump the element accessed).

This is of course very inconvenient, because it means it's not a traditional STL container, and therefore any idea of exposing iterators is quite complicated: when the iterator is dereferenced, this is an access, which should modify the list we are iterating on... oh my.

And there are the performance considerations, both in terms of speed and memory consumption.

It is unfortunate, but you'll need some way to organize your data in a queue (LRU), with the possibility of removing elements from the middle, and this means your elements will have to be independent from one another. A std::list fits, of course, but it's more than you need. A singly-linked list is sufficient here, since you don't need to iterate the list backward (you just want a queue, after all).

However, one major drawback of those is their poor locality of reference; if you need more speed you'll need to provide your own custom (pool?) allocator for the nodes, so that they are kept as close together as possible. This will also alleviate heap fragmentation somewhat.
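
If Boost is already available, one low-effort option for this (my suggestion, not something the answer prescribes) is Boost.Pool's fast_pool_allocator as the list's allocator, so that nodes are carved out of larger contiguous blocks:

#include <boost/pool/pool_alloc.hpp>
#include <list>
#include <string>
#include <utility>

/// Illustrative node type; only the allocator parameter matters here
typedef std::pair< std::string, std::string > Node;

/// std::list whose nodes come from a Boost pool instead of individual heap
/// allocations, which keeps them closer together and reduces fragmentation
typedef std::list< Node, boost::fast_pool_allocator< Node > > PooledCacheList;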

Next, you obviously need an index structure (for the cache bit). The most natural choice is to turn toward a hash map. std::tr1::unordered_map, std::unordered_map or boost::unordered_map are normally good-quality implementations, and some should be available to you. They also allocate extra nodes for hash-collision handling; you might prefer other kinds of hash maps. Check out Wikipedia's article on the subject and read about the characteristics of the various implementation techniques.

Continuing, there is the (obvious) matter of threading support. If you don't need thread support, then it's fine; if you do, however, it's a bit more complicated:

  • As I said, there are few const operations on such a structure, thus you don't really need to differentiate read/write accesses
  • Internal locking is fine, but you might find that it doesn't play nice with your uses. The issue with internal locking is that it doesn't support the concept of a "transaction", since it relinquishes the lock between each call. If this is your case, transform your object into a mutex and provide a std::unique_ptr<Lock> lock() method (in debug, you can assert that the lock is taken at the entry point of each method); see the sketch after this list
  • There is (in locking strategies) the issue of reentrance, i.e. the ability to "relock" the mutex from within the same thread; check Boost.Thread for more information about the various locks and mutexes available
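
To make the external-locking bullet concrete, here is a rough sketch (my own illustration with made-up names, using std::mutex/std::unique_lock rather than the Boost types mentioned) of a cache that acts as its own mutex and hands out a lock object so the caller can group several calls into one transaction:

#include <mutex>
#include <string>

class LockableCache
{
public:
    /// The caller holds the returned lock for the whole "transaction",
    /// i.e. across several cache calls
    std::unique_lock< std::mutex > lock()
    {
        return std::unique_lock< std::mutex >( mMutex );
    }

    /// Operations assume the caller already holds the lock
    /// (a debug build could additionally assert on an ownership flag here)
    void touch( const std::string& key )
    {
        // ... bump 'key' to the front of the LRU list ...
    }

private:
    std::mutex mMutex;
    // ... list and map members omitted ...
};

void example( LockableCache& cache, const std::string& url )
{
    std::unique_lock< std::mutex > guard = cache.lock();   // transaction starts
    cache.touch( url );                                     // several calls may follow
}                                                           // lock released here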

Finally, there is the issue of error reporting. Since it is expected that a cache may not be able to retrieve the data you put in, I would consider using an exception in "poor taste". Consider either pointers (Value*) or Boost.Optional (boost::optional<Value&>). I would prefer Boost.Optional because its semantics are clear.
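
For illustration, the two suggested return styles might look like this (a sketch with made-up names, assuming string values):

#include <boost/optional.hpp>
#include <string>

class Cache
{
public:
    /// Option 1: return a pointer, null on a miss
    std::string* find( const std::string& key );

    /// Option 2: Boost.Optional states the "maybe absent" semantic explicitly
    boost::optional< std::string& > lookup( const std::string& key );
};

/// Caller side for option 2:
///     if ( boost::optional< std::string& > hit = cache.lookup( url ) )
///         use( *hit );
///     else
///         ; // miss: fetch the data and insert it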

SET action upon encountering miss in LRU cache java implementation

The term cache generally identifies a key-value pair map with a limited capacity. Attempting to add a new pair when the capacity has been reached will cause one of the existing key-value pairs to be evicted and the new one to be added. Usually, too, the addition of a pair happens implicitly as a side effect of trying to retrieve the value for a key that is not currently stored in the cache (i.e. the cache performs the eviction, loads the new pair, and returns the value).

LRU (Least Recently Used) is one of the possible policies to select the pair to be evicted, not the only one. A processor data cache is an example of a cache, implemented in hardware, where the key is obtained from an address (typically considering only a certain number of most significant bits) and the value is a portion of memory containing that address; in this case, the cache line contains a copy of the data residing in memory.
Another example, in OSes supporting virtual memory, is OS paging, implemented as a combination of hardware (the MMU) and software in the OS kernel. In this example, closer to what you describe, the value of the pair is a physical address for the memory page hosting the virtual page.

So when you ask "Also, in practical production code of LRU cache, does the cache contain the page numbers or actual values stored at those page numbers?", the answer may depend on what cache we're talking about. In your case, it's not clear what the cache is backed by. I'll assume it's an offset in some file. In that case, I would expect your value to be a byte array or string representing a portion of the file. An access to offset x in the file would be mapped to a key = offset/PAGE_SIZE; if the key is not present in the table, and if the table is at capacity, then the last node could be "recycled": replace the existing key, read the content of the file at offsets (key*PAGE_SIZE, key*PAGE_SIZE + PAGE_SIZE - 1) into the byte buffer, and finally move the page to the front.

Is there any LRU implementation of IDictionary?

There is nothing in the base class libraries that does this.

On the free side, maybe something like C5's HashedLinkedList would work.

If you're willing to pay, maybe check out this C# toolkit. It contains an implementation.

How is an LRU cache implemented in a CPU?

For a traditional cache with only two ways, a single bit per set can be used to track LRU. On any access to a set that hits, the bit can be set to indicate the way that did not hit (i.e., the now least recently used way).

For larger associativity, the number of states increases dramatically: factorial of the number of ways. So a 4-way cache would have 24 states, requiring 5 bits per set and an 8-way cache would have 40,320 states, requiring 16 bits per set. In addition to the storage overhead, there is also greater overhead in updating the value.

For a 4-way cache, the following encoding of the state would seem to work reasonably well: two bits for the most recently used way number, two bits for the next most recently used way number, and a bit indicating whether the higher- or lower-numbered of the two remaining ways was more recently used.

  • On a MRU hit, the state is unchanged.
  • On a next-MRU hit, the two two-bit fields are swapped.
  • On other hits, the numbers of the two other ways are decoded, the number of the way that hits is placed in the first two-bit portion and the former MRU way number is placed in the second two-bit portion. The final bit is set based on whether the next-MRU way number is higher or lower than the less recently used way that did not hit.
  • On a miss, the state is updated as if an LRU hit had occurred.

Because LRU tracking has such overhead, simpler mechanisms like binary tree pseudo-LRU are often used. On a hit, such a scheme just updates each branching part of the tree to record which half of the associated ways the hit was in. For a power-of-two number of ways W, a binary tree pLRU cache would have W-1 bits of state per set. A hit in way 6 of an 8-way cache (using a 3-level binary tree) would clear the bit at the base of the tree to indicate that the lower half of the ways (0,1,2,3) is less recently used, clear the higher bit at the next level to indicate that the lower half of those ways (4,5) is less recently used, and set the higher bit in the final level to indicate that the upper half of those ways (7) is less recently used. Not having to read this state in order to update it can simplify hardware.
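
As a concrete illustration of that walk-through, here is a software sketch (my own, with made-up names) of the 7-bit tree state for a single 8-way set, using the convention above that each bit records which half of its subtree is less recently used:

#include <bitset>
#include <iostream>

struct TreePlru8
{
    // 7 bits: [0] = root (ways 0-3 vs 4-7),
    //         [1] = subtree for ways 0-3, [2] = subtree for ways 4-7,
    //         [3] = pair (0,1), [4] = pair (2,3), [5] = pair (4,5), [6] = pair (6,7)
    // bit value 0 = lower-numbered half is less recently used, 1 = higher-numbered half
    std::bitset<7> bits;

    /// On a hit (or a fill), point every bit on the way's path at the other half
    void touch( unsigned way )
    {
        bits[0] = ( way < 4 );                      // the other half of the set becomes LRU
        if ( way < 4 )
            bits[1] = ( ( way & 2 ) == 0 );         // the other pair within ways 0-3
        else
            bits[2] = ( ( way & 2 ) == 0 );         // the other pair within ways 4-7
        bits[3 + way / 2] = ( ( way & 1 ) == 0 );   // the other way within the pair
    }

    /// On a miss, walk the tree toward whichever half is marked less recently used
    unsigned victim() const
    {
        unsigned way = bits[0] ? 4u : 0u;           // pick the LRU half of the set
        way += bits[ way < 4 ? 1 : 2 ] ? 2u : 0u;   // pick the LRU pair within that half
        way += bits[3 + way / 2] ? 1u : 0u;         // pick the LRU way within that pair
        return way;
    }
};

int main()
{
    TreePlru8 set;
    set.touch( 6 );                      // the example above: a hit in way 6
    std::cout << set.bits << "\n";       // resulting tree state
    std::cout << set.victim() << "\n";   // a subsequent miss would evict from ways 0-3
}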

For skewed associativity, where different ways use different hashing functions, something like an abbreviated time stamp has been proposed (e.g., "Analysis and Replacement for Skew-Associative Caches", Mark Brehob et al., 1997). Using a miss counter is more appropriate than a cycle count, but the basic idea is the same.

With respect to what happens when two cores try to write to the same cache line at the same time, this is handled by only allowing one L1 cache to have the cache line in the exclusive state at a given time. Effectively there is a race, and one core will get exclusive access. If only one of the writing cores already has the cache line in a shared state, it will probably be more likely to win the race. With the cache line in the shared state, a cache only needs to send an invalidation request to other potential holders of the cache line; with the cache line not present, a write would typically need to request the cache line of data as well as asking for exclusive state.

Writes by different cores to the same cache line (whether to the same specific address or, in the case of false sharing, to another address within the line of data) can result in "cache line ping pong", where different cores invalidate the cache line in other caches to get exclusive access (to perform a write) so that the cache line bounces around the system like a ping pong ball.
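
The performance effect of that ping-ponging is visible even from ordinary user code. The following is an illustrative micro-benchmark sketch (the names, iteration counts and 64-byte line-size assumption are mine): two threads increment counters that either share a cache line or are padded onto separate lines, and the shared-line version typically runs noticeably slower:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

struct SharedLine                 // both counters likely land in one cache line
{
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct SeparateLines              // padding keeps the counters on different lines
{
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
long long timed_run()
{
    Counters c;
    auto work = []( std::atomic<long>& n )
    {
        for ( long i = 0; i < 10000000; ++i )
            n.fetch_add( 1, std::memory_order_relaxed );
    };
    auto start = std::chrono::steady_clock::now();
    std::thread t1( work, std::ref( c.a ) );
    std::thread t2( work, std::ref( c.b ) );
    t1.join();
    t2.join();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - start ).count();
}

int main()
{
    std::cout << "same cache line:      " << timed_run<SharedLine>()    << " ms\n";
    std::cout << "separate cache lines: " << timed_run<SeparateLines>() << " ms\n";
}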

Easy, simple to use LRU cache in java

You can use a LinkedHashMap (Java 1.4+):

// Create cache
final int MAX_ENTRIES = 100;
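// The third constructor argument (accessOrder = true) switches LinkedHashMap
// from insertion order to access order, which is what provides the LRU behavior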
Map cache = new LinkedHashMap(MAX_ENTRIES+1, .75F, true) {
// This method is called just after a new entry has been added
public boolean removeEldestEntry(Map.Entry eldest) {
return size() > MAX_ENTRIES;
}
};

// Add to cache
Object key = "key";
cache.put(key, object);

// Get object
Object o = cache.get(key);
if (o == null && !cache.containsKey(key)) {
// Object not in cache. If null is not a possible value in the cache,
// the call to cache.containsKey(key) is not needed
}

// If the cache is to be used by multiple threads,
// the cache must be wrapped with code to synchronize the methods
cache = (Map)Collections.synchronizedMap(cache);

