What Is the Best Library for Java to Grid/Cluster-Enable Your Application

What is the best library for Java to grid/cluster-enable your application?

There are several:

Terracotta (open source, based on Mozilla Public License);
Oracle Coherence (formerly Tangosol Coherence; commercial; based on JSR 107, which was never adopted officially);
GigaSpaces (commercial; based on JavaSpaces API, part of Jini);
GridGain, which you mentioned (open source: LGPL);
memcached with a Java client library (open source: BSD License;
EHCache (open source: Apache Software License;
OSCache (open source: modified Apache License; and
no doubt several others.

Now I haven't used all of these but I've used or investigated the majority of them.

GridGain and GigaSpaces are more centred around grid computing than caching and (imho) best suited to compute grids than data grids (see this explanation of compute vs data grids). I find GigaSpaces to be a really interesting technology and it has several licensing options, including a free version and a free full version for startups.

Coherence and Terracotta try to treat caches as Maps, which is a fairly natural abstraction. I've used Coherence a lot and it's an excellent high-performance product but not cheap. Terracotta I'm less familiar with. The documentation for Coherence I find a bit lacking at times but it really is a powerful product.

OSCache I've primarily used as a means of reducing memory usage and fragmentation in Java Web applications as it has a fairly neat JSP tag. If you've ever looked at compiled JSPs, you'll see they do a lot of String concatenations. This tag allows you to effectively cache the results of a segment of JSP code and HTML into a single String, which can hugely improve performance in some cases.

EHCache is an easy caching solution that I've also used in Web applications. Never as a distributed cache though but it can do that. I tend to view it as a quick and dirty solution but that's perhaps my bias.

memcached is particularly prevelent in the PHP world (and used by such sites as Facebook). It's a really light and easy solution and has the advantage that it doesn't run in the same process and you'll have arguably better interoperability options with other technology stacks, if this is important to you.

What architecture? Distribute content building across a cluster

Honestly I would rethink your approach and I'll tell you why.

I've done a lot of work on distributed high-volume systems (financial transactions specifically) and your solution--if the volume is sufficiently high (and I'll assume it is or you wouldn't be contemplating a clustered solution; you can get an awful lot of power out of one off-the-shelf box these days)--then you will kill yourself with remote calls (ie calls for data from another node).

I will speak about Tangosol/Oracle Coherence here because it's what I've got the most experience with, although Terracotta will support some or most of these features and is free.

In Coherence terms what you have is a partitioned cache where if you have n nodes, each node possesses 1/n of the total data. Typically you have redundancy of at least one level and that redundancy is spread as evenly as possible so each of the other n-1 nodes possesses 1/n-1 of the backup nodes.

The idea in such a solution is to try and make sure as many of the cache hits as possible are local (to the same cluster node). Also with partitioned caches in particular, writes are relatively espensive (and get more expensive with the more backup nodes you have for each cache entry)--although write-behind caching can minimize this--and reads are fairly cheap (which is what you want out of your requirements).

So your solution is going to ensure that every cache hit will be to a remote node.

Also consider that generating content is undoubtedly much more expensive than serving it, which I'll assume is why you came up with this idea because then you can have more content generators than servers. It's the more tiered approach and one I'd characterize as horizontal slicing.

You will achieve much better scalability if you can vertically slice your application. By that I mean that each node is responsible for storing, generating and serving a subset of all the content. This effectively eliminates internode communication (excluding backups) and allows you to adjust the solution by simply giving each node a different sized subset of the content.

Ideally, whatever scheme you choose for partitioning your data should be reproducible by your Web server so it knows exactly which node to hit for the relevant data.

Now you might have other reasons for doing it the way you're proposing but I can only answer this in the context of available information.

I'll also point you to a summary of grid/cluster technologies for Java I wrote in response to another question.

What are the different approaches for Java EE session replication?

You might want to look at Hazelcast and their HTTP Session Clustering feature

Java - Distributed Programming, RMI?

You may want to check out Hazelcast also. Hazelcast is an open source transactional, distributed/partitioned implementation of queue, topic, map, set, list, lock and executor service. It is super easy to work with; just add hazelcast.jar into your classpath and start coding. Almost no configuration is required.

If you are interested in executing your Runnable, Callable tasks in a distributed fashion, then please check out Distributed Executor Service documentation at http://code.google.com/docreader/#p=hazelcast

Hazelcast is released under Apache license and enterprise grade support is also available.

What would you recommend for a large-scale Java data grid technology: Terracotta, GigaSpaces, Coherence, etc?

We had a 50 servers running a webservice application and all these servers were load balanced using bigIP. The requirement was to cache each user state so that subsequent states don't do the same processing again and get the data from previous state. This way the client of the webservice don't need to maintain state.

We used Terracotta to cache the states and never faced any performance issue. At peak times number of request application is getting is 100 per second.

What caching model/framework with Websphere Spring Hibernate Oracle?

Well the hibernate cache system does just that, I used ehCache effectively and easily with Hibernate (and the second level cache system).

Lucene could be an option too depending on the situation. Hibernate Search or Compass could help with that (although it might take some major work).

Replication using Terracotta could also be an option although I've never done it.

What are my options to store and query huge amounts of data where a lot of it is repeating?

I would look at a column oriented database. It would be great for this sort of application

What Is the Best Library for Java to Grid/Cluster-Enable Your Application