Introduction to Terracotta
5 Jan
I’ve been struggling to explain how
much of a leap forward Terracotta
is for Java programming. Of course, Terracotta is a cluster in the
fullest sense of the word, but "cluster" brings to mind
many painful years of parallel development. They required extensive
setup and a lot of code changes. Terracotta is different.
Terracotta works at the JVM level,
silently modifying the behavior of some classes you choose to share.
Suddenly you can use a ConcurrentHashMap across the cluster, without
code changes, in an efficient and transactional way. Now not only do
you have all of the benefits of a cluster environment, but you can
run your own code without change — or even a vendor’s product. They
have support for many common Open Source projects and you can even
run containers from Tomcat to WebLogic on your cluster. Or all at
once, if you prefer.
Terracotta is also sometimes described
as "network attached memory." That’s apt, but it doesn’t,
you know, sound cool enough. For many, it takes a bit of
experimentation with Terracotta until they realize: share memory and
do anything. It’s not a very
interesting program that doesn’t use memory. In fact, most programs
exist to do some computation on memory or data.
The
pain starts when you need to share that memory with another process
or system. There have been all sorts of methods. We had COBRA, but
that was a bummer because of the IDL files. We’ve had RMI and Mule
big enterprisey XML
and 10,000 other ways to simply move a piece of data. Many
programmers even threw up their hands and used the database as an
Inter-Process Communication (IPC) method, storing memory that had no
business in the database.
Those systems can
often be replaced by a simple LinkedBlockingQueue. If every node on
your cluster starts a thread that waits on the queue, then you can
simply put data in the queue for other nodes to retrieve. Checkout
the tim-messaging
(TIM: Terracotta Integration Module) project on the forge for a great
library that adds all of the goodies needed for a reliable solution.
Even better,
Terracotta lets you program as if you’re on a single JVM and deploy
to the cluster. Sharing memory in your cluster is no more difficult
than sharing in a single process. That means setting up and testing a
message bus or writing for the cluster can easily be done on a single
developer’s machine.
If that doesn’t
sound like a huge leap forward, consider that I once spent several
weeks trying to get my personal OpenMOSIX cluster to migrate
processes. I never did figure out what black magic they used, but
finally, after a week or two of hacking, I started seeing my jobs
move across the cluster. And for all my effort… it took twice as
long as the single process version.
Of course, that was
long before I’d even used Java and when I still thought C was the
only language I’d ever need. Never did I imagine declaring a
LinkedBlockingQueue<Runnable> and literally executing code on
remote nodes. Simply put: share memory, do anything.
Caching
The Achilles heel
of parallel and cluster programming is usually the data. If a job to
be executed across the cluster requires every node to run a SQL
statement, then any benefits to the cluster model are quickly lost.
The database will quickly become the bottleneck that no amount of
cluster optimizations will solve.
That’s why I
recommend the first step to Terracotta is to use it as a distributed
cache. After all, once your data is in the cluster, then acting on it
from several nodes becomes a real possibility. Instead of shipping
data around (and probably spending far too much time deciding which
node gets what chunk and sending it to them), the program can simply
act on the cache. Terracotta has already spent the time making that
as efficient as possible.
Terracotta
makes for a great distributed cache solution, too. It makes sure that
all of your transactions are ACID (Atomicity, Consistency, Isolation,
Durability), the same standard databases are held to, but it also
ensures programmers can use familiar Java language constructs to
manage memory. This makes it possible to rely less on or even consider
killing
the database altogether.
Unlike many
solutions, Terracotta doesn’t depend on the slow Java serialization
process to manage objects in the cache. It’s tightly integrated with
the JVM so it can do far smarter optimizations. For example, when a
cluster node requests a map entry from the cache, Terracotta doesn’t
have to ship the entire object tree. It only makes available the
entries as they’re needed. That alone saves metric tons of network
usage.
Terracotta has many
other great features as a cache. Their documentation is quite fond of
calling the cluster nodes the L1 layer and the cluster server the L2,
similar to how a CPU cache works, for a good reason. As memory fills
up on the cluster node, the client can flush it’s less frequently
used memory. The server is responsible for the consistency of the
data so the cluster client is not relied on for data integrity. The
server process also keeps track of it’s hot memory and can persist
it’s own data to disk as memory fills up on the server JVM and remove
old data from it’s heap. You can even configure checkpoints on the
disk storage for backup purposes.
All this means a
node can easily access a 2G Map instance with a smaller client heap.
Of course, performance won’t be as great as there will be overhead
depending on how the memory is used, but the most frequently
requested data on a JVM will be in main memory. Rather than memcached
or other solutions, getting a value is nearly as fast as using the
Java heap because it often is on the heap.
Terracotta’s
redundancy also allows programmers to depend on their cache. After
all, if the program is running, then the cache is working. This is a
far cry from many other solutions that discourage design patterns
like asynchronous database writes or even enumerating the cache’s
keys because the cache is not persistent. With the reliability of
Terracotta backing the cache, such a design is easy to envision.
There’s even an integration
module aimed at just such a use.
High Availability
All this effort
adding caching didn’t go to waste. While you weren’t looking your web
applications also gained some great High Availability (HA) features.
Terracotta’s engineers were probably bored after they solved
concurrent programming.
Terracotta Sessions
will store and manage session data, meaning moving from one server to
another doesn’t mean users must log into the application again.
Session state can be preserved and trusted. With the addition of a
nice load balancer, scaling is simply a matter of adding another node
to the cluster. Not only does the distributed cache ease the load on
the database layer, but scaling the application layer is completely
transparent to the end user.
Also, since session
data is managed by Terracotta, server admins can now bring down
cluster nodes for maintenance without disrupting the application. In
fact, you can even patch and upgrade Terracotta itself without
stopping the whole cluster.
For
the cluster itself, Terracotta can be configured
to withstand the loss of a server. Data integrity is ensured at all
levels.
Open Source
As if
creating a complete end-to-end solution for some of the most
difficult scaling problems in Java applications wasn’t enough, Terracotta is also
Open Source, has great documentation, and their developers respond
quickly to forum questions.
It is
a much easier sell to management and other developers when there’s no
up-front cost to experimenting. Of course, before moving to
production most organizations will require some kind of training and
support, which Terracotta also provides.
I
encourage you to download and play with the example applications.
It’ll provide a good taste of what’s possible. Terracotta is truly a
leap forward. With a little experimentation, you’ll quickly discover
simple solutions for whole classes of difficult problems.
Watch
this space for upcoming articles on Terracotta. The first is
tentatively titled, "Terracotta: a tale of two projects."