Run Terracotta Jobs Across the Cluster
28 Apr
In my Introduction to Terracotta, I alluded to the possibility of running jobs across the cluster, but I didn’t really explain how. My first attempts used the tim-messaging package, but little in it is geared towards running the same job on all nodes. It mainly addresses dividing jobs among many workers (local or remote).
There are a few cases where running the same job on all nodes is useful. The first time I ran into one was when I was trying to pull some custom statistics from each node. Each node kept a moving average of requests per second for a specific user action I wanted to track. The other use case was notifying long-poll HTTP clients. A server push could be initiated from one node while the client was listening on another. Since Socket instances aren’t shareable, I needed a simple way to trigger the push on every node.
I cast around until I found the old tclib forge project that predated many of the current TIMs. The class below is a modified version of that code.
To solve the problem of knowing which nodes are in the cluster, I simply have them register() themselves at startup. A ServletContextListener works great for that. It also avoids the problem of jobs being submitted before a node is ready.
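Here is a minimal plain-Java sketch of that registration idea, not the actual class from this post. It assumes each node gets its own job queue when it calls register(), and that a broadcast simply adds the job to every registered queue. The names BroadcastJobs and submitToAll() are made up for illustration, and nothing here is clustered; under Terracotta the map would have to be a shared root and the jobs would have to be shareable objects.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: one queue per registered node, broadcast = add to all queues.
class BroadcastJobs {
    private final Map<String, BlockingQueue<Runnable>> queues = new ConcurrentHashMap<>();

    // Called once by each node at startup (e.g. from a ServletContextListener).
    BlockingQueue<Runnable> register(String nodeId) {
        return queues.computeIfAbsent(nodeId, id -> new LinkedBlockingQueue<>());
    }

    // Hands the same job to every node that has registered so far.
    // Returns how many nodes received it.
    int submitToAll(Runnable job) {
        int delivered = 0;
        for (BlockingQueue<Runnable> queue : queues.values()) {
            queue.add(job);
            delivered++;
        }
        return delivered;
    }

    // Demo: a job submitted before any node registers reaches nobody;
    // a job submitted after two nodes register runs once per node.
    static int demo() {
        BroadcastJobs jobs = new BroadcastJobs();
        AtomicInteger runs = new AtomicInteger();

        jobs.submitToAll(runs::incrementAndGet);   // no nodes yet, nothing queued

        BlockingQueue<Runnable> nodeA = jobs.register("node-a");
        BlockingQueue<Runnable> nodeB = jobs.register("node-b");
        jobs.submitToAll(runs::incrementAndGet);   // queued for both nodes

        // Each node would normally drain its own queue on its own thread.
        Runnable job;
        while ((job = nodeA.poll()) != null) job.run();
        while ((job = nodeB.poll()) != null) job.run();
        return runs.get();
    }
}
```

Because a node only shows up in the map once it has called register(), a job can never be queued for a node that hasn’t finished starting.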
Of course, then you have to worry about nodes that may suddenly disappear. You can’t count on them unregistering themselves, since they might crash. To fix that, I have each node hold a lock the entire time it’s running. One of the necessities that Terracotta must have had to address early on is cleaning up cluster-wide locks when a client exits. To test whether a node is still around, all the caller has to do is attempt to acquire that node’s lock. If the attempt succeeds, then that node has somehow exited its run loop.
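The lock trick can be sketched in plain Java roughly like this. It is a single-JVM stand-in with hypothetical names (NodeLiveness, liveNodes(), exit()), not the clustered class itself. One important assumption: a local ReentrantLock is not released when its owning thread dies, so the demo releases it explicitly in a finally block. With a Terracotta clustered lock, the server is what releases the lock when the client goes away.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: a node is "alive" for as long as it holds its own lock.
class NodeLiveness {
    private final Map<String, ReentrantLock> nodeLocks = new ConcurrentHashMap<>();

    // Called on the node's own long-running thread: take the lock and keep it.
    void register(String nodeId) {
        ReentrantLock lock = new ReentrantLock();
        lock.lock();
        nodeLocks.put(nodeId, lock);
    }

    // Simulates the node going away. Must run on the thread that registered.
    void exit(String nodeId) {
        nodeLocks.get(nodeId).unlock();
    }

    // Must be called from a thread other than the nodes' own threads,
    // because ReentrantLock lets its current owner re-acquire it.
    List<String> liveNodes() {
        List<String> live = new ArrayList<>();
        for (Map.Entry<String, ReentrantLock> entry : nodeLocks.entrySet()) {
            ReentrantLock lock = entry.getValue();
            if (lock.tryLock()) {
                // We got the lock, so its owner is no longer running: drop the node.
                lock.unlock();
                nodeLocks.remove(entry.getKey());
            } else {
                live.add(entry.getKey());
            }
        }
        return live;
    }

    // Demo: returns {live count while the node runs, live count after it exits}.
    static int[] demo() {
        NodeLiveness registry = new NodeLiveness();
        CountDownLatch registered = new CountDownLatch(1);
        CountDownLatch stop = new CountDownLatch(1);
        Thread node = new Thread(() -> {
            registry.register("node-a");
            registered.countDown();
            try {
                stop.await();                    // stands in for the node's run loop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                registry.exit("node-a");         // local stand-in for server-side cleanup
            }
        });
        node.start();
        try {
            registered.await();
            int whileRunning = registry.liveNodes().size();
            stop.countDown();
            node.join();
            int afterExit = registry.liveNodes().size();
            return new int[] {whileRunning, afterExit};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

The caller never blocks here: tryLock() returns immediately either way, so the cost of the check grows only with the number of registered nodes.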
In practice this works pretty well. It’s much simpler than the JMX solution, although it doesn’t try to resubmit jobs or offer some of the other neat features of tim-messaging. The tryLock() below was plenty fast for my purposes. Of course, it’s easy to envision adding a thread to clean up the queues instead of forcing the caller to wait for the tests. Whether that’s worth doing really depends on the size of your cluster.
Depending on its use, you might tweak the thread pool used for this. This code also depends on the tim-annotations project, but you can just as easily do the same configuration in tc-config.xml.
You might also want to look at the Quartz TIM at http://forge.terracotta.org/releases/projects/tim-quartz/
Awesome, I’ll check it out…
Hi Mike,
Nice post!
Regarding discovering which nodes are in the cluster, you can now also use the plain Java cluster API we introduced with Terracotta 3.0. It makes your life a lot easier when doing cluster-aware operations.
More info here:
http://www.terracotta.org/web/display/docs/Cluster+Events
Take care,
Geert
That’s awesome, thanks! That makes it a ton easier to do stuff like this.
Very minor point though: this way makes it possible to run clients on the cluster that don’t all have the same code. Of course, that’s not how most people use their clusters. :-)
Yeah, that’s actually one of my favorite unknown features of Terracotta … being able to have cluster state and cluster coordination without having to run the same applications on each node. As long as the root types are compatible, you can even name them when they’re in different classes and fields … it will work seamlessly. Very cool stuff.
The most common way to choose special behavior for nodes in a cluster is to have them all hit a CyclicBarrier on startup and then use the arrival order (returned from await()) to trigger specific behavior. For example, the node with 0 arrival order might become a special kind of node.
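A rough plain-Java sketch of that barrier trick, with threads standing in for nodes. The class and method names are made up, and in a real cluster the CyclicBarrier would have to be a shared object. One detail worth knowing from the java.util.concurrent docs: await() returns getParties() - 1 for the first thread to arrive and 0 for the last, so the "0" node is the one that arrives last.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: use the arrival index from await() to pick one special node.
class BarrierElection {

    // Starts `nodes` threads, has each wait at the barrier, and returns how many
    // of them saw arrival index 0 (there should be exactly one).
    static int electCoordinators(int nodes) {
        CyclicBarrier barrier = new CyclicBarrier(nodes);
        AtomicInteger coordinators = new AtomicInteger();
        Thread[] threads = new Thread[nodes];

        for (int i = 0; i < nodes; i++) {
            threads[i] = new Thread(() -> {
                try {
                    // Blocks until all nodes have arrived; 0 means "arrived last".
                    int arrivalIndex = barrier.await();
                    if (arrivalIndex == 0) {
                        coordinators.incrementAndGet();   // this node takes the special role
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (BrokenBarrierException e) {
                    // Another node was interrupted or timed out; no election result.
                }
            });
            threads[i].start();
        }

        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return coordinators.get();
    }
}
```

Calling BarrierElection.electCoordinators(4) starts four stand-in nodes, and exactly one of them ends up with the special role. Note that every node blocks until all the expected parties have shown up, so this only suits clusters whose size is known up front.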