Road to Akka: Fault tolerance by supervision

As developers we should strive for resilient, robust applications. The ability of software to keep handling requests when something has failed is one of the four tenets of the Reactive Manifesto. One way of reacting to failure is redundancy: duplicate software components so that there is no single point of failure anymore. When one component goes down due to failure, a duplicate can take over. Akka uses a different form of fault tolerance: supervision.

Supervision

Akka uses supervision to deal with errors. Supervision was introduced into the Actor Model by the Erlang language. In the actor model, supervision basically means that one actor (the supervisor) looks after another actor and reacts to the failure of the actors it supervises. There are two possible ways in Akka to supervise actors: the DeathWatch (sounds more like the title of a political thriller novel, right?) and parental supervision.

I already mentioned (see the post about creating actors) that actors are part of an actor system, which is in fact a hierarchical system. This means that every actor has a parent, except for the root actor. Every parent is the supervisor of its children, hence parental supervision. For clarity's sake, let’s recap the actor system.

The actor system recap

So what is this actor system? It is the infrastructure on which your actor hierarchy is built. It contains all the actors and all of the mailboxes. In Akka, the actor system is abstracted into a class with the same name (ActorSystem). Every actor has a parent, except for the root actor. The ActorSystem object is responsible for creating this actor. The image in the Akka documentation gives a good representation of what you get when you create a new ActorSystem instance.

As you can see in the picture, when starting up a new actor system instance, there already is a hierarchy in place. You have the root actor, which has 2 children: the user guardian and the system guardian. It is possible that some more actors live directly under the root actor, but those are special actors.

So obviously the root actor is the parent of all actors in the system. It is also the last actor to stop when the system is stopped.

The system guardian is the guardian of all system-created actors, such as logging actors.

The user guardian is the parent actor of all user-made actors. This means that when you as a developer create an actor using ‘actorSystem.actorOf’, you are creating a child of this actor. Actors created directly under the user guardian are called top-level user actors.

Back to supervising and how it works

So in Akka, when an actor encounters an error, its parent is ready to deal with it in order to keep the system going. But how does that actually work?

When an actor fails, it suspends itself and all of its children (since the actor system is a hierarchy, it is logical for the children of a failing actor to be suspended as well). The failing actor sends a message to its parent indicating a failure. This is all out-of-the-box Akka behaviour. You as a programmer have to do nothing except tell the parent how to deal with the failure.

There are four possible options or strategies a supervising actor can use to deal with failure:

Resume the child, which will cause it to take on the next message in the mailbox. The child will keep its internal state.
Restart the child, which will cause it to lose all of its internal state.
Stop the child permanently.
Escalate the failure to its own supervisor and fail itself by doing so.

It is really important to keep thinking of the actor system as a hierarchy. It makes a lot of sense when you consider the consequences of those strategies. Since the actor system is a hierarchy, when a child is told to resume, all of its children will be resumed as well. The same goes for restarting and stopping. When a supervisor tells a child to restart or to stop, all of its children will be restarted or stopped as well!

Also very interesting to know is that the communication between a child and its parent in the context of supervision goes over separate mailboxes instead of the usual message mailboxes. This matters because you cannot predict the order in which regular messages and supervision messages will be processed by a supervisor. The Akka documentation has more on message ordering.
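To make the cascade concrete, here is a toy model in plain Java. It is not Akka code: ToyActor, State and Directive are names made up for this sketch, and escalation is left out. It only illustrates how a decision taken for one actor reaches everything below it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these types are Akka classes.
class ToyActor {
    enum State { RUNNING, SUSPENDED, STOPPED }
    enum Directive { RESUME, RESTART, STOP }

    final String name;
    final List<ToyActor> children = new ArrayList<>();
    State state = State.RUNNING;
    int counter = 0; // stands in for the actor's internal state

    ToyActor(String name) {
        this.name = name;
    }

    // Creates a child and registers it under this actor.
    ToyActor child(String childName) {
        ToyActor c = new ToyActor(childName);
        children.add(c);
        return c;
    }

    // A failing actor suspends itself and everything below it.
    void suspend() {
        state = State.SUSPENDED;
        for (ToyActor c : children) {
            c.suspend();
        }
    }

    // The supervisor's decision is applied to the whole subtree.
    void apply(Directive d) {
        switch (d) {
            case RESUME:
                state = State.RUNNING; // internal state is kept
                break;
            case RESTART:
                state = State.RUNNING;
                counter = 0;           // internal state is lost
                break;
            case STOP:
                state = State.STOPPED;
                break;
        }
        for (ToyActor c : children) {
            c.apply(d);
        }
    }
}
```

In this model, calling suspend() on a parent and then apply(Directive.RESTART) resets the counter of the parent and of every descendant, while apply(Directive.RESUME) leaves all counters untouched.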

Cool! Now how do we code this?

Behold an example:

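import java.util.concurrent.TimeUnit;
import akka.actor.AbstractActor;
import akka.actor.OneForOneStrategy;
import akka.actor.SupervisorStrategy;
import akka.event.Logging;
import akka.event.LoggingAdapter;
import akka.japi.pf.DeciderBuilder;
import scala.PartialFunction;
import scala.concurrent.duration.Duration;

public class Supervisor extends AbstractActor {
    private final LoggingAdapter logging = Logging.getLogger(getContext().getSystem(), this);

    // a lot of code omitted for the sake of clarity

    @Override
    public SupervisorStrategy supervisorStrategy() {
        // Cases are tried from top to bottom; the first match decides the directive.
        PartialFunction<Throwable, SupervisorStrategy.Directive> decider = DeciderBuilder
                .match(SmallMistakeException.class, e -> SupervisorStrategy.resume())
                .match(NeedRestartException.class, e -> SupervisorStrategy.restart())
                .match(NeedsStopException.class, e -> SupervisorStrategy.stop())
                .match(EscalateException.class, e -> SupervisorStrategy.escalate())
                .build();

        // Allow at most 2 restarts per child within 1 minute.
        return new OneForOneStrategy(2, Duration.create(1.0, TimeUnit.MINUTES), decider);
    }
}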

Telling an actor how to deal with the failure of its children is done by overriding the supervisorStrategy() method. As you can see, you build a PartialFunction which defines how to act under certain conditions. For every type of exception or error, you can define a strategy. In this example, I used very specific matching, but you don’t have to. Instead of matching on a NeedRestartException, I could match on Exception instead.

Beware though: just like with the receive function, the matching is done from top to bottom. So if I match on Exception first and then define a strategy for a subtype of Exception, the strategy for the subtype will never be executed!
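The ordering pitfall is easy to reproduce without Akka. The sketch below is a plain-Java stand-in, not the Akka DeciderBuilder: FirstMatchDecider and its string directives are hypothetical names for illustration, and falling back to "escalate" for unmatched failures is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of first-match-wins; not the Akka DeciderBuilder.
class FirstMatchDecider {
    // LinkedHashMap keeps the cases in the order they were added.
    private final Map<Class<? extends Throwable>, String> cases = new LinkedHashMap<>();

    // Registers a case; returns this so calls can be chained.
    FirstMatchDecider match(Class<? extends Throwable> type, String directive) {
        cases.put(type, directive);
        return this;
    }

    // Walks the cases from top to bottom and returns the first one that fits.
    String decide(Throwable failure) {
        for (Map.Entry<Class<? extends Throwable>, String> c : cases.entrySet()) {
            if (c.getKey().isInstance(failure)) {
                return c.getValue();
            }
        }
        return "escalate"; // assumption for this sketch: unmatched failures escalate
    }
}
```

With new FirstMatchDecider().match(Exception.class, "restart").match(IllegalStateException.class, "resume"), deciding on an IllegalStateException yields "restart", because the broad case sits on top and shadows the specific one. Swapping the two match calls makes the same failure yield "resume".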

The partial function is then given to the constructor of a OneForOneStrategy object. A one-for-one strategy means that the chosen action is applied to each failing child actor separately. There is also an AllForOneStrategy object. This should only be used for actors that work tightly together, because an AllForOneStrategy will be applied to all of the children when just one of them fails.

The other two parameters of the OneForOneStrategy (in this case 2 and a Duration object) are the maximum number of retries and the time range in which they are counted. As far as I understand from the Akka documentation, this only applies to restarts. In this case a child may be restarted at most 2 times within 1 minute; once that limit is exceeded, the child is stopped instead.
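How a retry limit over a time range plays out can be sketched in a few lines of plain Java. This is a simplified model written for this post, not Akka's implementation: RestartBudget and requestRestart are made-up names, and the window handling (a fixed window that reopens once it has elapsed) is an assumption of the sketch.

```java
// Simplified model of "at most N restarts within a time window"; not Akka's implementation.
class RestartBudget {
    private final int maxRetries;
    private final long windowMillis;
    private long windowStart = -1;   // -1 means no window has been opened yet
    private int restartsInWindow = 0;

    RestartBudget(int maxRetries, long windowMillis) {
        this.maxRetries = maxRetries;
        this.windowMillis = windowMillis;
    }

    // Returns true if a restart is still allowed at nowMillis,
    // false if the limit is exceeded and the child should be stopped.
    boolean requestRestart(long nowMillis) {
        if (windowStart < 0 || nowMillis - windowStart > windowMillis) {
            windowStart = nowMillis; // open a fresh window
            restartsInWindow = 0;
        }
        restartsInWindow++;
        return restartsInWindow <= maxRetries;
    }
}
```

With new RestartBudget(2, 60_000), failures at 0, 10 and 20 seconds are answered with true, true and false: the third failure inside the same minute exhausts the budget. If the third failure arrives after 70 seconds instead, the window has elapsed and the restart is allowed again.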

By default (so when not overriding the supervision strategy) the following strategies are implemented:

Exception is thrown: restart
Error is thrown: escalate
Exception during initialization of an Actor: stop

There is one exception to this default behaviour and that is when an ActorKilledException has been thrown. When an actor has been killed, its supervisor receives this exception and the strategy to deal with that is to stop the actor (which makes sense).

Guardian actors and supervision

What if a top level actor is failing, how will the user guardian respond? And what about the root actor and the system guardian? Those are interesting questions. Taken from the akka documentation:

The root actor will stop all of its children for every type of Exception. All other throwables will be escalated. Wait… what?? Escalated to what or who? The root actor escalates to a bubble-walker. This is a synthetic ActorRef that lives outside of the ActorSystem which will stop the root actor in case of a throwable.

The system guardian uses the restart strategy for all of its top-level system actors when an Exception is thrown. All other throwables are escalated, which will result in a full system shutdown. The exceptions to the rule are ActorInitializationException and ActorKilledException. In those cases, the child actor is stopped.

The user guardian can have a configured supervision strategy (through the akka.actor.guardian-supervisor-strategy setting). It does not state so in the documentation, but I think that the default strategies are the same as for any other user-created actor: restart for exceptions and escalate for other throwables. The same story applies here: when a failure is escalated to the root, this will cause the system to shut down.

More code examples

More code examples can be found here on my GitHub. Have fun playing!