Shepherd × Goblins update

Juli Sims & David Thompson

The Shepherd is an init system and process manager initially built for GNU Hurd and now used by Guix. It can run with either root or user privileges to launch daemons, execute tasks, and manage processes. As we’ve discussed previously, Spritely has been working to port the Shepherd to Goblins. We’ve been a bit quiet since that announcement, so what’s the buzz?

First, as a quick refresher, the Shepherd is a great project to port to Goblins because it’s already built on the actor model. By switching to Goblins, we can bring the following benefits to the project:

  • Streamline the codebase by replacing the Shepherd’s ad-hoc actor model implementation.
  • Reduce the likelihood of concurrency bugs: the existing actor model implementation exposes too much of its CSP foundation, which is built on Fibers.
  • Transform services (and other actors) into object capabilities for fine-grained management of privileges, which will (eventually) make it possible to unify the currently separate worlds of “system” Shepherds that run as root under PID 1 and “user” Shepherds that run as an unprivileged user.
  • Enable Shepherd to use the Object Capability Network (OCapN) to open the door for distributed networks of “Communicating Shepherd Processes” in the future.

Since our last post, we have:

  • Written Goblins versions of the core actors like the service controller, service registry, and process monitor.
  • Added unit tests for all of the core actors (there were none before).
  • Rewritten the public API as a compatibility layer on top of a new Goblins actor API. This new API is private for now.

What this means is that all of the extant Shepherd functionality will soon be available in the Goblins port. We’re currently working out the remaining sneaky, tricky, subtle bugs in order to have a full 1:1 port that passes the existing test suite. We’re getting very close, so we felt it was time to share this update!

What we’ve been up to

To better explain the work we’ve been doing, we need to discuss some Shepherd internals, particularly how actors work. Shepherd includes an ad-hoc actor model that differs notably from Goblins. Each Shepherd actor is implemented as an event loop running in its own fiber (a lightweight thread), and actors send messages to each other over channels. Goblins is also built on Fibers, but it mostly hides this behind an abstraction barrier. Rather than each actor managing its own event loop, many Goblins actors share an event loop known as a vat. Each Goblins actor is mapped to its current “behavior”, a procedure that is called when the actor receives a message. The differences between these two actor model implementations mean that porting an actor from one to the other isn’t as straightforward as it might seem.
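To make the “shared event loop plus current behavior” idea concrete, here is a toy model in plain Guile. It is only a sketch: every name in it (toy-spawn, toy-send, toy-run, counter-behavior) is hypothetical, it uses neither Goblins nor Fibers, and it says nothing about how a real vat is implemented. It assumes a simplified convention in which a behavior returns the behavior to use for the next message.

```scheme
;; Toy model only: hypothetical names, not the Goblins implementation.

(define behaviors (make-hash-table)) ; actor id -> current behavior
(define queue '())                   ; pending (id . message) pairs, oldest first
(define reports '())                 ; values reported by actors, newest first

(define (toy-spawn id behavior)
  (hashq-set! behaviors id behavior)
  id)

(define (toy-send id message)
  (set! queue (append queue (list (cons id message)))))

;; One loop serves every actor: take the oldest message, call the
;; target's current behavior, and store the behavior it returns.
(define (toy-run)
  (unless (null? queue)
    (let* ((entry (car queue))
           (id (car entry))
           (behavior (hashq-ref behaviors id)))
      (set! queue (cdr queue))
      (hashq-set! behaviors id (behavior (cdr entry)))
      (toy-run))))

;; A counter whose state lives in the closure of its current behavior.
(define (counter-behavior n)
  (lambda (message)
    (case message
      ((increment) (counter-behavior (+ n 1)))
      ((report)
       (set! reports (cons n reports))
       (counter-behavior n))
      (else (counter-behavior n)))))

(toy-spawn 'a (counter-behavior 0))
(toy-spawn 'b (counter-behavior 10))
(toy-send 'a 'increment)
(toy-send 'b 'increment)
(toy-send 'a 'increment)
(toy-send 'a 'report)
(toy-send 'b 'report)
(toy-run)
```

Neither counter owns a loop of its own here; both are just entries in a table that a single loop consults. That is the structural difference from the fiber-per-actor design described above.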

The central unit of abstraction is the service, which represents something managed by the Shepherd. This could be an external process, a one-off task (known as a “one-shot” service), a timer, or whatever the user wants it to be. Internally, services are represented by a record holding immutable configuration information and a service controller actor which manages the running state. The service controller (named ^service in the code) was the first target of our porting efforts as it helped elucidate the shape of the new architecture.

Going hand-in-hand with the service controller actor is the service registry (^service-registry). The service registry is responsible for mapping the names of services to their associated service controller. Porting the service registry was one of the simplest parts of the project; most of the original actor logic was copied over with minimal changes.

The Shepherd makes heavy use of dynamically scoped variables known as parameters to pass around shared state like the current registry, the current service, the current client socket, etc. This created an issue for us, however, as vats introduce a continuity barrier for parameters. In the existing actor system, actors inherit the dynamic environment in which they are spawned because each actor is a new fiber spun off the current fiber. In Goblins, actors are spawned within a vat’s event loop which has an entirely separate dynamic state from the caller. Furthermore, Goblins discourages the use of parameters because they are inherently ambient and thus not capability-safe.

Removing these parameters would be a backwards-incompatible change, so instead we capture the current state of relevant parameters in the compatibility layer before passing those values off to Goblins actors. The most obvious use of this technique is for I/O handling. The Shepherd uses a custom soft port for logging to standard output, a client socket, and/or the system log. A little actor tentatively named ^writer handles these concerns now.
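The capture-at-the-boundary idea can be illustrated in plain Guile, without Goblins. In this sketch the parameter current-registry and both helper procedures are hypothetical stand-ins, not names from the Shepherd codebase. It contrasts a procedure that reads a parameter when it is eventually called with one that reads it up front and closes over the value.

```scheme
;; Hypothetical parameter standing in for the Shepherd's real ones.
(define current-registry (make-parameter 'default-registry))

;; Reads the parameter when the returned procedure is called.
(define (make-late-reader)
  (lambda () (current-registry)))

;; Reads the parameter immediately and closes over the value.
(define (make-capturing-reader)
  (let ((registry (current-registry)))
    (lambda () registry)))

(define late
  (parameterize ((current-registry 'my-registry))
    (make-late-reader)))

(define captured
  (parameterize ((current-registry 'my-registry))
    (make-capturing-reader)))

;; Both procedures now run outside the `parameterize' extent, much as an
;; actor's behavior runs outside the caller's dynamic environment.
(define late-result (late))         ; sees the default binding
(define captured-result (captured)) ; still sees the captured value
```

Here late-result is default-registry while captured-result is my-registry. Passing the captured value along as an ordinary argument is also what keeps it explicit rather than ambient.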

The most significant change is the introduction of a coordinating actor called simply ^shepherd. This actor is where we pushed all of the logic related to starting, stopping, respawning, etc. Procedures such as start-service and stop-service are now thin wrappers around calls to this actor.

Related to process orchestration, the Shepherd has a process monitor actor, whose job is to watch for the termination of processes associated with services and notify other actors about it. This was also relatively simple to port to a Goblins ^process-monitor actor once all of the shared state was properly captured. Goblins promises somewhat simplified the logic involved in responding to these changing states.

Perhaps the trickiest part of the port was logging. Loggers read from an input port and write timestamped log lines to some destination, perhaps a file or the system log. Port I/O requires some understanding of how Fibers and Goblins interact. If you’re not careful, you can suspend a vat’s fiber, potentially stalling the program. Goblins provides the ^io actor to handle many common I/O needs safely, but the needs of Shepherd were beyond what that actor could provide. A first attempt at porting this logic proved too buggy to rely upon, so we’ve recently reworked logging actors into something more robust (and more similar to the original logging actors, too).
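For a sense of what a logger’s core job looks like, here is a deliberately naive sketch in plain Guile. The procedure names timestamp and copy-log-lines are hypothetical, and the loop is a blocking one: it is exactly the kind of code that could stall a vat if run there unmodified, so treat it as an illustration of the task rather than of the port’s solution.

```scheme
(use-modules (ice-9 rdelim)) ; read-line

;; Current local time in the same shape as the log lines in `herd status'.
(define (timestamp)
  (strftime "%Y-%m-%d %H:%M:%S" (localtime (current-time))))

;; Copy each line from INPUT to OUTPUT with a timestamp prefix and
;; return the number of lines copied.  `read-line' blocks until a line
;; (or end of file) is available.
(define (copy-log-lines input output)
  (let loop ((line (read-line input))
             (count 0))
    (if (eof-object? line)
        count
        (begin
          (format output "~a ~a\n" (timestamp) line)
          (loop (read-line input) (+ count 1))))))

;; Usage with string ports standing in for a service's output and a log file.
(define demo-output
  (call-with-input-string "Starting Emacs daemon.\nReady.\n"
    (lambda (input)
      (call-with-output-string
        (lambda (output)
          (copy-log-lines input output))))))
```

With a real process on the other end, the read-line call may wait indefinitely, which is why the blocking read is the part that needs care when Fibers and a vat are involved.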

Finally, we have made a variety of smaller changes so existing code plays nicely with new Goblins actors. For example, the Shepherd provides a collection of helpers for things like starting and stopping processes (make-forkexec-constructor, for example). The shift to Goblins required the introduction of Goblins message passing and promise handling to keep some of these working as expected. A lot of time has been spent devising ways to keep the public API the same so that existing user code will continue to function as expected, as if nothing has really changed. To support all of this work, we’ve introduced a few of our own helper procedures and macros, and we’ve modified some existing ones to be Goblins-friendly.

Whew, that’s a lot! It’s the culmination of over a year of work, so it can be difficult to take in. If you’d like to try, you can see the current state of the port in our WIP pull request on Codeberg!

Demo time

Okay, but does it work? We’re so glad you asked!

In addition to using Shepherd for its init system (PID 1), Guix provides helpful facilities for running user-level Shepherd daemons through the home-shepherd-service-type in guix home. This is the same kind of user Shepherd daemon mentioned before; Guix just provides a nice, declarative interface to configure and launch the daemon when defining a user’s home-environment. We used this functionality to swap in our Goblins Shepherd and manage an Emacs background daemon with it. Here’s an actual session running inside a guix home container:

juli shepherd λ guix home container home-shepherd.scm
substitute: looking for substitutes on 'https://substitutes.nonguix.org'... 100.0%
substitute: looking for substitutes on 'https://bordeaux.guix.gnu.org'...   0.0%guix substitute: warning: bordeaux.guix.gnu.org: connection failed: Connection refused
substitute:
substitute: looking for substitutes on 'https://ci.guix.gnu.org'... 100.0%
The following derivations will be built:
  /gnu/store/ixpviqjakf55j24ag523pdl5g9k8xld7-provenance.drv
  /gnu/store/ynykwmi237h4jxrgdgkwqs4sgvf1h3cc-home.drv

substitute: looking for substitutes on 'https://bordeaux.guix.gnu.org'...   0.0%
building /gnu/store/ixpviqjakf55j24ag523pdl5g9k8xld7-provenance.drv...
building /gnu/store/ynykwmi237h4jxrgdgkwqs4sgvf1h3cc-home.drv...
WARNING: (guile-user): imported module (guix build utils) overrides core binding `delete'
WARNING: (guile-user): imported module (guix build utils) overrides core binding `delete'
Symlinking /home/juli/.bash_profile -> /gnu/store/9vidh7q8sp353rb1jnrndyif9wl2fjna-bash_profile... done
Symlinking /home/juli/.profile -> /gnu/store/jjvk66x9wwzxw38byk796y9b6kvi21b0-shell-profile... done
Symlinking /home/juli/.bashrc -> /gnu/store/mdp6zf77631kqr8cw26p4m3vvbr7vk01-bashrc... done
Symlinking /home/juli/.config/shepherd/init.scm -> /gnu/store/nag703p683l66s2adad719810xfrhx3w-shepherd.conf... done
Symlinking /home/juli/.config/fontconfig/fonts.conf -> /gnu/store/bqhrpq7na79bxm3sbpmnana10g6sc4d5-fonts.conf... done
 done
Finished updating symlinks.

Comparing /gnu/store/non-existing-generation/profile/share/fonts and
          /gnu/store/yyyn7zy4lx8z9qsb41imkbxb11wrrqqc-home/profile/share/fonts... done (same)
Evaluating on-change gexps.

On-change gexps evaluation finished.

juli@sordidus ~$ herd status
Started:
 + emacs
 + root
juli@sordidus ~$ herd status emacs
Status of emacs:
  It is running since 19:40:05 (9 seconds ago).
  Main PID: 40
  Command: /gnu/store/b6f34g5rsz35z40fc0myimw9zgj654xj-emacs-no-x-30.1/bin/emacs --fg-daemon
  It is enabled.
  Provides: emacs
  Will be respawned.

Recent messages (use '-n' to view more or less):
  2025-09-05 19:40:06 Starting Emacs daemon.
juli@sordidus ~$ emacsclient -c
juli@sordidus ~$ herd stop emacs
juli@sordidus ~$ herd status emacs
Status of emacs:
  It is stopped since 19:40:31 (2 seconds ago).
  Process exited with code 15.
  It is enabled.
  Provides: emacs
  Will be respawned.
juli@sordidus ~$ herd start emacs
Service emacs has been started.
juli@sordidus ~$ herd status emacs
Status of emacs:
  It is running since 19:40:38 (2 seconds ago).
  Main PID: 144
  Command: /gnu/store/b6f34g5rsz35z40fc0myimw9zgj654xj-emacs-no-x-30.1/bin/emacs --fg-daemon
  It is enabled.
  Provides: emacs
  Will be respawned.

Recent messages (use '-n' to view more or less):
  2025-09-05 19:40:38 Starting Emacs daemon.
juli@sordidus ~$ herd restart emacs
Service emacs has been started.
juli@sordidus ~$ herd status emacs
Status of emacs:
  It is running since 19:40:45 (2 seconds ago).
  Main PID: 180
  Command: /gnu/store/b6f34g5rsz35z40fc0myimw9zgj654xj-emacs-no-x-30.1/bin/emacs --fg-daemon
  It is enabled.
  Provides: emacs
  Will be respawned.

Recent messages (use '-n' to view more or less):
  2025-09-05 19:40:45 Starting Emacs daemon.
juli@sordidus ~$ ls -al $(command -v herd)
lrwxrwxrwx 1 65534 overflow 80 Jan  1  1970 /home/juli/.guix-home/profile/bin/herd -> /gnu/store/4l4b2qb91bq3djj9ldg66jx6p98hxvin-goblins-shepherd-1.0.99-git/bin/herd

As simple as this demo is, it demonstrates that the Goblins Shepherd can already handle its basic job, despite the work left to achieve parity with mainline. If you’d like to try this out for yourself, first ensure you have Guix installed and up to date, then run the following commands:

git clone https://codeberg.org/spritely/shepherd
cd shepherd
git checkout goblins-shepherd-guix-home-demo
guix home container home-shepherd.scm

You can also look on Codeberg to see the home config itself.

“That’s so exciting!” we hear you saying; “When will this be shipping in a Guix distribution near me? When can I use Goblins to boot my operating system?” As exciting as this is, we’re not quite ready for prime time, as we’ll explain below.

Remaining work

Before we deploy Shepherd onto our own systems, and especially before we try it in PID 1, we want to ensure that we reliably pass Shepherd’s existing suite of shell-based tests. Somewhere between 4 and 7 tests fail as of writing; some tests fail every time, others only intermittently, indicating there may be some subtle race conditions lurking.

One key feature we have yet to port is system log support, the lack of which accounts for a considerable chunk of the remaining test failures. Nonetheless, our branch is passing nearly all of the existing tests, which is great progress! Once the test suite issues have been sorted out, we’ll try using our Shepherd build on a real Guix system and see how stable it is over time.

There is also plenty of code to clean up. We've left all the original actor code in place during development to make rebasing on upstream less prone to conflicts, but the time has come to start removing it. There are also numerous refactors that can be done to improve the code style and readability.

Beyond a direct port, though, this work will empower the Shepherd with everything Goblins and OCapN have to offer. So how could those powers be used? Well, we've got some ideas!

Single-system unification

On Guix systems, where Shepherd serves as the init system in PID 1, it is common to run additional Shepherd instances for unprivileged users. One option is to use guix home, as mentioned above. These Shepherd instances are entirely separate from each other. Only users with access to the root user (likely via sudo) can interact with the system Shepherd; it’s all-or-nothing. It would be nice to be able to give unprivileged users access to a subset of the system services, following the principle of least authority. The object capability security model provided by Goblins makes this possible!

herd clients communicate with shepherd daemons using a custom protocol over a Unix domain socket. If the client were to be modified to speak the OCapN protocol instead, users of herd would only be able to interact with services for which they hold a capability. Consider a shared server: the system administrator could give capabilities to other users of the machine that grant access to just a subset of the available system services — and perhaps to only a subset of the available service actions. The fine-grained nature of object capabilities means that access can be scoped to the minimum necessary for each user to do what they need.
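The “subset of the available service actions” idea is the classic capability attenuation pattern, and it can be sketched with nothing more than closures in plain Guile. Everything here is hypothetical (make-toy-service, attenuate, the action names); it involves no Goblins, no OCapN, and no real Shepherd service, and only shows how holding a narrower reference limits what its holder can do.

```scheme
;; A toy service: a closure that dispatches on an action symbol.
(define (make-toy-service)
  (let ((status 'stopped))
    (lambda (action)
      (case action
        ((start) (set! status 'running) status)
        ((stop) (set! status 'stopped) status)
        ((status) status)
        (else (error "unknown action" action))))))

;; Wrap SERVICE so that only ALLOWED-ACTIONS get through.  Whoever holds
;; the wrapper has no way to reach the unrestricted service behind it.
(define (attenuate service allowed-actions)
  (lambda (action)
    (if (memq action allowed-actions)
        (service action)
        (error "action not permitted by this capability" action))))

;; The administrator keeps the full capability...
(define web-server (make-toy-service))
;; ...and hands another user a status-only view of the same service.
(define status-only (attenuate web-server '(status)))

(web-server 'start)
(define seen (status-only 'status)) ; the restricted view can still observe

;; Attempting to stop the service through the restricted view fails.
(define denied?
  (catch #t
    (lambda () (status-only 'stop) #f)
    (lambda args #t)))
```

After this runs, seen is running and denied? is #t, while the service itself is still running because the stop request never reached it.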

As a first step in this direction, we’ve added a prerequisite component to Goblins, a Unix domain socket netlayer for OCapN. This is necessary to have interconnected machines running Shepherd communicate over OCapN at the system layer. Read on to see an example of what this might look like!

Fleet orchestration

Moving on from Shepherd on a single machine, an OCapN-enabled Shepherd will allow for orchestration of entire server fleets. To demonstrate, let’s walk through an example scenario. Carol, a DevOps engineer, is responsible for the web servers running on a small fleet of machines named A and B. Each machine is running Shepherd with a web-server service registered. To model this scenario on a single machine, we’ll use three Goblins vats:

(define a-vat (spawn-vat #:name "Server A"))
(define b-vat (spawn-vat #:name "Server B"))
(define c-vat (spawn-vat #:name "Carol"))

Then we’ll set up some loggers to distinguish which “machine” logged which line:

(define-actor (^prefix-logger bcom prefix)
  (lambda (str)
    (format #t "~a: ~a\n" prefix str)))

(define a-output (with-vat c-vat (spawn ^prefix-logger "A")))
(define b-output (with-vat c-vat (spawn ^prefix-logger "B")))
(define c-output (with-vat c-vat (spawn ^prefix-logger "C")))

Servers A and B have identical configuration with a web-server service that depends on the networking service:

(define (spawn-networking-service)
  (spawn ^service '(networking)
         #:start-handler (const #t)
         #:stop-handler (const #f)))

(define (spawn-web-server-service)
  (spawn ^service '(web-server)
         #:requirement '(networking)
         #:start-handler (const #t)
         #:stop-handler (const #f)))

(define a-registry (with-vat a-vat (spawn ^service-registry)))
(define a-shepherd (with-vat a-vat (spawn ^shepherd a-registry)))
(define a-networking (with-vat a-vat (spawn-networking-service)))
(define a-web-server (with-vat a-vat (spawn-web-server-service)))
(with-vat a-vat
  (all-of (<- a-shepherd 'register a-networking a-output)
          (<- a-shepherd 'register a-web-server a-output)))

(define b-registry (with-vat b-vat (spawn ^service-registry)))
(define b-shepherd (with-vat b-vat (spawn ^shepherd b-registry)))
(define b-networking (with-vat b-vat (spawn-networking-service)))
(define b-web-server (with-vat b-vat (spawn-web-server-service)))
(with-vat b-vat
  (all-of (<- b-shepherd 'register b-networking b-output)
          (<- b-shepherd 'register b-web-server b-output)))

Carol would like to issue a single command to start or stop all of the web servers. To do this, Carol first acquires references to the web-server service actors on each machine. At first glance this might seem to cause a name collision problem as both services have the same name, but fear not! Carol can assign locally meaningful names to these remote services in her local Shepherd. On Carol’s machine, the remote services are registered as web-server-a and web-server-b, respectively.

;; Naive, but enough for demo purposes.
(define-actor (^exported-service bcom writer shepherd service provision)
  (extend-methods service
    ((canonical-name) (car provision))
    ((provision) provision)
    ((requirement) '())
    (start
     (lambda args
       (let-on ((status (<- service 'status)))
         (match status
           ('stopped  `(started ,(apply <- shepherd 'start service writer args)))
           ('starting `(starting ,(<- service 'running)))
           ('stopping `(stopping ,(<- service 'running)))
           ('running  `(running ,(<- service 'running)))))))
    (stop
     (lambda args
       (apply <- shepherd 'stop service writer args)))))

(define c-registry (with-vat c-vat (spawn ^service-registry)))
(define c-shepherd (with-vat c-vat (spawn ^shepherd c-registry)))
(define c-web-server-a
  (with-vat c-vat
    (spawn ^exported-service a-output a-shepherd a-web-server '(web-server-a))))
(define c-web-server-b
  (with-vat c-vat
    (spawn ^exported-service b-output b-shepherd b-web-server '(web-server-b))))

These exported services are actually proxy objects that you can think of like micro-herd clients that only control a single service. In this simplistic example, Carol can only start or stop the exported services, but it would also be possible to allow other actions to be invoked.

To conveniently orchestrate all of the remote web-server services with a single command, Carol binds them together with her own local web-server-fleet service that depends on both web-server-a and web-server-b.

(define c-web-server-fleet
  (with-vat c-vat
    (spawn ^service '(web-server-fleet)
           #:requirement '(web-server-a web-server-b)
           #:start-handler (const #t)
           #:stop-handler (const #f))))
(with-vat c-vat
  (let-on ((_ (<- c-shepherd 'register c-web-server-a c-output))
           (_ (<- c-shepherd 'register c-web-server-b c-output))
           (_ (<- c-shepherd 'register c-web-server-fleet c-output)))
    (<- c-shepherd 'start c-web-server-fleet c-output)))

Now all Carol has to do is run herd start web-server-fleet (which we simulate above with the start method call) and her local Shepherd will report the success or failure of starting all the remote web servers in the fleet! Assembling the logs from all three machines, the event log would look something like this:

A: Service networking has been started.
B: Service networking has been started.
A: Service web-server has been started.
C: Service web-server-a has been started.
B: Service web-server has been started.
C: Service web-server-b has been started.
C: Service web-server-fleet has been started.

Neat, huh?

Guix deployment over OCapN

One final idea we’ll share is for a new Guix feature: a guix deploy agent. This would be a capability-safe take on the modern DevOps practice of deploying through dedicated agents instead of generic SSH. To make this work, there would be a guix-deploy Shepherd service that runs on the target machine with a special deploy action to start the deployment process. The workstation that is invoking guix deploy would receive a capability to that service, perhaps in sturdyref form, and associate it with a Guix machine declaration. That code might look something like this:

(define my-server
  (machine
    (operating-system my-os)
    (environment ocapn-environment-type)
    (configuration (machine-ocapn-configuration
                    (sturdyref "ocapn://pubkey.tcp-tls/s/swissnum?host=example.com&port=8888")
                    (system "x86_64-linux")))))

Any volunteers interested in building this?

Wrapping up

Porting Shepherd to Goblins has been a long time coming, but we’re starting to see encouraging results! If you’d like to discuss this blog post, help us make some of the ideas described above a reality, or talk about anything else Spritely-related, consider joining our community forum!