Shepherd × Goblins update
Juli Sims & David Thompson
The Shepherd is an init system and process manager initially built for GNU Hurd and now used by Guix. It can run with either root or user privileges to launch daemons, execute tasks, and manage processes. As we’ve discussed previously, Spritely has been working to port the Shepherd to Goblins. We’ve been a bit quiet since that announcement, so what’s the buzz?
First, as a quick refresher, the Shepherd is a great project to port to Goblins because it’s already built on the actor model. By switching to Goblins, we can bring the following benefits to the project:
- Streamline the codebase by replacing the Shepherd’s ad-hoc actor model implementation.
- Reduce the likelihood of concurrency bugs caused by the existing actor model implementation that exposes too much of its CSP foundation using Fibers.
- Transform services (and other actors) into object capabilities for fine-grained management of privileges, which will (eventually) make it possible to unify the currently separate worlds of “system” Shepherds that run as root under PID 1 and “user” Shepherds that run as an unprivileged user.
- Enable Shepherd to use the Object Capability Network (OCapN) to open the door for distributed networks of “Communicating Shepherd Processes” in the future.
Since our last post, we’ve done the following:
- Wrote Goblins versions of the core actors like the service controller, service registry, and process monitor.
- Added unit tests for all of the core actors (there were none before).
- Rewrote the public API as a compatibility layer on top of a new Goblins actor API. This new API is private for now.
What this means is that all of the extant Shepherd functionality will soon be available in the Goblins port. We’re currently working out the remaining sneaky, tricky, subtle bugs in order to have a full 1:1 port that passes the existing test suite. We’re getting very close, so we felt it was time to share this update!
What we’ve been up to
To better explain the work we've been doing, we need to discuss some Shepherd internals, particularly how actors work. Shepherd includes an ad-hoc actor model that differs from Goblins in notable ways. Each Shepherd actor is implemented as an event loop running in its own fiber (a lightweight thread), and actors send messages to each other over channels. Goblins is also built on Fibers but mostly hides this behind an abstraction barrier. Rather than each actor managing its own event loop, many Goblins actors share an event loop known as a vat. Each Goblins actor is mapped to its current “behavior”, a procedure that is called when the actor receives a message. The differences between these two actor model implementations mean that porting an actor from one to the other isn’t as straightforward as it might seem.
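To make the Goblins side of that comparison concrete, here is a minimal sketch of an actor whose state lives entirely in its current behavior. It is not taken from the Shepherd port: ^counter and the variable names are made up for illustration, and it assumes a recent Guile Goblins where spawn-vat and with-vat are available.

```scheme
(use-modules (goblins))

;; A vat is the shared event loop; actors spawned in it take turns there.
(define vat (spawn-vat))

;; ^counter is a constructor: it returns the actor's initial behavior.
;; bcom ("become") installs a new behavior for future messages.
(define (^counter bcom count)
  (lambda ()
    ;; The next behavior is a counter holding count + 1;
    ;; the second argument to bcom is the reply to this message.
    (bcom (^counter bcom (+ count 1)) count)))

(define counter (with-vat vat (spawn ^counter 0)))

;; $ is a synchronous call, allowed between actors in the same vat.
(define reply-1 (with-vat vat ($ counter)))   ; 0
(define reply-2 (with-vat vat ($ counter)))   ; 1
```

There is no explicit loop and no channel in sight: the vat delivers each message, and the actor "changes state" by becoming a new procedure. A Shepherd-style actor would instead loop on a channel inside its own fiber.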
The central unit of abstraction is the service, which represents something managed by the Shepherd. This could be an external process, a one-off task (known as a “one-shot” service), a timer, or whatever the user wants it to be. Internally, services are represented by a record holding immutable configuration information and a service controller actor which manages the running state. The service controller (named ^service in the code) was the first target of our porting efforts as it helped elucidate the shape of the new architecture.
Going hand-in-hand with the service controller actor is the service registry (^service-registry). The service registry is responsible for mapping the names of services to their associated service controller. Porting the service registry was one of the simplest parts of the project; most of the original actor logic was copied over with minimal changes.
The Shepherd makes heavy use of dynamically scoped variables known as parameters to pass around shared state like the current registry, the current service, the current client socket, etc. This created an issue for us, however, as vats introduce a continuity barrier for parameters. In the existing actor system, actors inherit the dynamic environment in which they are spawned because each actor is a new fiber spun off the current fiber. In Goblins, actors are spawned within a vat’s event loop, which has an entirely separate dynamic state from the caller. Furthermore, Goblins discourages the use of parameters because they are inherently ambient and thus not capability-safe.
Removing these parameters would be a backwards-incompatible change, so instead we capture the current state of relevant parameters in the compatibility layer before passing those values off to Goblins actors. The most obvious use of this technique is for I/O handling. The Shepherd uses a custom soft port for logging to standard output, a client socket, and/or the system log. A small actor tentatively named ^writer handles these concerns now.
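The capture technique can be sketched in plain Guile, without Goblins. Everything named here (current-log-prefix, make-writer, spawn-writer/compat) is hypothetical and only stands in for the real parameters and actors. The point is that the compatibility layer reads the parameter while still in the caller's dynamic extent, then hands a plain value to whatever runs on the other side of the barrier.

```scheme
;; Hypothetical parameter standing in for the Shepherd's real ones.
(define current-log-prefix (make-parameter "default"))

;; Stand-in for an actor: it receives the value explicitly when created
;; and never consults the parameter itself.
(define (make-writer prefix)
  (lambda (line)
    (string-append prefix ": " line)))

;; Compatibility-layer entry point: read the parameter in the caller's
;; dynamic extent, then pass the plain value along.
(define (spawn-writer/compat)
  (make-writer (current-log-prefix)))

(define writer
  (parameterize ((current-log-prefix "emacs"))
    (spawn-writer/compat)))

;; Called outside the parameterize form, the writer still uses the
;; value that was captured at creation time.
(writer "Starting Emacs daemon.")   ; "emacs: Starting Emacs daemon."
```

Because the value travels as an ordinary argument, it no longer matters which dynamic environment the receiving code runs in.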
The most significant change is the introduction of a coordinating actor called simply ^shepherd. This actor is where we pushed all of the logic related to starting, stopping, respawning, etc. Procedures such as start-service and stop-service are now thin wrappers around calls to this actor.
Related to process orchestration, the Shepherd has a process monitor actor, whose job is to watch for the termination of processes associated with services and notify other actors about it. This was also relatively simple to port to a Goblins ^process-monitor actor once all of the shared state was properly captured. Goblins promises somewhat simplified the logic involved in responding to these changing states.
Perhaps the trickiest part of the port was logging. Loggers read from an input port and write timestamped log lines to some destination, perhaps a file or the system log. Port I/O requires some understanding of how Fibers and Goblins interact. If you’re not careful, you can suspend a vat’s fiber, potentially stalling the program. Goblins provides the ^io actor to handle many common I/O needs safely, but the needs of Shepherd were beyond what that actor could provide. A first attempt at porting this logic proved too buggy to rely upon, so we’ve recently reworked logging actors into something more robust (and more similar to the original logging actors, too).
Finally, we have made a variety of smaller changes so existing code plays nicely with new Goblins actors. For example, the Shepherd provides a collection of helpers for things like starting and stopping processes (make-forkexec-constructor, for example). The shift to Goblins required the introduction of Goblins message passing and promise handling to keep some of these working as expected. A lot of time has been spent devising ways to keep the public API the same so that existing user code will continue to function as expected, as if nothing has really changed. To support all of this work, we’ve introduced a few of our own helper procedures and macros, and we’ve modified some existing ones to be Goblins-friendly.
Whew, that’s a lot! It’s the culmination of over a year of work, so it can be difficult to take in. If you’d like to try, you can see the current state of the port in our WIP pull request on Codeberg!
Demo time
Okay, but does it work? We’re so glad you asked!
In addition to using Shepherd for its init system (PID 1), Guix provides helpful facilities for running user-level Shepherd daemons through the home-shepherd-service-type in guix home. This is the same kind of user Shepherd daemon mentioned before; Guix just provides a nice, declarative interface to configure and launch the daemon when defining a user's home-environment. We used this functionality to swap in our Goblins Shepherd and manage an Emacs background daemon with it. Here's an actual session running in the context of guix home container:
juli shepherd λ guix home container home-shepherd.scm
substitute: looking for substitutes on 'https://substitutes.nonguix.org'... 100.0%
substitute: looking for substitutes on 'https://bordeaux.guix.gnu.org'... 0.0%guix substitute: warning: bordeaux.guix.gnu.org: connection failed: Connection refused
substitute:
substitute: looking for substitutes on 'https://ci.guix.gnu.org'... 100.0%
The following derivations will be built:
/gnu/store/ixpviqjakf55j24ag523pdl5g9k8xld7-provenance.drv
/gnu/store/ynykwmi237h4jxrgdgkwqs4sgvf1h3cc-home.drv
substitute: looking for substitutes on 'https://bordeaux.guix.gnu.org'... 0.0%
building /gnu/store/ixpviqjakf55j24ag523pdl5g9k8xld7-provenance.drv...
building /gnu/store/ynykwmi237h4jxrgdgkwqs4sgvf1h3cc-home.drv...
WARNING: (guile-user): imported module (guix build utils) overrides core binding `delete'
WARNING: (guile-user): imported module (guix build utils) overrides core binding `delete'
Symlinking /home/juli/.bash_profile -> /gnu/store/9vidh7q8sp353rb1jnrndyif9wl2fjna-bash_profile... done
Symlinking /home/juli/.profile -> /gnu/store/jjvk66x9wwzxw38byk796y9b6kvi21b0-shell-profile... done
Symlinking /home/juli/.bashrc -> /gnu/store/mdp6zf77631kqr8cw26p4m3vvbr7vk01-bashrc... done
Symlinking /home/juli/.config/shepherd/init.scm -> /gnu/store/nag703p683l66s2adad719810xfrhx3w-shepherd.conf... done
Symlinking /home/juli/.config/fontconfig/fonts.conf -> /gnu/store/bqhrpq7na79bxm3sbpmnana10g6sc4d5-fonts.conf... done
done
Finished updating symlinks.
Comparing /gnu/store/non-existing-generation/profile/share/fonts and
/gnu/store/yyyn7zy4lx8z9qsb41imkbxb11wrrqqc-home/profile/share/fonts... done (same)
Evaluating on-change gexps.
On-change gexps evaluation finished.
juli@sordidus ~$ herd status
Started:
+ emacs
+ root
juli@sordidus ~$ herd status emacs
Status of emacs:
It is running since 19:40:05 (9 seconds ago).
Main PID: 40
Command: /gnu/store/b6f34g5rsz35z40fc0myimw9zgj654xj-emacs-no-x-30.1/bin/emacs --fg-daemon
It is enabled.
Provides: emacs
Will be respawned.
Recent messages (use '-n' to view more or less):
2025-09-05 19:40:06 Starting Emacs daemon.
juli@sordidus ~$ emacsclient -c
juli@sordidus ~$ herd stop emacs
juli@sordidus ~$ herd status emacs
Status of emacs:
It is stopped since 19:40:31 (2 seconds ago).
Process exited with code 15.
It is enabled.
Provides: emacs
Will be respawned.
juli@sordidus ~$ herd start emacs
Service emacs has been started.
juli@sordidus ~$ herd status emacs
Status of emacs:
It is running since 19:40:38 (2 seconds ago).
Main PID: 144
Command: /gnu/store/b6f34g5rsz35z40fc0myimw9zgj654xj-emacs-no-x-30.1/bin/emacs --fg-daemon
It is enabled.
Provides: emacs
Will be respawned.
Recent messages (use '-n' to view more or less):
2025-09-05 19:40:38 Starting Emacs daemon.
juli@sordidus ~$ herd restart emacs
Service emacs has been started.
juli@sordidus ~$ herd status emacs
Status of emacs:
It is running since 19:40:45 (2 seconds ago).
Main PID: 180
Command: /gnu/store/b6f34g5rsz35z40fc0myimw9zgj654xj-emacs-no-x-30.1/bin/emacs --fg-daemon
It is enabled.
Provides: emacs
Will be respawned.
Recent messages (use '-n' to view more or less):
2025-09-05 19:40:45 Starting Emacs daemon.
juli@sordidus ~$ ls -al $(command -v herd)
lrwxrwxrwx 1 65534 overflow 80 Jan 1 1970 /home/juli/.guix-home/profile/bin/herd -> /gnu/store/4l4b2qb91bq3djj9ldg66jx6p98hxvin-goblins-shepherd-1.0.99-git/bin/herd
As simple as this demo is, it demonstrates that the Goblins Shepherd can already handle its basic job, despite the work left to achieve parity with mainline. If you’d like to try this out for yourself, first ensure you have Guix installed and up-to-date, then run the following commands:
git clone https://codeberg.org/spritely/shepherd
cd shepherd
git checkout goblins-shepherd-guix-home-demo
guix home container home-shepherd.scm
You can also look on Codeberg to see the home config itself.
“That’s so exciting!” we hear you saying; “When will this be shipping in a Guix distribution near me? When can I use Goblins to boot my operating system?” As exciting as this is, we’re not quite ready for prime time, as we’ll explain below.
Remaining work
Before we deploy Shepherd onto our own systems, and especially before we try it in PID 1, we want to ensure that we reliably pass Shepherd’s existing suite of shell-based tests. Somewhere between 4 and 7 tests fail as of writing; some tests fail every time, others only intermittently, indicating there may be some subtle race conditions lurking.
One key component that remains unsupported is the system log, the lack of which accounts for a considerable chunk of the remaining test failures. Nonetheless, our branch is passing nearly all of the existing tests, which is great progress! Once the test suite issues have been sorted out, we’ll try using our Shepherd build on a real Guix system and see how stable it is over time.
There is also plenty of code to clean up. We've left all the original actor code in place during development to make rebasing on upstream less prone to conflicts, but the time has come to start removing it. There are also numerous refactors that can be done to improve the code style and readability.
Beyond a direct port, though, this work will empower the Shepherd with everything Goblins and OCapN have to offer. So how could those powers be used? Well, we've got some ideas!
Single-system unification
On Guix systems, where Shepherd serves as the init system in PID 1, it is common to run additional Shepherd instances for unprivileged users. One option is to use guix home, as mentioned above. These Shepherd instances are entirely separate from each other. Only users with access to the root user (likely via sudo) can interact with the system Shepherd; it’s all-or-nothing. It would be nice to be able to give unprivileged users access to a subset of the system services, following the principle of least authority. The object capability security model provided by Goblins makes this possible!
herd clients communicate with shepherd daemons using a custom protocol over a Unix domain socket. If the client were to be modified to speak the OCapN protocol instead, users of herd would only be able to interact with services for which they hold a capability. Consider a shared server: the system administrator could give capabilities to other users of the machine that grant access to just a subset of the available system services, and perhaps to only a subset of the available service actions. The fine-grained nature of object capabilities means that access can be scoped to the minimum necessary for each user to do what they need.
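The idea of handing out only a subset of service actions can be illustrated in miniature with plain Guile closures. This is a toy model under loud assumptions: make-toy-service and attenuate are invented for this sketch and are not part of Goblins or the Shepherd, and real capabilities would be Goblins actor references rather than bare procedures.

```scheme
(use-modules (ice-9 match))

;; Toy stand-in for a service controller: a closure over private state
;; that dispatches on an action symbol.
(define (make-toy-service)
  (let ((running? #f))
    (lambda (action)
      (match action
        ('start  (set! running? #t) 'started)
        ('stop   (set! running? #f) 'stopped)
        ('status (if running? 'running 'stopped))))))

;; Attenuating proxy: forwards only the listed actions to the target
;; and refuses everything else.
(define (attenuate target allowed-actions)
  (lambda (action)
    (if (memq action allowed-actions)
        (target action)
        (error "action not permitted by this capability:" action))))

;; The administrator holds the full-authority reference...
(define web-server (make-toy-service))
;; ...and hands an unprivileged user a status-only view of it.
(define web-server/status-only (attenuate web-server '(status)))

(web-server 'start)
(web-server/status-only 'status)   ; running
```

Whoever holds web-server/status-only can observe the service but has no way to reach the stop action, because the only reference they were given never forwards it. No access-control list is consulted; the authority is exactly what the reference allows.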
As a first step in this direction, we’ve added a prerequisite component to Goblins, a Unix domain socket netlayer for OCapN. This is necessary to have interconnected machines running Shepherd communicate over OCapN at the system layer. Read on to see an example of what this might look like!
Fleet orchestration
Moving on from Shepherd on a single machine, an OCapN-enabled Shepherd will allow for orchestration of entire server fleets. To demonstrate, let’s walk through an example scenario. Carol, a DevOps engineer, is responsible for the web servers running on a small fleet of machines named A and B. Each machine is running Shepherd with a web-server service registered. To model this scenario on a single machine, we’ll use three Goblins vats:
(define a-vat (spawn-vat #:name "Server A"))
(define b-vat (spawn-vat #:name "Server B"))
(define c-vat (spawn-vat #:name "Carol"))
Then we’ll set up some loggers to distinguish which “machine” logged which line:
(define-actor (^prefix-logger bcom prefix)
  (lambda (str)
    (format #t "~a: ~a\n" prefix str)))
(define a-output (with-vat c-vat (spawn ^prefix-logger "A")))
(define b-output (with-vat c-vat (spawn ^prefix-logger "B")))
(define c-output (with-vat c-vat (spawn ^prefix-logger "C")))
Servers A and B have identical configuration with a web-server service that depends on the networking service:
(define (spawn-networking-service)
  (spawn ^service '(networking)
         #:start-handler (const #t)
         #:stop-handler (const #f)))
(define (spawn-web-server-service)
  (spawn ^service '(web-server)
         #:requirement '(networking)
         #:start-handler (const #t)
         #:stop-handler (const #f)))
(define a-registry (with-vat a-vat (spawn ^service-registry)))
(define a-shepherd (with-vat a-vat (spawn ^shepherd a-registry)))
(define a-networking (with-vat a-vat (spawn-networking-service)))
(define a-web-server (with-vat a-vat (spawn-web-server-service)))
(with-vat a-vat
  (all-of (<- a-shepherd 'register a-networking a-output)
          (<- a-shepherd 'register a-web-server a-output)))
(define b-registry (with-vat b-vat (spawn ^service-registry)))
(define b-shepherd (with-vat b-vat (spawn ^shepherd b-registry)))
(define b-networking (with-vat b-vat (spawn-networking-service)))
(define b-web-server (with-vat b-vat (spawn-web-server-service)))
(with-vat b-vat
  (all-of (<- b-shepherd 'register b-networking b-output)
          (<- b-shepherd 'register b-web-server b-output)))
Carol would like to issue a single command to start or stop all of the web servers. To do this, Carol first acquires references to the web-server service actors on each machine. At first glance this might seem to cause a name collision problem as both services have the same name, but fear not! Carol can assign locally meaningful names to these remote services in her local Shepherd. On Carol’s machine, the remote services are registered as web-server-a and web-server-b, respectively.
;; Naive, but enough for demo purposes.
(define-actor (^exported-service bcom writer shepherd service provision)
  (extend-methods service
    ((canonical-name) (car provision))
    ((provision) provision)
    ((requirement) '())
    (start
     (lambda args
       (let-on ((status (<- service 'status)))
         (match status
           ('stopped `(started ,(apply <- shepherd 'start service writer args)))
           ('starting `(starting ,(<- service 'running)))
           ('stopping `(stopping ,(<- service 'running)))
           ('running `(running ,(<- service 'running)))))))
    (stop
     (lambda args
       (apply <- shepherd 'stop service writer args)))))
(define c-registry (with-vat c-vat (spawn ^service-registry)))
(define c-shepherd (with-vat c-vat (spawn ^shepherd c-registry)))
(define c-web-server-a
  (with-vat c-vat
    (spawn ^exported-service a-output a-shepherd a-web-server '(web-server-a))))
(define c-web-server-b
  (with-vat c-vat
    (spawn ^exported-service b-output b-shepherd b-web-server '(web-server-b))))
These exported services are actually proxy objects that you can think of like micro-herd clients that only control a single service. In this simplistic example, Carol can only start or stop the exported services, but it would also be possible to allow other actions to be invoked.
To conveniently orchestrate all of the remote web-server services with a single command, Carol binds them together with her own local web-server-fleet service that depends on both web-server-a and web-server-b.
(define c-web-server-fleet
  (with-vat c-vat
    (spawn ^service '(web-server-fleet)
           #:requirement '(web-server-a web-server-b)
           #:start-handler (const #t)
           #:stop-handler (const #f))))
(with-vat c-vat
  (let-on ((_ (<- c-shepherd 'register c-web-server-a c-output))
           (_ (<- c-shepherd 'register c-web-server-b c-output))
           (_ (<- c-shepherd 'register c-web-server-fleet c-output)))
    (<- c-shepherd 'start c-web-server-fleet c-output)))
Now all Carol has to do is run herd start web-server-fleet (which we simulate above with the start method call) and her local Shepherd will report the success or failure of starting all the remote web servers in the fleet! Assembling the logs from all three machines, the event log would look something like this:
A: Service networking has been started.
B: Service networking has been started.
A: Service web-server has been started.
C: Service web-server-a has been started.
B: Service web-server has been started.
C: Service web-server-b has been started.
C: Service web-server-fleet has been started.
Neat, huh?
Guix deployment over OCapN
One final idea we’ll share is for a new Guix feature: a guix deploy agent. This would be a capability-safe take on the modern DevOps practice of deploying through dedicated agents instead of generic SSH. To make this work, there would be a guix-deploy Shepherd service that runs on the target machine with a special deploy action to start the deployment process. The workstation that is invoking guix deploy would receive a capability to that service, perhaps in sturdyref form, and associate it with a Guix machine declaration. That code might look something like this:
(define my-server
  (machine
   (operating-system my-os)
   (environment ocapn-environment-type)
   (configuration (machine-ocapn-configuration
                   (sturdyref "ocapn://pubkey.tcp-tls/s/swissnum?host=example.com&port=8888")
                   (system "x86_64-linux")))))
Any volunteers interested in building this?
Wrapping up
Porting Shepherd to Goblins has been a long time coming, but we’re starting to see encouraging results! If you’d like to discuss this blog post, help us make some of the ideas described above a reality, or talk about anything else Spritely-related, consider joining our community forum!