Designing for Operator Control: Embedding a Network CLI in Your JVM-Based Web Service with Vert.x

This post briefly demos adding a diagnostic network interface to a JVM-based web service using Vert.x, a tool-kit for building reactive JVM applications.

From Dev to DevOps

When you move on from your new framework's tutorials to actually building a real web service, you begin to realize just how wide the gulf is between "cool toy" and "useful thing". You add features, optimizations, libraries—all in the service of building a piece of software which meets your product's functional needs.

Likewise, the service you developed for your laptop is often far simpler than the one you deploy and operate in production across tens, hundreds, or thousands of machines.

If it isn't, you should be asking yourself, "Why not?"

Software developers who have never operated a service at scale sometimes realize this late in the development process. Just as you evolve your service to meet functional needs, you must also evolve it to meet operational needs: configuration, telemetry, logging, and so on.

Today's focus: operator interfaces for running services, a bit of DevOps wizardry that often gets overlooked. Most frequently used for service diagnostics, they are private interfaces for inspecting, and sometimes modifying, the state of a running service.

These can take many shapes: a private UI, a REST endpoint, or even an embedded network REPL. For simplicity, in this post we'll create a telnet interface with a single custom command for inspecting a specific piece of state in the running system.

Know Thy Enemy

In the magical world of web service development, state is an essential enemy. Enemy, because the moment you introduce the notion of state management into your system, you've entered a thankless battle of managing correctness in the face of ever-increasing complexity.

"Consider a system to be made up of procedures, some of which are stateful and others which aren’t. We have already discussed the difficulties of understanding the bits which are stateful, but what we would hope is that the procedures which aren’t themselves stateful would be more easy to comprehend. Alas, this is largely not the case. If the procedure in question (which is itself stateless) makes use of any other procedure which is stateful — even indirectly — then all bets are off, our procedure becomes contaminated and we can only understand it in the context of state. If we try to do anything else we will again run the risk of all the classic state-derived problems discussed above. As has been said, the problem with state is that 'when you let the nose of the camel into the tent, the rest of him tends to follow'. As a result of all the above reasons it is our belief that the single biggest remaining cause of complexity in most contemporary large systems is state, and the more we can do to limit and manage state, the better." (Out of the Tar Pit)

But state is also essential, because a software system that isn't somewhere changing the state of the world is, well, useless—except perhaps for spinning your CPU into a frenzy to heat your room on those cold winter nights.

Why? Because Things Go Wrong at Scale

Most of the time, we solve the state management problem by simply creating stateless services. Of course, the state goes somewhere... but we push that work as far down to the authoritative store as possible.

The problem is, at some point the system you're working on is that authoritative store. It's writing bytes to disk, or orchestrating an in-progress job, or holding a lock on a sensor, or something.

And when things go wrong (and they will), you won't always be able to reproduce the issue elsewhere. Sometimes the conditions take days to trigger. Sometimes it takes a very specific combination of steps between actors that is more likely to happen in production.

Whatever the reason, sometimes the production box currently facing the issue is the only specimen available for you to diagnose. At that point, the only thing for you to do is remove the machine from the load balancer rotation and fire up those ops tools!

Network CLI Magic in Three Files

Here's an excerpt from a sample application showing how easy it is to add a telnet interface to a JVM application using Vert.x.

  • ShellVerticle.kt spins up the telnet interface.
  • WebVerticle.kt spins up a simple web server that modifies a bit of state with each request, and registers a command with the shell to reflect that state.
  • VertxPlaygroundMain.kt is the entrypoint for the application which mounts these two verticles.
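
As a rough idea of what these three pieces might contain, here is a condensed single-file sketch; it is not the sample application's actual code. It assumes vertx-core and vertx-shell (3.x or 4.x) are on the classpath. The telnet port 5000, the response body, and the wording of the command output are illustrative assumptions; port 8080 and the request-count command name come from the demo in this post.

```kotlin
import io.vertx.core.AbstractVerticle
import io.vertx.core.Vertx
import io.vertx.ext.shell.ShellService
import io.vertx.ext.shell.ShellServiceOptions
import io.vertx.ext.shell.command.CommandBuilder
import io.vertx.ext.shell.command.CommandRegistry
import io.vertx.ext.shell.term.TelnetTermOptions
import java.util.concurrent.atomic.AtomicLong

// ShellVerticle.kt (sketch): exposes the Vert.x shell over telnet.
class ShellVerticle : AbstractVerticle() {
    override fun start() {
        // Bind to localhost only, so the interface stays private to the box.
        // Port 5000 is an assumption, not taken from the original sample.
        val telnet = TelnetTermOptions().apply {
            setHost("localhost")
            setPort(5000)
        }
        ShellService.create(vertx, ShellServiceOptions().setTelnetOptions(telnet)).start()
    }
}

// WebVerticle.kt (sketch): counts requests and registers a command that reports the count.
class WebVerticle : AbstractVerticle() {
    private val requestCount = AtomicLong()

    override fun start() {
        // Every HTTP request bumps the counter: this is the state we want to inspect.
        vertx.createHttpServer()
            .requestHandler { req ->
                requestCount.incrementAndGet()
                req.response().end("ok\n")
            }
            .listen(8080)

        // The handler reads the counter at the moment the operator runs the command.
        val command = CommandBuilder.command("request-count")
            .processHandler { process ->
                process.write("requests served: ${requestCount.get()}\n")
                process.end()
            }
            .build(vertx)

        // The shared registry makes the command visible to shells on this Vert.x instance.
        CommandRegistry.getShared(vertx).registerCommand(command)
    }
}

// VertxPlaygroundMain.kt (sketch): deploys both verticles on one Vert.x instance.
fun main() {
    val vertx = Vertx.vertx()
    vertx.deployVerticle(ShellVerticle())
    vertx.deployVerticle(WebVerticle())
}
```

With this running, `telnet localhost 5000` should open the Vert.x shell, where request-count is available alongside the shell's built-in commands.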

We can see this in action across three terminal windows. From top to bottom:

  • Window 1 is running the application
  • Window 2 is sending a curl request to http://localhost:8080 every 2 seconds, slowly incrementing our request count
  • Window 3 is our interactive telnet session, where we run the request-count command we registered in WebVerticle.kt. It echoes back the count at the time the command is processed.

Danger and Responsibility

While the demo application above merely reads a value from the service, it's equally possible to write one. This power comes with a burden of responsibility. You generally don't want to make out-of-band, volatile configuration changes (ones that wouldn't survive a restart) to your service. Such changes should usually go through a proper change management system to ensure consistency and auditability across the entire service fleet, especially in the face of restarts and future deployments.

But such changes can be carefully applied in a diagnostic context. For example, it's reasonable to manually enable a backpressure mechanism so the load balancer in front of your server stops sending it new traffic and lets in-flight requests drain, letting you diagnose the live machine without risking customer impact.
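
One way such a switch could look is sketched below, under the assumption that the load balancer removes a server from rotation when its health check stops returning 200. The DrainSwitch class, the drain command syntax, and the status codes are all hypothetical. The sketch is deliberately library-free; the handle function is the part you would call from a shell command's process handler.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical drain switch: class name, command syntax, and status codes are assumptions.
class DrainSwitch {
    // Volatile by design: the flag lives only in memory and resets on restart.
    private val draining = AtomicBoolean(false)

    // Status code the health-check endpoint would return to the load balancer.
    fun healthStatus(): Int = if (draining.get()) 503 else 200

    // Handles `drain on|off|status` and returns the line to echo back to the operator.
    fun handle(args: List<String>): String = when (args.firstOrNull()) {
        "on" -> {
            draining.set(true)
            "draining: health check now returns 503"
        }
        "off" -> {
            draining.set(false)
            "serving: health check now returns 200"
        }
        "status", null -> "draining=${draining.get()}"
        else -> "usage: drain [on|off|status]"
    }
}

fun main() {
    val switch = DrainSwitch()
    println(switch.handle(listOf("status"))) // prints "draining=false"
    println(switch.handle(listOf("on")))     // prints "draining: health check now returns 503"
    println(switch.healthStatus())           // prints "503"
}
```

Because the flag is never persisted, a restart puts the server back into rotation, which keeps this kind of change confined to the diagnostic session.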

Conclusion

I hope this preview sparked your imagination about what's possible.

  • Imagine being able to dynamically load and unload ephemeral workloads atop your service.
  • Imagine being able to manually issue commands to approve gated, potentially damaging operations.
  • Imagine being able to dynamically reconfigure your application to test failure scenarios in real time.
  • Imagine being able to tap into a live WebSocket connection to inspect and debug client-side state.

Vert.x makes adding operator interfaces to JVM applications trivially easy. Because Vert.x is implemented as a library rather than a framework, adding it to an existing application is relatively simple regardless of your other technology choices. Once you have this in your tool-kit, you begin to see possibilities everywhere.