Slinging Mud: "Graceful" reboots in a Go world

I participated in MUDJAM on Itch.io last year around this time, putting together a very simple, low-feature deliverable, BerkMud. If the title doesn't tell you, it was How to Train Your Dragon themed. It wasn't my first foray into building a mud, but I wanted to try something new, so I wrote it in Go, which was completely new to me.

A year later, I got to thinking that there was a lot I had still wanted to do with it. One of those things was figuring out how some muds I used to play (DarkCastle) were able to reboot/upgrade the server without killing people's connections.

Each player in a mud has their own persistent socket connection to the server. This socket is the sole medium through which the entire game is played. Normally, when a process is killed, all of its open sockets are closed, so rebooting the server to pick up code changes forces every current player to reconnect. Hot-rebooting, now commonly associated with the word "grace" (as you'll see below), solves this problem. But how?

My initial thought was to use a proxy server that would handle all external socket connections and pass data between client and server. This seemed like the obvious solution but also a very convoluted one. After thinking a while about how I would implement it, I did some research into hot-rebooting to see if that's how people actually did it. It wasn't; there's a much easier way: socket inheritance.

I found four really helpful sites that gave me enough to go on to get hot-rebooting working in a night. Eerily, three of the four used the exact same word, "grace", to refer to the process. It's probably not actually strange, because "graceful" is a commonly used adjective in CS.

I'll skip over the trail of realization and jump into the nitty gritty of how it works - which is mostly me paraphrasing from Erwin's Hot Reboot snippet.

  1. Spawn a new process inheriting file descriptors
  2. Re-establish connections with the inherited file descriptors
  3. Link connections to original players

Seemed easy enough. Not having used Go for a year was the most challenging part of getting everything to work. The gist of it looks like the following code.

Rebooting the server

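// Create a temporary file to pass player/fd relations
tempFile, err := ioutil.TempFile("", "reboot-")
// TODO : Handle temp file error (err is overwritten by the next call)

// Grab the file behind the listener socket so the new process can inherit it
file, err := this.listener.(Filer).File()
// TODO : Handle file error gracefully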

// Build the argument list for the new spawned process, passing the number of file descriptors inherited and the hot reboot file
args := []string{
	"-hotreboot",
	fmt.Sprintf("-fds=%d", len(this.connectionByPlayers)),
	fmt.Sprintf("-rebootFile=%s", tempFile.Name()),
}
// TODO : Hardcoded path to executable
path := "/Users/berdon/workspace/berk/berk"

// Setup the exec command
cmd := exec.Command(path, args...)
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr

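// Iterate through each connected player writing out the player/fd relationship and building a list of os.Files to inherit
fdIndex := 0
files := make([]*os.File, 0, len(this.connectionByPlayers))
for player, value := range this.connectionByPlayers {
	connFile, err := value.Connection.(Filer).File()
	if err != nil {
		// TODO : Handle error case more gently; skipping a player here would make the -fds count above wrong
		log.Fatalf("Hot Reboot: Failed to get file for player %d, error: %v", player.Id(), err)
	}

	// One line per player: "<player id> <index into the inherited connection files>"
	fmt.Fprintf(tempFile, "%d %d\r\n", player.Id(), fdIndex)
	fdIndex++

	files = append(files, connFile)
}
tempFile.Close()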

// Add the inherited files
cmd.ExtraFiles = append([]*os.File{file}, files...)

// Start the new process
err = cmd.Start()
if err != nil {
	log.Fatalf("Hot Reboot: Failed to launch, error: %v", err)
}

Hopefully the comments speak for themselves.

The general flow is to create a temporary file that records the file descriptor/player mappings for the next process. It then spawns a new process, passing it the file descriptors, the mapping file, and the number of descriptors. Everything else (resuming state, killing the old process) is left up to the new process.
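One thing the snippet never shows is `Filer`. Here is a minimal sketch of what it could look like, assuming it is nothing more than the `File()` method that `*net.TCPListener` and `*net.TCPConn` both expose. The `roundTrip` helper is hypothetical and runs entirely inside one process, just to show that a listener rebuilt with `net.FileListener` is bound to the same address as the original. `File()` isn't supported on Windows, so this assumes a Unix-like system.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// Filer is a guess at the interface the reboot code type-asserts to:
// anything that can hand back a duplicate of its descriptor as an *os.File.
type Filer interface {
	File() (*os.File, error)
}

// roundTrip (hypothetical helper) turns a listener into an *os.File and back,
// returning the address before and after the trip.
func roundTrip() (string, string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", "", err
	}
	defer ln.Close()

	filer, ok := ln.(Filer)
	if !ok {
		return "", "", fmt.Errorf("listener %T has no File method", ln)
	}
	f, err := filer.File()
	if err != nil {
		return "", "", err
	}
	// net.FileListener works on its own copy, so f can be closed afterwards.
	defer f.Close()

	restored, err := net.FileListener(f)
	if err != nil {
		return "", "", err
	}
	defer restored.Close()

	return ln.Addr().String(), restored.Addr().String(), nil
}

func main() {
	before, after, err := roundTrip()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("original listener:", before)
	fmt.Println("restored listener:", after)
}
```

In the real reboot the `*os.File` crosses a process boundary through `cmd.ExtraFiles` instead of staying in the same process, but the conversion on either side is the same.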

Initializing the server's state

// Resume the listener socket (it's fd 3 because stdin, stdout, and stderr take up 0, 1 and 2)
fd := os.NewFile(3, "")
listener, err := net.FileListener(fd)
if err != nil {
	log.Fatal(err)
}

// Close the fd (we don't need it anymore; FileListener works on its own copy)
if err := fd.Close(); err != nil {
	log.Fatal(err)
}

// Load the reboot file
log.Printf("Reboot file at %s", rebootFilePath)
rebootFile, err := os.Open(rebootFilePath)
if err != nil {
	log.Fatal(err)
}
// Only defer the close once we know the open succeeded
defer rebootFile.Close()

// Read in the player/fd data
scanner := bufio.NewScanner(rebootFile)
scanner.Split(bufio.ScanLines)

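// Make a lookup table from fd index to player id
fdPlayerMap := make(map[int]int64)
for scanner.Scan() {
	values := strings.Split(scanner.Text(), " ")
	if len(values) != 2 {
		continue // Skip malformed lines
	}

	idValue, idErr := strconv.ParseInt(values[0], 10, 64)
	fdValue, fdErr := strconv.Atoi(values[1])
	if idErr != nil || fdErr != nil {
		continue // Skip lines that don't hold two numbers
	}
	fdPlayerMap[fdValue] = idValue
}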

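// Resume individual connections
for i := 0; i < fds; i++ {
	// Player connections follow the listener, so they start at fd 4
	fd := os.NewFile(uintptr(4+i), "")
	conn, err := net.FileConn(fd)
	// Close the fd either way (FileConn works on its own copy)
	fd.Close()
	if err != nil {
		log.Printf("Hot Reboot: Failed to resume fd index %d, error: %v", i, err)
		continue
	}

	// Load player state for fdPlayerMap[i], attach conn, resume player code...
}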

// Kill the original parent process
parent := syscall.Getppid()
syscall.Kill(parent, syscall.SIGTERM)

The new process, if in hot reboot "mode", rebuilds the listener from the inherited socket, iterates through the mapping file and inherited file descriptors reestablishing connections, and then kills the old server. All very straightforward.

It's something simple but I can't adequately describe how gratifying it was to see:

HP: 10/10, MP: 10/10, MV: 10/10> reboot

HP: 10/10, MP: 10/10, MV: 10/10> 
Your body rematerializes.

HP: 10/10, MP: 10/10, MV: 10/10>