The First Bug on Mars: OS Scheduling, Priority Inversion, and the Mars Pathfinder

12 min readJun 13, 2022

Some debris from a Mars landing, as viewed by a Rover.

“It’s like The Man in the High Castle for Earth’s history.” — Rebecca Boyle, The Atlantic.

From its blood-like hue to its potential to sustain life, Mars has intrigued humankind since we first set eyes on this star-like object in the night sky that stands out from the other wanderers, in part because it is so obviously red, a ruddy, unblinking dot hanging with an air of menace.

The next planet over has been a fixture in myths dating to the Babylonians, who called it Nergal, after the god of destruction. This figure also appears in various forms in Greek, Norse, and Hindu mythologies. To the ancient Romans, the planet was symbolic of blood and war inspiring them to name it after their God of War. To ancient Chinese, it was Ying-huo, the Shimmering Planet. To many people today, the red planet may hold the key to a bright new future for humanity.

The story of Mars began about 4.5 billion years ago when gas and dust swirled together to form the fourth planet from the sun. Its reddish blue hue — which comes from oxidized iron, in the same chemical reaction that turns blood red — set the planet apart from its shimmering siblings, each compelling in its own way, but none other tracing a ruddy arch through Earth’s heavens. Its omnipresence in our sky made Mars a prime target as soon as we figured out how to use glass to make the night sky’s countenance appear larger thus in the late 1800s, a surface full of intriguing features was revealed by telescopes. Scientists back then ascribed these patterns and landforms to a bustling Martian civilization but we now know there are no artificial constructions on Mars. We’ve also learned that, until 3.5 billion years ago, the dry, toxic planet we see today might have once been as habitable as Earth. The planets rusty colour could be considered symbolic of the planets prime days long past.

Exploring Mars

“Once every 26 months, Earth and Mars are aligned in a way that minimizes travel times and expense, enabling spacecraft to make the interplanetary journey in roughly half a year.”

We have been attempting to land spacecraft on Mars since the 1960's, and almost all of them have been looking for life, in one way or another. Space programs from around the world have launched missions to the planet in an attempt to understand its past, present and potential for sustaining life. Early missions were flybys, with spacecraft snapping as many photos as they zoomed past. Later, probes pulled into orbit around Mars; more recently, landers and rovers have touched down on the surface.

But sending a spacecraft to Mars is hard, and landing on the planet is even harder. The history of Mars landings has shown that the planet is anything but hospitable. And that goes for machines just as much as microbes. The thin Martian atmosphere makes descent tricky, and more than 60 percent of landing attempts have failed. More than half of the robots sent to Mars have been destroyed in the process. Most recently, the European Space Agency’s Schiaparelli lander plummeted through the atmosphere in October of 2017, but crashed after it cut its parachute too soon and its hovercraft-like retrorockets didn’t fire long enough.

In 1971, the USSR delivered the first planetary rovers on skis to Mars, whose task was to puncture the surface with a rod — housing a dynamic penetrometer and a radiation densitometer — to see if Mars was solid or liquid dusty. The first probe crashed on November 27 while the second soft-landed on December 2 but did not manage to get out of the “shell” of the lander, thus that attempt did not count. NASA’s Viking 1 and 2 became the first spacecraft to successfully operate on the planet’s surface in 1976, returning photos until 1982. They also conducted biological experiments on Martian soil that were designed to uncover signs of life in space — but their results were inconclusive, and scientists still disagree over how to interpret the data. The first robot rover to land and travel on Mars was the Mars Pathfinder’s Sojourner Rover. It rolled onto Mars’ surface on July 6, 1997.

Mars Pathfinder Mission

NASA’s Mars Pathfinder Lander and the Sojourner Rover on Mars.

“Landing time for Pathfinder was 16:56:55 UT July 4, 1997, at 19 degrees 7 minutes 48 seconds north latitude and 33 degrees 13 minutes 12 seconds west longitude in Ares Vallis, about 12 miles (19 kilometers) southwest of the original target.” — NASA

Launched on 4th December 1996 and landing on Mars’ Ares Vallis on 4th July 1997, the Mars Pathfinder was an ambitious mission to send a lander and a separate remote-controlled rover to the surface of Mars. It was the second of NASA’s Discovery missions launched one month after Mars Global Surveyor and designed as a technology demonstration of a new way to deliver an instrumented lander and the first-ever robotic rover to the surface of the red planet.

Pathfinder was sent on a slightly shorter seven-month trajectory designed for arrival earlier. The main spacecraft included a 23-pound (10.5-kilogram) six-wheeled rover known as Sojourner capable of traveling about 545 yards (500 meters) from the main ship at top speeds of a little less than half-inch per second (1 centimeter per second). The little Sojourner operated on the surface of the planet for 85 days, and proved the technology for the rovers that followed. The mission was widely proclaimed as “flawless” in the early days after its July 4th, 1997 landing on an ancient flood plain in the planet’s northern hemisphere. Successes included its unconventional “landing” — bouncing onto the Martian surface surrounded by airbags, deploying the Sojourner rover, and gathering and transmitting voluminous data back to Earth, including the panoramic pictures that were such a hit on the Web.

In these days of constant, near real-time online information we’d have a front row seat following every nuance of the operation as it happened, but back in 1997, those watching with interest missed one of the mission’s behind the scenes dramas. A few days into the mission, not long after Pathfinder started gathering meteorological data, the spacecraft began experiencing total system resets, each resulting in losses of data. These failures were reported in terms such as “software glitches” and “the computer was trying to do too many things at once”.

CPU Scheduling

“The computer was trying to do too many things at once.”

Almost all computer programs have some alternating cycle of CPU number crunching and waiting for I/O of some kind. Even a simple fetch from memory takes a long time relative to CPU speeds. In a simple system running a single process, the time spent waiting for I/O is wasted, and those CPU cycles are lost forever.

In computing, scheduling is the method by which work is assigned to resources that are necessary to complete it. The work may be virtual computation elements such as threads, processes or data flows which are in turn scheduled on hardware resources such as processors, network links, or expansion cards. A scheduler is what carries out the scheduling activity.

A scheduling system allows one process to use the CPU while another is waiting for I/O, thereby making full use of otherwise lost CPU cycles. The challenge is to make the overall system as “efficient” and “fair” as possible, subject to varying and often dynamic conditions, and where “efficient” and “fair” are somewhat subjective terms, often subject to shifting priority policies. Whenever the CPU becomes idle, it is the job of the CPU Scheduler (a.k.a. the short-term scheduler ) to select another process from the ready queue to run next. The storage structure for the ready queue and the algorithm used to select the next process are not necessarily a FIFO queue. There are several alternatives to choose from, as well as numerous adjustable parameters for each algorithm.

Preemptive vs Non-Preemptive Scheduling

“In general, preemption means prior seizure-of.”

There are several different criteria to consider when trying to select the “best” scheduling algorithm for a particular situation and environment, including: CPU utilization, throughput, turn around time, waiting time, load average, response time.

In general one wants to optimize the average value of a criteria — maximize CPU utilization and throughput, and minimize all the others. However sometimes it may be desirable to do something different, such as to minimize the maximum response time. Sometimes it is most desirable to minimize the variance of a criteria than the actual value. Users are more accepting of a consistent predictable system than an inconsistent one, even if it is a little bit slower.

CPU scheduling decisions take place under one of four conditions:

When a process switches from the running state to the waiting state, such as for an I/O request or invocation of the wait( ) system call.
When a process switches from the running state to the ready state, for example in response to an interrupt.
When a process switches from the waiting state to the ready state, say at completion of I/O or a return from wait( ).
When a process terminates.

With conditions 1 and 4 there is no choice — a new process must be selected. With conditions 2 and 3 there is a choice — to either continue running the current process, or select a different one. If scheduling takes place only under conditions 1 and 4, the system is said to be non-preemptive, or cooperative. Under these conditions, once a process starts running it keeps running, until it either voluntarily blocks or until it finishes. Otherwise the system is said to be preemptive. Windows OS used non-preemptive scheduling up to Windows 3.x, and started using preemptive scheduling with Win95. Mac OS used non-preemptive prior to OSX, and preemptive since then.

Preemptive scheduling can cause problems when two processes share data, because one process may get interrupted in the middle of updating shared data structures. Preemption can also be a problem if the kernel is busy implementing a system call — e.g. updating critical kernel data structures — when the preemption occurs. Most modern UNIX based operating systems deal with this problem by making the process wait until the system call has either completed or blocked before allowing the preemption. Unfortunately this solution is problematic for real-time systems, as real-time response can no longer be guaranteed.

Priority Inversion Problem

“Priority inversion is a bug that occurs when a high priority task is indirectly preempted by a low priority task.”

When the high priority task at that instance seizes the currently running task, it is known as preemptive scheduling. Preemptive scheduling is priority based and only possible on hardware that supports a timer interrupt.

A challenging scheduling problem arises when a high-priority process gets blocked waiting for a resource that is currently held by a low-priority process. If the low-priority process gets preempted by one or more medium-priority processes, then the high-priority process is essentially made to wait for the medium priority processes to finish before the low-priority process can release the needed resource, causing a priority inversion. If there are enough medium-priority processes, then the high-priority process may be forced to wait for a very long time.

In the simple case, the high priority task (Task H) would be blocked as long as the low priority task (Task L) held the lock. This is known as “bounded priority inversion,” as the length of time of the inversion is bounded by however long the low priority task is in the critical section (holding the lock).

Unbounded priority inversion occurs when a medium priority task (Task M) interrupts Task L while it holds the lock. It’s called “unbounded” because Task M can now effectively block Task H for any amount of time, as Task M is preempting Task L (which still holds the lock).

Some of the most popular solutions to the priority inversion problem are: a priority-inheritance protocol and a priority ceiling level.

Under priority inheritance protocol, a low-priority process holding a resource for which a high-priority process is waiting will temporarily inherit the high priority from the waiting process. This prevents the medium-priority processes from preempting the low-priority process until it releases the resource, blocking the priority inversion problem.

Priority ceiling protocol involves assigning a “priority ceiling level” to each resource or lock. Whenever a task works with a particular resource or takes a lock, the task’s priority level is automatically boosted to that of the priority ceiling associated with the lock or resource. The priority ceiling is determined by the maximum priority of any task that needs to use the resource or lock.

Pathfinder Hardware & Software

The hardware architecture for Mars Pathfinder

“Even when you think you’ve tested everything that you can possibly imagine, you’re wrong” — Glenn E. Reeves (Pathfinder’s Software Team Leader)

The embedded computer on board the Sojourner rover was based around the 2 MHz Intel 80C85 CPU with 512 KB of RAM and 176 KB of flash memory solid-state storage, running a cyclic executive. The computer of the Pathfinder lander was a Radiation Hardened IBM RISC 6000 Single Chip (Rad6000 SC) CPU with 128 MB of RAM and 6 MB of EEPROM with Wind River VxWorks RTOS as its operating system.

The functionalities and equipment on Pathfinder were integrated using two buses: the VME bus which connected the CPU, radio (for communication) and the camera and the MIL-STD-1553 bus which connected the VME bus, the lander part and the cruse part. The lander part consisted of the ASI/MET, radar altimeter and accelerometers while the cruse part consisted of a sun scanner, a star scanner, valves, and thrusters. Communication to and from the CPU, radio and camera was done using the 1553 bus that connected the lander and cruse parts.

VxWorks RTOS provided a preemptive fixed-priority scheduling. Data from the lander was transmitted through the 1553 bus to the radio for onward transmission back to Earth. On the other hand, command signals from the CPU moved through the 1553 bus to the cruise and the lander. The RTOS employed a cyclic scheduler clocked @ 8 Hz rate — i.e., the same schedule is repeated in every 0.125 seconds. The management of 1553 bus was implemented as two important and critical tasks:

bc_sched: bus scheduler task that decided who would transmit and transmits the schedule for the next cycle.
bc_dist: bus distribution task decides who would receive.

There were other tasks, such as, the “communication_task” for the radio that transmited data back to earth and the science function “ASI/MET task”. The fixed priority ordering that was statically assigned by the engineers had the following relationship:

prio(“bc_sched”) > prior(“bc_dist”) > prio(“communication_task”) > prio(“ASI/MET_task”)

Using the watchdog timer, the bc_sched task checked at the beginning of its execution, whether the bc_dist task had completed its execution in the previous cycle. As it turned out, this critical test was violated on Pathfinder’s VxWorks RTOS causing resets every time Pathfinder started collected weather data. After debugging on the Pathfinder replica at JPL, engineers discovered that the cause of this malfunctioning was a priority inversion problem where the higher priority bc_dist task was blocked by the much lower priority ASI/MET task whose solution was priority inheritance.

The priority inheritance had been disabled — the flag for the mutex was set to “off” — in VxWorks RTOS pipe semaphores for performance reasons. The on-board software was updated from earth and semaphore parameters were changed to enable priority inheritance. The system was tested for impact on performance, change in behaviour, and for other possible anomalies. Engineers at River Works concluded that the performance impact would be minimal and that the behavior of the system would not change. VxWorks contained a C language interpreter to execute statements on the fly during debugging. The JPL engineers decided to launch the spacecraft with this feature still enabled. A short C program was uploaded to the spacecraft, which when interpreted, changed the values of the mutex flag for priority inheritance from false to true which marked the mitigation of the resetting problem.