Sunday, March 31, 2024

Where are the Supply Chain Safe Programming Languages?

Programming languages currently offer few defences against supply chain attacks where a malicious third-party library compromises a program. As I write this, the open source community is trying to figure out the details of the xz-utils backdoor, but there is a long history of supply chain attacks. High-profile incidents have made plain the danger of shipping software built from large numbers of dependencies, many of them unaudited and under little scrutiny for malicious code. In this post I will share ideas on future supply chain safe programming languages.

Supply Chain Safe Programming Languages?

I'm using the term Supply Chain Safe Programming Languages for languages that defend against supply chain attacks and allow library dependencies to be introduced with strong guarantees about what the dependencies can and cannot do. This type of programming language is not yet widely available as of March 2024, to the best of my knowledge.

Supply chain safety is often associated with software packaging and distribution techniques for verifying that software was built from known good inputs. Although adding supply chain safety tools on top of existing programming languages is a pragmatic solution, I think future progress requires addressing supply chain safety directly in the programming language.

Why today's languages are not supply chain safe

Many existing languages have a module system that gives the programmer control over the visibility of variables and functions. By hiding variables and functions from other modules, one might hope to achieve isolation so that a component like a decompression library could not read a sensitive variable from the program. Unfortunately this level of isolation between components is not really available in popular programming languages today, even in languages with public/private visibility features. Visibility is more of a software engineering tool for keeping programs decoupled than an isolation mechanism that actually protects components of a program from each other. There are many ways to bypass visibility.
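As a minimal sketch of one such bypass, here is how it could look in Python. The Session class and the decompress function are hypothetical names invented for illustration. The "private" attribute is only name-mangled, and the garbage collector's object list lets any code in the process find the instance without ever being handed a reference to it.

```python
import gc


class Session:
    """Application code that tries to hide a secret behind a 'private' name."""

    def __init__(self, token):
        # A double underscore triggers name mangling; it does not enforce privacy.
        self.__token = token

    def is_authenticated(self):
        return bool(self.__token)


def decompress(data):
    """Hypothetical third-party function that should only ever see `data`."""
    stolen = None
    # gc.get_objects() lists objects tracked by the collector, including
    # instances the caller never passed in.
    for obj in gc.get_objects():
        if type(obj).__name__ == "Session":
            # The mangled attribute name is predictable: _<ClassName>__<attr>.
            stolen = obj._Session__token
            break
    return data, stolen


session = Session("s3cret")
output, leaked = decompress(b"payload")
```

After this runs, leaked holds the token even though decompress was only given a byte string. Nothing in the language stopped the lookup.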

The fundamental problem is that existing programming languages do not even acknowledge that programs often consist of untrusted components. Compilers and interpreters currently treat the entire input source code as having more or less the same level of trust. Here are some of the ways in which today's programming languages fall short:

  • Unsafe programming languages like C and C++, and even Rust through its unsafe blocks, allow the programmer to bypass the type system to do pretty much anything.
  • Dynamic languages like Python and JavaScript have introspection and monkey patching abilities that allow the programmer to hook other parts of the program and escape attempts at isolation.
  • Build systems and metaprogramming facilities like macros allow untrusted components to generate code that executes in the context of another component.
  • Standard libraries provide access to spawning new programs, remapping virtual memory, loading shared libraries written in unsafe languages, hijacking function calls through the linker, raw access to the program's memory space with /proc/self/mem, and so on. All of these can bypass language-level mechanisms for isolating components in a program.
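To make the monkey patching point from the list above concrete, here is a small Python sketch. PaymentGateway and install_hook are hypothetical names. In a real attack the hook would run as a side effect of importing the untrusted dependency rather than through an explicit call.

```python
class PaymentGateway:
    """Application code that the program trusts."""

    def charge(self, card_number, amount):
        return f"charged {amount} to card ending {card_number[-4:]}"


# Where the hypothetical malicious dependency keeps what it observes.
captured = []


def install_hook():
    """Stands in for code that runs when an untrusted dependency is imported."""
    original = PaymentGateway.charge

    def spying_charge(self, card_number, amount):
        captured.append(card_number)  # exfiltration would happen here
        return original(self, card_number, amount)

    # Classes are mutable at run time, so the method can simply be replaced.
    PaymentGateway.charge = spying_charge


install_hook()
result = PaymentGateway().charge("4111111111111111", 25)
```

The caller sees the normal return value and has no indication that the method was replaced, while captured now contains the full card number.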

Whatever your current language, it's unlikely that the language itself allows you to isolate components of a program. The best approach we have today for run-time isolation is through sandboxing. Examples of sandboxing approaches include seccomp(2), V8 Isolates for JavaScript, invoking untrusted code in a WebAssembly runtime, or the descendants of chroot(2).

Sandboxes are not supported directly by the programming language and have a number of drawbacks and limitations. Integrating sandboxing into programs is tedious, so sandboxes are primarily used on the most critical attack surfaces like web browsers or hypervisors. There is usually a performance overhead associated with interacting with the sandbox because data needs to be marshalled or copied. Sandboxing is an opt-in mechanism that doesn't raise the bar of software in general. I believe that supply chain safe programming languages could offer similar isolation but as the default for most software.

What a Supply Chain Safe Programming Language looks like

The goal of a supply chain safe programming language is to isolate components of a program by default. Rather than leaving supply chain safety outside the scope of the language, the language should allow components to be integrated with strong guarantees about what effects they can have on each other. There may be practical reasons to offer an escape hatch to unsafe behavior, but the default needs to be safe.

At what level of granularity should isolation operate? I think modules are too coarse-grained because they are often collections of functions that perform very different types of computation requiring different levels of access to resources. The level of granularity should at least go down to the function level within a component, although even achieving module-level granularity would be a major improvement over today's standards.

An example is that a hash table lookup function should be unable to connect to the internet. That way the function can be used without fear of it becoming a liability if it contains bugs or its source code is manipulated by an attacker.

A well-known problem in programming language security is that the majority of languages expose ambient capabilities to all components in a program. Ambient capabilities provide access to resources that are not explicitly passed in to the component. Think of a file descriptor in a POSIX process that is available to any function in the program, including a string compare function that has no business manipulating file descriptors.
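The same problem can be sketched in Python using the process environment instead of a file descriptor. The compare_strings function and the DEMO_API_KEY variable are made up for this illustration. The function is handed two strings and nothing else, yet it can reach process-wide state because the os module is available to every function.

```python
import os

# What the hypothetical dependency manages to observe.
observed = []


def compare_strings(a, b):
    """Hypothetical dependency: it should only need its two arguments."""
    # Ambient authority: os.environ was never passed in, yet it is reachable.
    observed.append(os.environ.get("DEMO_API_KEY"))
    return a == b


# The application sets a secret that has nothing to do with string comparison.
os.environ["DEMO_API_KEY"] = "hunter2"
equal = compare_strings("abc", "abc")
```

The comparison returns the expected result, and observed ends up holding the secret. The function signature gives the caller no hint that this access is possible.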

Capability-based security approaches are a solution to the ambient capabilities problem in languages today. Although mainstream programming languages do not offer capabilities as part of the language, there have been special-purpose and research languages that demonstrated that this approach works. In a type-safe programming language with capability-based security it becomes possible to give components access to only those resources that they require. Usually type safety is the mechanism that prevents capabilities from being created out of thin air, although other approaches may be possible for dynamic languages. The type system will not allow a component to create for itself a new capability that the component does not already possess.
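The shape of capability-style code can be sketched in Python, with one large caveat: Python does not enforce any of this, since load_greeting could still import os and bypass the object it was given. ReadOnlyDir and load_greeting are hypothetical names. The sketch only shows the calling convention, where authority arrives as an explicit argument and is narrowed to one directory.

```python
import tempfile
from pathlib import Path


class ReadOnlyDir:
    """Hypothetical capability: permission to read files under one directory."""

    def __init__(self, root):
        self._root = Path(root).resolve()

    def read_text(self, name):
        target = (self._root / name).resolve()
        # Refuse paths that escape the granted directory, e.g. "../secret.txt".
        if self._root not in target.parents:
            raise PermissionError(f"{name!r} is outside the granted directory")
        return target.read_text()


def load_greeting(files):
    """Library code in capability style: it only uses what it is handed."""
    return files.read_text("greeting.txt").strip()


with tempfile.TemporaryDirectory() as tmp:
    granted = Path(tmp) / "granted"
    granted.mkdir()
    (granted / "greeting.txt").write_text("hello\n")
    (Path(tmp) / "secret.txt").write_text("do not read\n")

    capability = ReadOnlyDir(granted)
    greeting = load_greeting(capability)

    try:
        capability.read_text("../secret.txt")
        escaped = True
    except PermissionError:
        escaped = False
```

In a language that enforced capabilities, ReadOnlyDir would be the only way for load_greeting to touch the filesystem. Here it is a convention that a malicious component is free to ignore, which is exactly the gap described above.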

Capability-based security addresses safety at runtime, but it does not address safety at compile time. If we want to compose programs from untrusted components then it is not possible to rely on today's build scripts, code generators, or macro systems. The problem is that they can be abused by a component to execute code in the context of another component.

Compile-time supply chain safety means isolating components so their code stays within their component. For example, a "leftpad" macro that pads a string literal with leading spaces would be unsafe if it can generate code that is compiled as part of the main program using the macro. Similarly, a build script for the leftpad module must not be able to affect or escape the build environment.

Macros, build scripts, code generators, and so on are powerful tools that programmers find valuable. The challenge for supply chain safe programming languages is to harness that power so that it remains convenient to use without endangering safety. One example solution is running build scripts in an isolated environment that cannot affect other components in the program. This way a component can take advantage of custom build-time behavior without endangering the program. However, it is unclear to me how far inter-component facilities like macros can be made safe, if at all.
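As a rough sketch of the build script idea, the following Python runs a hypothetical dependency's build script in a child process. The child gets its own scratch directory and only an allowlisted set of environment variables. BUILD_SCRIPT, run_build_script, and DEMO_API_KEY are all invented for illustration. This is not a real sandbox, because the child can still open any file the user can. It only shows the intended shape: the build step sees a reduced environment and hands back an artifact.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical build script shipped by a dependency. It records whether it
# can see a secret from the parent's environment.
BUILD_SCRIPT = """
import os
from pathlib import Path
seen = os.environ.get("DEMO_API_KEY", "<not visible>")
Path("generated.txt").write_text(seen)
"""

# Variables the interpreter may need in order to start on some platforms.
ALLOWED_ENV = ("PATH", "SYSTEMROOT", "LD_LIBRARY_PATH")


def run_build_script(source, scratch_dir):
    """Run `source` in a child process and return the artifact it wrote."""
    script = Path(scratch_dir) / "build.py"
    script.write_text(source)
    child_env = {k: os.environ[k] for k in ALLOWED_ENV if k in os.environ}
    # -I is Python's isolated mode: it ignores PYTHON* environment variables
    # and the user site-packages directory.
    subprocess.run(
        [sys.executable, "-I", str(script)],
        cwd=scratch_dir,
        env=child_env,
        check=True,
        timeout=30,
    )
    return (Path(scratch_dir) / "generated.txt").read_text()


os.environ["DEMO_API_KEY"] = "hunter2"
with tempfile.TemporaryDirectory() as scratch:
    build_output = run_build_script(BUILD_SCRIPT, scratch)
```

The secret set in the parent process is not passed to the child, so the artifact records that the variable was not visible. A language-level solution would need much stronger confinement than this, covering the filesystem and network as well.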

Conclusion

I don't have the answers or even a prototype, but I think supply chain safe programming languages are an inevitability. Modern programs are composed of many third-party components yet we do not have effective techniques for confining components. Languages treat the entire program as trusted rather than as separate untrusted components that must be isolated.

Hopefully we will begin to see new mainstream programming languages emerge that are supply chain safe, not just memory safe!