ChaosDB Vulnerability

This is commentary on work done by the Wiz Research Team published here. You should read that article carefully before continuing. I was motivated to write this post because the incident is a great example of how the theory of security best practices in software development relates to the ground reality of how attackers infiltrate systems.

Trust Minimization

Bug #1 was the entry point to the attack: users were permitted to execute arbitrary C# code as root on Jupyter notebooks. This was probably a configuration error, judging by the fact that users otherwise executed code as cosmosuser. There are two key takeaways from this point:

(1) Configuration errors are common in practice. Despite our best efforts to have zero errors, we can expect that in a sufficiently complex system, some errors will always creep in. Security best practices therefore have to operate in the context of a reality where errors permeate the system and parts of the system can break at any point. It is not enough to try to prevent errors; it is vital that we detect problems quickly and design the system in a way that limits the blast radius when things break.

Zero Trust is a term that refers to the extreme version of the same idea. In the security realm, trust is a bad thing: it means we expect the object of our trust to operate without error, which deviates from the reality we see around us (note that intentions don’t matter). We can avoid taking an ideological stance on this subject by accepting that if we want to build a robust system, we need to ensure that parts of the system don’t trust each other unless reasonably necessary.

(2) Systems that allow execution of arbitrary code most likely allow execution of arbitrary actions. This might seem like a tautological statement but it has deeper meaning. On one side of the equation, software developers might see arbitrary code execution systems as the epitome of loose coupling, a desirable property of the system. It means that parts of the system don’t need to know about each other, and can evolve mostly independently. On the other hand, no one really wants to allow arbitrary code to be run, because that would mean allowing the system to be taken over (for example, allowing operating system files to be modified).

In reality, we have a common objective of allowing the user of the system to perform specific authorized actions while limiting all others, though it’s not always clear ahead of time what actions need to be authorized. Our choices are (a) to identify what actions need to be authorized and allow just those, or (b) to identify what actions should be considered unauthorized and disallow just those. When people talk about supporting arbitrary code execution, they are, in fact, choosing (b) over (a) at some level within the virtual machine stack.

The trouble with approach (b) is that every level of the stack is riddled with errors and escape hatches, and it is rather difficult to plug all the holes, or even be aware of them in the first place. It may be best to avoid designing systems of this kind.
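The difference between (a) and (b) can be made concrete with a minimal sketch. The action names below are hypothetical and are not taken from the Wiz article; the point is only how each approach treats an action that nobody anticipated.

```python
# Hypothetical action names, for illustration only.
ALLOWED_ACTIONS = {"read_document", "write_document", "run_query"}
DENIED_ACTIONS = {"modify_os_files", "change_firewall_rules"}


def allowlist_permits(action: str) -> bool:
    """Approach (a): permit only actions that were explicitly authorized."""
    return action in ALLOWED_ACTIONS


def denylist_permits(action: str) -> bool:
    """Approach (b): permit everything that was not explicitly forbidden."""
    return action not in DENIED_ACTIONS


# An action nobody thought about when the lists were written.
unforeseen = "query_internal_metadata_service"

print(allowlist_permits(unforeseen))  # False: unknown actions are rejected by default
print(denylist_permits(unforeseen))   # True: unknown actions slip through
```

Under (a), forgetting an entry makes a legitimate action fail, which someone will notice and report. Under (b), forgetting an entry leaves a hole that stays silent until an attacker finds it.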

Blast Radius

Bug #2 allowed the root user to bypass firewall rules and gain access to forbidden network destinations. The problem with the configuration was that these firewall rules were set up on the Jupyter notebook container itself. As the article points out, a better alternative would have been to enforce these rules outside the Jupyter notebook container. This example demonstrates the importance of designing the system with a keen understanding of blast radius. A typical design process might start by asking:

Q1. Is network access restricted?
A1. Yes, via iptables rules.
Q2. Can a user bypass the iptables rules?
A2. No, they have to be root for that.

In practice, good security design needs to consider additional questions. For instance:

Q3. If the user becomes root, can we still limit the damage?
A3. Yes, we can implement the control outside the container.

It is often helpful to think of these controls as defenses. Defenses are designed to be robust, but they are not perfect, and can fail under clever and sustained attacks. When a defense fails, the system is of course weakened, but it should not fail catastrophically. In parallel, failures of defenses should be monitored, and quick action should be taken to fortify the system.
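The questions above can be modeled as a toy simulation. This is a sketch under simplifying assumptions, not a description of the actual Cosmos DB setup: the forbidden address is a made-up documentation-range IP, and "root can remove in-container rules" is reduced to a single boolean.

```python
from dataclasses import dataclass

# Hypothetical forbidden destination (a documentation-range address), for illustration only.
FORBIDDEN_DESTINATIONS = {"192.0.2.10"}


@dataclass(frozen=True)
class Connection:
    destination: str
    user_is_root: bool


def in_container_rules_allow(conn: Connection) -> bool:
    """Rules that live inside the container: root can delete them, so they do not bind root."""
    if conn.user_is_root:
        return True
    return conn.destination not in FORBIDDEN_DESTINATIONS


def external_rules_allow(conn: Connection) -> bool:
    """Rules enforced outside the container: privileges inside the container are irrelevant."""
    return conn.destination not in FORBIDDEN_DESTINATIONS


def reachable(conn: Connection, external_enforcement: bool) -> bool:
    """A connection succeeds only if every layer that is in place allows it."""
    if not in_container_rules_allow(conn):
        return False
    if external_enforcement and not external_rules_allow(conn):
        return False
    return True


attacker = Connection(destination="192.0.2.10", user_is_root=True)
print(reachable(attacker, external_enforcement=False))  # True: one failed defense exposes everything
print(reachable(attacker, external_enforcement=True))   # False: the outer layer still holds
```

The second layer does not make the first bug any less of a bug; it only ensures that a single failure does not decide the outcome.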

Least Privilege Principle

Later on, the article points out another example of unexpectedly large blast radius:

“…we expected to get two keys: a private key and a public key used to encrypt and decrypt the protected settings. […] In reality, we got back 25 keys.”

Could the impact have been reduced? For one thing, it isn’t clear if the service truly needed to vend all 25 keys to all clusters. It may also have been judicious to create distinct secrets for distinct purposes, and provide access to only the ones that the particular sub-system needed to do its job correctly. It’s easy to get lazy and assume that there is an ‘administrator’ with global access and super-powers, but this is a recipe for disaster in a world where mistakes are inevitable.
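One way to picture "distinct secrets for distinct purposes" is a secret store that hands each sub-system only what it has been explicitly granted. The sub-system names, secret names, and the secrets_for helper below are all hypothetical; this is a sketch of the idea, not of how Azure vends certificates.

```python
# Hypothetical secret and sub-system names, for illustration only.
SECRETS = {
    "settings-encryption-key": "<secret material>",
    "notebook-signing-key": "<secret material>",
    "billing-api-token": "<secret material>",
}

# Each sub-system is granted only the secrets it needs to do its job.
GRANTS = {
    "settings-service": {"settings-encryption-key"},
    "notebook-service": {"notebook-signing-key"},
}


def secrets_for(subsystem: str) -> dict[str, str]:
    """Return only the secrets granted to this sub-system; unknown callers get nothing."""
    granted = GRANTS.get(subsystem, set())
    return {name: SECRETS[name] for name in granted}


print(sorted(secrets_for("settings-service")))  # ['settings-encryption-key']
print(secrets_for("unknown-service"))           # {}
```

With this shape, compromising one sub-system yields that sub-system's secrets rather than the whole store, and there is no single caller for whom "everything" is the default answer.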

Security Through Obscurity

The researchers were able to query a ‘certificates’ endpoint to fetch the secrets needed to intercept and gain access to hundreds of customer accounts. One aspect of this attack I’d call out is how easy it was for them to disassemble the WindowsAzureGuestAgent.exe file, and discover the certificate package format. This is something that software developers need to keep in mind as they develop robust software: the security of the system should never be contingent on attackers’ ignorance of system behavior or other knowledge (besides cryptographic secrets).

That’s all for today, folks! 🖖