Do you have some pointers where we can read more about weaknesses of ptrace syscall interception?
To me this seems like an improvement over having to worry about the full host syscall surface area.
Until nested hardware virtualization is broadly available, I can't run things like Clear Containers on major cloud vendors, so I'm pretty excited to have a way to increase the isolation between containers ... well, unless you can point me to something that shows all this is moot.
> To me this seems like an improvement over having to worry about the full host syscall surface area.
Seccomp already permits this type of attack surface restriction, and Docker (with runc) already has a default seccomp whitelist. So by default you already get this.
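For the curious, here's a minimal sketch of what a default-deny whitelist looks like with libseccomp. The syscall list is illustrative only; Docker's real default profile whitelists a few hundred syscalls:

```c
/* Minimal default-deny seccomp whitelist sketch using libseccomp.
 * Compile with: gcc filter.c -lseccomp
 * The allowed syscalls below are illustrative, not Docker's actual
 * default profile. */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <seccomp.h>

int main(void) {
    /* everything not explicitly allowed fails with EPERM */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));
    int allowed[] = { SCMP_SYS(read), SCMP_SYS(write),
                      SCMP_SYS(exit_group), SCMP_SYS(brk) };
    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++)
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, allowed[i], 0);
    seccomp_load(ctx);                 /* filter is now irrevocably in place */

    write(1, "still works\n", 12);     /* whitelisted */
    if (fork() < 0)                    /* clone/fork not whitelisted -> EPERM */
        perror("fork");
    return 0;
}
```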
> Do you have some pointers where we can read more about weaknesses of ptrace syscall interception?
The basic problem is that the policy runs in userspace and is thus more vulnerable than a kernel-side policy. It also has the downside that the syscall number and arguments are all you see: you get none of the contextual information the kernel has about a syscall (labels, namespaces, LSMs, access rights, what state the process is in, whether another process is doing something nasty, ...).
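To make that concrete, here's a minimal x86-64 sketch of ptrace-based syscall interception. The tracer is an ordinary userspace process, and all it gets at each stop is raw register state:

```c
/* Minimal x86-64 ptrace syscall interception sketch. */
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* let the parent trace us */
        execlp("ls", "ls", (char *)NULL);
        _exit(1);
    }
    int status;
    waitpid(child, &status, 0);                  /* child stops at execve */
    while (1) {
        /* resume until the next syscall boundary (this simple loop
         * stops at both syscall entry and exit) */
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);
        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            break;
        /* the "policy" would live right here, in plain userspace code;
         * all it can see is register state -- no namespaces, labels,
         * LSM context, ... */
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        fprintf(stderr, "syscall %llu\n", (unsigned long long)regs.orig_rax);
    }
    return 0;
}
```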
There's a reason UML didn't overtake KVM in virtualization: it had a worrying security model, since the only thing standing between a guest process and the host was another userspace process on the host intercepting its syscalls. Everyone I've talked to about UML has cited security as the main drawback.
> Seccomp already permits this type of attack surface restriction, and Docker (with runc) already has a default seccomp whitelist. So by default you already get this.
gVisor's doc addresses this with:
in practice it can be extremely difficult (if not impossible) to reliably define a policy for arbitrary, previously unknown applications, making this approach challenging to apply universally.
gVisor's Sentry process in fact uses seccomp to limit the syscalls it can make (and thus, in the worst case, what a guest process can do by tricking the Sentry). Furthermore, it uses an actual network filesystem protocol (good old 9p) to forward the rest of the file-oriented system calls to a separate process that executes them.
This arrangement shuffles the wide part of the kernel API surface into the per-container "proxy kernel", while requiring a very narrow (and controlled) API surface to the rest of the host.
This is pretty much the same kind of deal (although quantitatively and qualitatively different) that hardware virtualization employs: guest kernels have a very narrow API surface to the underlying hypervisor (and thus to the rest of the system).
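Not gVisor's actual code, but a toy sketch of the proxying idea: the sandboxed side never opens host files itself; it asks a separate process over a pre-created socketpair and gets the open fd back via SCM_RIGHTS (gVisor speaks 9p for this instead of the ad-hoc message below):

```c
/* Toy sketch of the Sentry/file-proxy split: the "sentry" side never calls
 * open(2) on the host; it sends a path to a "gofer" process and receives
 * the resulting fd back over a Unix socket via SCM_RIGHTS. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>

static void send_fd(int sock, int fd) {
    char dummy = 0;
    struct iovec iov = { &dummy, 1 };
    char ctl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = {0};
    msg.msg_iov = &iov; msg.msg_iovlen = 1;
    msg.msg_control = ctl; msg.msg_controllen = sizeof(ctl);
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET; c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(fd));
    sendmsg(sock, &msg, 0);
}

static int recv_fd(int sock) {
    char dummy;
    struct iovec iov = { &dummy, 1 };
    char ctl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = {0};
    msg.msg_iov = &iov; msg.msg_iovlen = 1;
    msg.msg_control = ctl; msg.msg_controllen = sizeof(ctl);
    if (recvmsg(sock, &msg, 0) <= 0) return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c) return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(fd));
    return fd;
}

int main(void) {
    int sv[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    if (fork() == 0) {          /* "gofer": the only side touching the host fs */
        close(sv[0]);
        char path[256];
        ssize_t n = read(sv[1], path, sizeof(path) - 1);
        path[n > 0 ? n : 0] = '\0';
        send_fd(sv[1], open(path, O_RDONLY));
        _exit(0);
    }
    close(sv[1]);               /* "sentry" side: ask the gofer to open a file */
    write(sv[0], "/etc/hostname", 13);
    int fd = recv_fd(sv[0]);
    char buf[64];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0) fwrite(buf, 1, n, stdout);
    wait(NULL);
    return 0;
}
```

The point of the split is that the sentry side can then be locked down with a seccomp filter that doesn't allow open() at all.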
> The basic problem is that the policy runs in userspace and is thus more vulnerable than a kernel-side policy.
Color me skeptical, but running things kernel-side doesn't strike me as necessarily less vulnerable or more trustworthy. The Linux kernel is quite a complicated beast with a very wide internal API surface, and despite its age it's still moving forward at quite an interesting pace.
There is a significant amount of research into running kernels with significant portions in user space (see the whole L4 family), and IIRC the problem has always been more about performance and adoption than any inherent problem of user space vs. kernel space.
> It also has the downside that you don't get any of the contextual information the kernel has about a syscall if you just know the syscall being called (such as labels or namespaces or LSMs or access rights or what state the process is in or whether another process is doing something nasty or ...).
Which in this case seems perfectly reasonable, since this is not a generic "transparent sandbox" that enhances the security of regular processes, but more of a "lightweight kernel" that runs them.
For example, imagine you have a good single-process sandbox (e.g. NaCl or https://pdos.csail.mit.edu/~baford/vm/) that is able to fully offer all necessary services to the logical guest process while only requiring a single TCP connection for all its input and output (through which you can e.g. run the 9p protocol and thus implement arbitrary I/O patterns with willing parties). It's easy to define a seccomp ruleset that enforces that the sandbox host does only this, as sketched below.
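Roughly what such a ruleset could look like with libseccomp, using seccomp's argument filtering to pin I/O to one pre-connected socket (the fd number 3 is an assumption of this sketch, and a real policy would need a few more syscalls, e.g. futex, mmap, sigreturn):

```c
/* Sketch of the ruleset described above: the sandbox host may only
 * read/write a single pre-connected socket (assumed to be fd 3) and
 * exit; every other syscall kills the process. */
#include <seccomp.h>

int install_filter(void) {
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
    if (!ctx) return -1;
    /* read/write allowed only when arg0 (the fd) equals 3 */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 1,
                     SCMP_A0(SCMP_CMP_EQ, 3));
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
                     SCMP_A0(SCMP_CMP_EQ, 3));
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    int rc = seccomp_load(ctx);
    seccomp_release(ctx);
    return rc;
}
```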
gVisor is something "like that", except it's able to execute unmodified Docker workloads.