Since I see that some of the developers are in this thread, I'll post my question to them here.
What are your plans to deal with the overhead and nastiness of ptrace? Beyond the performance losses, there's also the annoyance that you can't ptrace a single task twice, so no debuggers.
Are you familiar with the FlexSC paper? Have you considered using a FlexSC-like RPC interface (a secure one, of course) to achieve your syscall interception, instead of ptrace? That would allow you to not just match the performance of native system calls, but even theoretically exceed their performance, while still having the same level of control. (I have been working on such an approach, so I was excited to see this gvisor project posted - I hoped you might have already done this and saved me some work :))
Not sure how far this project can go if it sticks with ptrace...
To correct one misconception, the project is not bound to ptrace. There is a generic platform interface, and the repository also includes a KVM-based platform (wherein the Sentry acts as guest kernel and host VMM simultaneously) described in the README. The default platform is ptrace so that it works out of the box everywhere.
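If you want to try the KVM platform: the platform is selected with a runsc flag, so with Docker it comes down to registering a runtime entry whose runtimeArgs include --platform=kvm. Roughly like this in /etc/docker/daemon.json (the runtime name and path here are just examples; the README has the exact setup steps):

    {
      "runtimes": {
        "runsc-kvm": {
          "path": "/usr/local/bin/runsc",
          "runtimeArgs": ["--platform=kvm"]
        }
      }
    }

Then run containers with docker run --runtime=runsc-kvm as usual.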
> What are your plans to deal with the overhead and nastiness of ptrace? Beyond the performance losses, there's also the annoyance that you can't ptrace a single task twice, so no debuggers.
It's true that you can't trace the sandbox itself, and that's annoying, but you can still use ptrace inside the sandbox (ptrace is implemented by the Sentry). Just wanted to make sure that was clear.
> Are you familiar with the FlexSC paper? Have you considered using a FlexSC-like RPC interface (a secure one, of course) to achieve your syscall interception, instead of ptrace? That would allow you to not just match the performance of native system calls, but even theoretically exceed their performance, while still having the same level of control. (I have been working on such an approach, so I was excited to see this gvisor project posted - I hoped you might have already done this and saved me some work :))
I am familiar with FlexSC. There are certainly opportunities for improvement, including kernel hooks, shared regions for async system calls, etc. Given the pace at which this space is evolving, our priority was to share these pieces so that we can discuss things in the open. While I don't think we'll be able to save you work (sorry!), we're aiming for collaboration and cross-fertilization.
Good to have a gVisor developer on here. At Dropbox one way we use secure containers is to run machine learning models. Do you know if TensorFlow works in a gVisor container? How is GPU support in the container? If running on a CPU, are BLAS libraries supported to speed up matrix math in the container? Finally, do you know if OpenCV currently runs in gVisor containers?
Tensorflow security geek here: TF works in gVisor. I strongly recommend a VM solution when dealing with GPUs. Haven't tried GPU in gVisor because the size of the attack surface against the GPU device driver is so large. Depends on your level of paranoia, though.
For CPU, gVisor is great.
You can absolutely use BLAS libraries. Eigen (the TF default) works for sure. I don't see any reason MKL wouldn't, but I haven't personally tested it.
Sadly, at least without violating Nvidia's license agreement, you can't do that without a Tesla-series GPU. The consumer drivers won't let you put them in a VM using an IOMMU, which is the "right" way to do it.
Your main options are a Tesla P40, if you've got six grand to drop, which you can happily stuff in a VM and use the IOMMU to guarantee isolation for, or massively optimizing for your host CPU. Fortunately, with a lot of inference tasks, CPU isn't too bad. If you can get your batch sizes up, using MKL or MKL-DNN is a quite reasonable option on a decent Intel CPU. Installing TF with MKL-DNN is pretty easy these days, too (https://www.tensorflow.org/performance/performance_guide#ten...). That would honestly be my first try if you need serious isolation on a budget.
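(If you do go the MKL-DNN route, one common path is building TF from source with the MKL config enabled, roughly:

    bazel build --config=mkl -c opt //tensorflow/tools/pip_package:build_pip_package

though check the linked guide for the current flags; the install options and prebuilt Intel wheels change fairly often.)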
You can map a GPU into a general container, it's just... not very good security. Any flaw in the (very large, complex) nvidia binary blob will leave you exposed to potential container escapes (or outright root). This really depends on your threat model. Putting it in a container is better than handing someone the root ssh keys to your server, it's just not very satisfying if you're serious about security.
If you're just accepting untrusted input from a customer, e.g., CSV or images, and then running your own trusted model on it, you could half-a## it and do the format decode and some validation and initial processing in a gVisor sandbox, and then pass that to your own trusted process that has GPU access. It doesn't protect you against all exploits, and if you were doing it at Google, I'd tell you not to do that, but if you have less to lose it may be an acceptable middle ground. It's kinda complicated, though, and would incur a decent data-copy overhead.
This is one area where using a cloud hosted service is appealing, 'cause they've bought the datacenter GPUs or TPUs or whateverPUs for you and handled the isolation story. (Disclaimer - as you probably gathered, I also work part time at Google, frequently on tensorflow and cloudml security, but this is all my opinion.)
> The consumer drivers won't let you put them in a VM using an IOMMU, which is the "right" way to do it.
For Windows guests at least, the workaround is pretty easy. You just prevent KVM from identifying itself. I have a gaming machine with a GTX1080Ti running in KVM at home.
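For anyone wanting to do the same with libvirt, the bits that matter are hiding the KVM signature and (for newer Nvidia drivers) spoofing the Hyper-V vendor id in the domain XML, something like:

    <features>
      <hyperv>
        <vendor_id state='on' value='whatever'/>
      </hyperv>
      <kvm>
        <hidden state='on'/>
      </kvm>
    </features>

(The value string is arbitrary; on a raw QEMU command line the rough equivalent is -cpu host,kvm=off,hv_vendor_id=whatever.)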
> To correct one misconception, the project is not bound to ptrace. There is a generic platform interface, and the repository also includes a KVM-based platform (wherein the Sentry acts as guest kernel and host VMM simultaneously) described in the README. The default platform is ptrace so that it works out of the box everywhere.
Doesn't that have even worse perf than ptrace? The time I tried to do that, I ended up getting bitten by the roughly 4 context switches needed to get most anything done: guest -> host_kernel -> host_user_vmm -> host_kernel -> guest.
Or are you just saying that if/when some third option that doesn't suck as much comes out, you'd hopefully be able to switch to it transparently?
You're right that ptrace is slow and nasty. The key problem, I believe, is the tracer-tracee model: it involves two host processes, and the switch between them is asynchronous (PTRACE_SYSEMU followed by waitpid).
We do have the KVM platform, which offers a synchronous switch and performs better if you have bare-metal virtualization support.
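To make that concrete, here's a toy x86-64 tracer, nothing to do with gVisor's actual code, that shows the bare pattern: resume the tracee with PTRACE_SYSEMU, block in waitpid() until it reaches a syscall entry, read its registers, and take responsibility for emulating the call yourself. Every single syscall pays for a stop/wake round trip between two host processes.

    /* Toy PTRACE_SYSEMU interception loop (x86-64, Linux). Not gVisor code. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t child = fork();
        if (child == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL); /* let the parent trace us */
            raise(SIGSTOP);                        /* hand control to the tracer */
            getpid();                              /* a syscall to intercept */
            _exit(0);                              /* ...and another (exit_group) */
        }

        int status;
        waitpid(child, &status, 0);                /* wait for the initial SIGSTOP */

        for (;;) {
            /* Resume the child, stopping at the next syscall entry WITHOUT letting
             * the kernel run it; the tracer is expected to emulate the call. */
            ptrace(PTRACE_SYSEMU, child, NULL, NULL);
            waitpid(child, &status, 0);            /* the asynchronous part */
            if (!WIFSTOPPED(status))
                break;

            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            printf("intercepted syscall %llu\n", (unsigned long long)regs.orig_rax);

            /* A real tracer would emulate the call here and write results back
             * with PTRACE_SETREGS; this sketch just bails out when the child exits. */
            if (regs.orig_rax == SYS_exit_group || regs.orig_rax == SYS_exit) {
                kill(child, SIGKILL);
                waitpid(child, &status, 0);
                break;
            }
        }
        return 0;
    }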