Re #1 – at the rate the NSA is reported to be archiving data, do you _really_ th...

res0nat0r · on Aug 30, 2013

Yes. A previous employer of mine was apparently logging a PB of log data a day across all departments; that isn't something you just toss up and somehow have working in a month.

Do you have infrastructure in place to store all of that? How are you storing all of this content coherently so that you can go back and audit records of systems at a point in time you are concerned about? Do you have an internal Hadoop cluster configured (not trivial) and the hardware to ingest and process these logs? How are you actually logging everything? Do you have dedicated and qualified people to write this software and maintain it? When someone has su -'d to root, how are you logging everything they type coherently? Is your regex smart enough to ingest and contextualize multiline commands as root and add them to your report? What if no actual "bad" commands on your list were typed as root? What if I as a luser copied the "bad" commands to /tmp/kittens.txt, then su -'d to root and ran them there? Is this somehow captured? Are you 100% sure syslog is running 100% of the time on all hosts you are concerned about? Can't I as a sysadmin kill the syslog pid before I commit a crime? Are you using redundant syslog hosts? Are you using TCP and not the default UDP so that syslog doesn't drop packets under load?
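To make the /tmp/kittens.txt point concrete, here is a minimal sketch of a naive command blacklist. The patterns, the function name and the sample log lines are all hypothetical, and it assumes the audit log only records the command lines typed at the shell, not the contents of any script that gets run.

```python
import re

# Hypothetical blacklist of "bad" commands; these patterns are illustrative only.
BAD_COMMAND_PATTERNS = [
    re.compile(r"^\s*passwd\s+\S+"),
    re.compile(r"^\s*rm\s+-rf\s+/var/log\b"),
]


def flag_bad_commands(logged_lines):
    """Return the logged command lines that match any blacklist pattern."""
    return [
        line
        for line in logged_lines
        if any(pattern.search(line) for pattern in BAD_COMMAND_PATTERNS)
    ]


# Typed directly as root: the blacklist catches the password reset.
direct_session = ["passwd alice", "ls -l /home"]

# Same command staged in a file first: only the wrapper invocation is
# logged (by assumption), so nothing on the blacklist ever appears.
staged_session = ["su -", "sh /tmp/kittens.txt"]

print(flag_bad_commands(direct_session))
print(flag_bad_commands(staged_session))
```

The second session does exactly the same thing as the first, but the report comes back empty, which is why matching on typed commands alone doesn't get you very far.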

Like I said, this isn't as simple as everyone thinks...

bigiain · on Aug 30, 2013

Ballpark numbers - an employee typing at 100wpm for 40 hours a week generates not quite 70MB worth of keystrokes in a year - and I'd guess that estimate is probably something like 2 orders of magnitude too high - no one actually types 100wpm nonstop all day every single day at work. Even ignoring that, I've got enough disk space sitting under my TV to store (uncompressed) every keystroke typed in a year by something like 100,000 such mythical "100wpm typists working at 100% utilisation" employees. 1PB would, to a first approximation, store every single keypress that the rumored 4 million "top secret or above clearance" people in the US could produce typing flat out for something like 15 years.
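The arithmetic can be sketched as below. The assumptions are mine and are not spelled out above: 5 characters per word (the usual typing-test convention), 1 byte per keystroke, 52 working weeks a year, and decimal units (1TB = 10^12 bytes, 1PB = 10^15 bytes).

```python
# All constants below are assumptions for a back-of-envelope estimate.
WORDS_PER_MINUTE = 100
CHARS_PER_WORD = 5        # typing-test convention
BYTES_PER_KEYSTROKE = 1   # raw characters, no timestamps or metadata
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52

TERABYTE = 10**12
PETABYTE = 10**15


def bytes_per_typist_year():
    """Bytes of keystrokes one nonstop 100wpm typist produces in a work year."""
    minutes_per_year = HOURS_PER_WEEK * 60 * WEEKS_PER_YEAR
    return WORDS_PER_MINUTE * CHARS_PER_WORD * BYTES_PER_KEYSTROKE * minutes_per_year


per_typist = bytes_per_typist_year()

# How many typist-years fit, uncompressed, on 8TB of disk.
typists_on_8tb = (8 * TERABYTE) // per_typist

# How many years of typing by 4 million people fit in 1PB.
years_on_1pb = PETABYTE / (4_000_000 * per_typist)

print(f"{per_typist / 1e6:.1f} MB per typist-year")
print(f"{typists_on_8tb:,} typists per year on 8TB")
print(f"{years_on_1pb:.1f} years for 4 million typists on 1PB")
```

Under these assumptions it works out to about 62MB per typist-year and roughly 128,000 typists on 8TB, in line with the figures above. The 1PB case comes out closer to 4 years than 15 with raw, uncompressed keystrokes, so the 15-year figure would need something like 4:1 compression or a lower sustained typing rate on top. Either way the storage is small.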

So I don't think your "do you have the infrastructure" argument holds too much water here.

The "do you have internal Hadoop clusters" question falls the same way. Sure, I don't have that connected to my 8TB of external USB drives plugged into my media server - but if I worked at Google or Facebook or Twitter, I'd fully expect to be able to provision and spin up adequately sized clusters of VMs and storage to effectively consume and run reports on data that size and bigger. And it's surely not just the NSA and Google/Facebook/Twitter routinely dealing with collecting and processing data at that scale - any decent-sized telco, any non-trivial web analytics service, any large financial institution, every HFT business, most bio-med businesses, probably every physics and astronomy department at any university – there must be tens of thousands of businesses routinely dealing with data sets of that size.

It's not simple - certainly not simple enough for me to do it on a Mac Mini and a bunch of external hard drives – but I also don't think it's anything like "uncharted waters" territory. (I'm pretty sure I could find the expertise required among my first-level LinkedIn connections, and have absolutely no doubt I'd be able to manage designing, developing and deploying exactly such a system if someone came to me with a high six-figure or low seven-figure budget.)

mirkules · on Aug 30, 2013

I think your approximations are focusing solely on keystrokes, whereas the parent specified just "data", which makes me believe that network traffic, application logs, etc. are included in this. I can believe the 1PB/day number; it's not that far-fetched when you consider the above.

I also agree with the parent. There are so many possible scenarios and things to log that eventually you're playing a "logging" version of whack-a-mole. Even just managing these files (as the parent talks about) is no trivial task. Honestly, I wouldn't even know how to begin managing a petabyte's worth of daily data.
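For a sense of scale, here is a rough sketch of what 1PB/day means as a sustained rate. It assumes a decimal petabyte (10^15 bytes), ingest spread evenly over 24 hours, and 4TB drives with no replication. The drive size is just a number I picked for illustration.

```python
# Back-of-envelope only; every constant here is an assumption.
PETABYTE = 10**15
SECONDS_PER_DAY = 24 * 60 * 60
DRIVE_BYTES = 4 * 10**12  # hypothetical 4TB drive


def sustained_gigabytes_per_second(bytes_per_day):
    """Average ingest rate in GB/s if the data arrives evenly all day."""
    return bytes_per_day / SECONDS_PER_DAY / 1e9


gb_per_second = sustained_gigabytes_per_second(PETABYTE)
gbit_per_second = gb_per_second * 8

# Drives filled per day with a single, unreplicated copy.
drives_per_day = PETABYTE // DRIVE_BYTES

print(f"{gb_per_second:.1f} GB/s, {gbit_per_second:.1f} Gbit/s sustained")
print(f"{drives_per_day} drives filled per day")
```

That is roughly 11.6GB/s around the clock and 250 drives a day before any redundancy, which gives some idea why just managing the files is its own problem.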

bonzoesc · on Aug 30, 2013

Spying on all of the email isn't simple either. When there's a will and a bottomless font of taxpayer money in the name of national security, there's a way.

MichaelGG · on Aug 30, 2013

First, you're tossing 100% around way too much - security in this environment is a pretty huge sliding scale. But for something as basic as password resets, yes, I'd expect 100% functionality and auditing to catch it.

Second, I don't think anyone thinks it's totally trivial. But, this is the NSA we're discussing. Supposedly, their secrets are so special, the USA will collapse or something if they're compromised. If any organization in the world is set to handle an admin resetting people's passwords, it's the NSA.

Are you really arguing that it's just too hard for the NSA to notice a massive violation of policy?

res0nat0r · on Aug 30, 2013

Sysadmins are supposed to reset passwords... or I'm sure there is some central auth service in place that grants access to the hosts a person has been approved for.

The people you trust to admin systems will easily be able to abuse their power, and it is very hard to stop them from doing so without making their job so cumbersome as to be nearly impossible.