Update on 1/28 service outage (github.com/blog)
179 points by traviskuhl on Jan 29, 2016 | 186 comments


Yesterday I was being a bit of an ass to a few people about how "the whole point of using git is so that we can do decentralized code management, so why are these dependencies being pulled from our private GitHub if they could be sent point to point yadda yadda yadda". Then they proceeded to go over the list of package managers and dependencies we used and I had to shut up. Even though we host our own Docker Hub and package managers (we do), if you dig far enough, you can find some dependency of a dependency of a dependency that relies on GitHub. Brew/npm/build script/whatever. It is crazy how everything has changed so much in the past few years. GitHub went from something that was really nice to have to a core requirement for complex systems that rely heavily on open source.


If this is a problem worth solving, you can absolutely solve it. The easiest would be through the use of a caching proxy and/or load balancing system.

The caching proxy system could be as simple as setting up a squid cache for apt. Multiple projects exist which do this already.

The load balancing system would involve keeping a private mirror of every repository in the dependency graph, and falling back to the mirror when GitHub fails. To automate this, proxy all git requests. If GitHub is up, let the request pass through. If no mirror exists for the repository, create one. If GitHub fails, fall back to the mirror.
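
A minimal sketch of that fallback, assuming a hypothetical internal mirror path (/srv/mirrors) and repo name (org/project):

    # Create the mirror on first use, refresh it afterwards.
    git clone --mirror https://github.com/org/project.git /srv/mirrors/org/project.git \
        || git --git-dir=/srv/mirrors/org/project.git remote update --prune

    # Consumers try GitHub first and fall back to the local mirror if it is unreachable.
    git clone https://github.com/org/project.git project \
        || git clone /srv/mirrors/org/project.git project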


> The easiest would be through the use of a caching proxy and/or load balancing system.

Sorry if I sound like an asshole, but before we actually use the word "easiest", can you please share with us how to do all of that? It is not as easy as you claim, to be honest. Not just some /etc/hosts hijack.


They didn't say it was easy, just that it was easiest. Ostensibly comparing it to other solutions that they are aware of.


For the load balancing system I would assume you would also want to keep the mirrors up to date? So for every request, if a mirror exists and is out of date, update it.


Presumably that would be the responsibility of the mirroring system once a mirror is initiated.
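
Something as small as a cron entry can handle that refresh (paths hypothetical); --prune also drops refs deleted upstream so the mirror stays an exact copy:

    # Refresh the mirror every 10 minutes.
    */10 * * * * git --git-dir=/srv/mirrors/org/project.git remote update --prune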


The package system for the Rust language actually relies on GitHub, as many found out during the outage. I don't know if that will change; it probably will, with a read copy on a different git service. But I thought it was interesting, because I use GitHub for everything save a few private projects, as I imagine most do. I'm not sure what to think of this; it seems backwards and grossly incompetent, yet here we are using it almost exclusively. It might be smart to decentralize some of this with torrents, if that's possible. Even if it was just the read portion of a repository, it seems like something to consider, if it hasn't been already.


> The package system for the Rust language actually relies on GitHub, as many found out during the outage.

This is not quite correct (although close to it). Cargo doesn't rely on GitHub, but it expects that there is some publicly-accessible git repository from which it can pull the source for any crate, and most crates use GitHub. So it's not a particular choice of Cargo, but a side-effect of GitHub's popularity in the community, and the fact that Cargo does not host source code itself.


You're mixing up Cargo features and the crates.io package system. Cargo does allow git dependencies, but primarily you're supposed to use versioned crates.io packages, which are indeed served by crates.io (even if they are actually hosted on S3 or whatever), not GitHub.


The crates.io index is in a GitHub repository. Does Cargo fetch it directly from GitHub or from a set of redundant mirrors?


If crates.io hosted the content themselves, it would just be the same problem, only with a service potentially less reliable than GitHub.


Unless you set up a system of mirrors. There are plenty of examples[1] to draw from.

1: http://mirrors.cpan.org/


Not if they acted like a mirror. Put it on GitHub, Bitbucket, and crates.io


Not just the Rust language: to the best of my knowledge, even Packagist, the PHP package manager, relies heavily on GitHub for sourcing its packages. But I think they have other sources too, apart from GitHub.


Ruby's bundler doesn't entirely rely on Github, but pulling from a Github repo is a supported option that many take advantage of.


Rust's package manager doesn't source packages from GitHub (though it will pull packages from a git repo if you ask it to), the source for its index of packages is a git repository on GitHub. https://github.com/rust-lang/crates.io-index
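
Since the index is itself an ordinary git repository, keeping a local copy as a fallback is one command (destination path hypothetical); pointing Cargo at such a copy takes extra configuration, but the clone alone already gives you something to recover from:

    git clone --mirror https://github.com/rust-lang/crates.io-index.git /srv/mirrors/crates.io-index.git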


if you dig far enough, you can find some dependency of a dependency of dependency that relies on GitHub. Brew/npm/build script/whatever

But really, why?

Is it just institutional laziness on the part of all developers? We had reliable rsync CPAN mirrors in 1995. In the early days of the Internet, companies would mutually host secondary DNS for each other to be more reliable. For some reason, we've forgotten all about reliability and disaster recovery and geographical distribution. Now the collective programmer mindset with regards to global infrastructure seems to be "lol, we're too dumb to make things work, let's just outsource everything to closed source, for-profit companies and hope for the best."
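
For comparison, the kind of mirroring being described is a one-liner with rsync (mirror host and local path hypothetical):

    # Pull a full CPAN replica; --delete keeps the local copy exact.
    rsync -av --delete rsync://cpan.mirror.example.org/CPAN/ /srv/mirrors/CPAN/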


I think a large part of this is that cloud hosting has allowed us to abstract those problems - reliability, disaster recovery, geographical distribution - away, and we don't really think of computers as computers anymore. It's a service or a platform or what have you, and the expectation is that it will always be there. I wouldn't say this is laziness, just a byproduct of changing how we view Internet architecture. We used to build systems to take care of reliability etc. because everyone had those problems. Now, those are only things you'll experience if you host your own stuff, or work for one of the big providers. (Broad assertion, I know, but I think it's mostly true)


One has to keep in mind that there is no cloud. It's just someone else's computer.


Except that it is not. It's a redundant array of computers: if one goes down, another takes its place and all the apps running on it are migrated to the new hardware. And if the whole zone goes down, the apps are migrated to a different zone. If the whole region goes down, the apps can be migrated to a different region. The 9s are so high that you don't have to worry about hardware issues anymore, unlike when you are running your own hardware.


That's the theory (or the marketing pitch, depending upon perspective).

The reality can be rather different[1][2][3].

1. http://money.cnn.com/2011/04/21/technology/amazon_server_out...

2. http://www.zdnet.com/article/amazon-web-services-suffers-out...

3. http://www.theregister.co.uk/2015/09/20/aws_database_outage/


Or it could be literally an old desktop computer sitting in someone's damp basement on a DSL connection. The problem with just saying "the cloud" is you can't tell the difference.


Generally when people say the cloud, they mean one of the big Public/Private cloud providers, not someone's basement.



Exactly right, but over the past six years there's been a strong (and accelerating) trend among developers of "lalala we don't want to know how anything works! give us an API and go away."

Most developers I've seen reject even learning about networks or DNS or operating systems or databases. Such willful ignorance boggles the mind, but they are praised because their goals are shipping half-broken things as rapidly as possible to flip upwards for those oh-so-tasty acquihire payouts.

We even saw this week how overconsumption of convenience APIs can put entire companies in danger when those privately controlled convenience APIs just decide to shut down one day. Convenience of immediacy always seems to trump convenience of long term stability.


>Exactly right, but over the past six years there's been a strong (and accelerating) trend among developers of "lalala we don't want to know how anything works! give us an API and go away."

I will argue that this trend has always existed. I'm sure you can find an x86/68k/z80 developer complaining that developers are going "lalala we don't want to know how anything works! give us the C language and go away."

I'm sure there are developers who couldn't imagine learning C without learning x86, and saw developers learning C without learning x86 as "willful ignorance".

Good abstractions will cause developers to simply gloss over how they work.


As programmers, we need to know at least one level below the abstractions we are programming to. For example, if you program in C you need to know a little bit of assembly, how objects are laid out in memory, etc. This is how you write fast code, and it helps with debugging too.

But if you are programming in C and notice that something goes wrong with the hardware (for example, an instruction does not do what it is supposed to do), you will have to ask for help, since it is someone else's work that is faulty. Sounds reasonable?


At least one level below. Hmm. That sounds more reasonable than full understanding. See my reply to nemo, though, for an alternative that I think is more reasonable. Basically, heuristics and simplified models.


Or work safely, effectively, and productively when taught how to properly use the abstractions. They can optionally be taught how they work underneath for better results. Yet, I don't have to teach people caches to tell them to group variables closely for performance. I likewise can give very basic explanations of stacks and heaps plus heuristics for using them. People still get the job done.

Functional programming proves my point even more where they don't know how the hardware functions or even use the same model. Yet, with good compiler and language design, they can make robust, fast, and recently parallel programs staying totally within their model. Most problems we pick up outside the abstraction gaps can be fixed in the tooling or with interface checks.

So, I think the common perception of people doing crap code while working within an abstraction is unjustified and even disproven by good practices in that area. Much like I would be unjustified in accusing assembly coders of being "willfully ignorant" or working within foolish abstractions because they didn't know underlying microprogramming or RTL. They don't need it: just knowledge of how to effectively use the assembly. Actually, I saw one commenting so let me go try that real quickly. :)


couldn't imagine learning C without learning x86

One difference: C->x86 is a static translation layer. Other network/system things dynamically change out from under your "designed" system and alter threat/security/disaster/reliability/consistency models in a potentially unpredictable combinational fashion.

Saying "cloud abstraction" or "I trust this API and don't care how it works" is basically committing every https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu... and just saying "X can't break because we use provider Y who guarantees they can violate the laws of physics for us!"


The good reverend Laphroaig preaches:

If the 0day in your familiar pastures dwindles, despair not! Rather, bestir yourself to where programmers are led astray from the sacred Assembly, neither understanding what their programming languages compile to, nor asking to see how their data is stored or transmitted in the true bits of the wire. For those who follow their computation through the layers shall gain 0day and pwn, and those who say “we trust in our APIs, in our proofs, and in our memory models and need not burden ourselves with confusing engineering detail that has no scientific value anyhow” shall surely provide an abundance of 0day and pwnage sufficient for all of us.


An assembler elitist with a semi-fallacious argument. Let's rewrite that in view of a lower-level elitist to show it still looks true, shows love for assembler as foolish pride, and still fails to matter in face of good, high-level tools.

If the 0day in your familiar pastures dwindles, despair not! Rather, bestir yourself to where programmers are led astray from the sacred RTL/Transistor language, neither understanding what their assembly languages and microprograms compile to, nor asking to see how their data is stored or transmitted in the true bits of the CPU's network-on-a-chip and memory plus analog values and circuitry many run through at interfaces. For those who follow their computation through the layers shall gain 0day and pwn, and those who say “we trust in our assemblers, our C compilers, our APIs, in our proofs, and in our memory models and ISA models and need not burden ourselves with confusing engineering detail that has no scientific value anyhow” shall surely provide an abundance of 0day and pwnage sufficient for all of us.

Source: LISP, Forth, and Oberon communities who did hardware to microcode to language & OS all integrated & consistent. :P


I surmise the good reverend elevates Assembly not because it is fundamental, but because it is a level deeper than the domain of coders who yield unto us exploitable codes. Verily, I demand of ye, produceth thou the exploit of an Assembly 0day that was wrought from Transistor language!


Rowhammer. :P


Haha. If you look up the rowhammer exploit on wikipedia, the example is in assembly ;)


I thought you'd hit me with the Javascript one but yeah lol.


Isn't this the nature of abstraction though? As the high level tools get increasingly powerful at solving common problems people will invest less in learning their underlying implementations.

I'm sure all the assembly programmers were complaining that the C programmers had no respect for "how anything works".


I mean, a really simple solution (simple to say, maybe not to do) would be for package managers to require a "backup" repository on a different domain; then if you get a 500 error, try the second remote repository. Use git for its advantages.


I think you mean a mirror, and many package managers use them.


And people give me shit when I argue that open source projects should include 100% of dependencies.


I think that's a bit crazy as well. This is a problem if your build process happens often and requires pulling external data. Ideally, you want a way to cache that external data, and a way to force invalidation of that cache.

Building, at least after the first time, should not require external access. There are security reasons for this as well.


So your proposed solution is one of the only two hard problems in computer science? That should be a solid clue that you're wrong.

"There are only two hard things in Computer Science: cache invalidation and naming things."

-- Phil Karlton


By "a way to force invalidation of that cache" I didn't mean automatic invalidation, I meant a way to flag that you want it to re-download dependencies and store them for later use. I'm not sure where you got the requirement that it needs to automatically determined by a computer from my comment. I was thinking the "cache" could be as simple as the person setting up the build environment downloading the dependencies and configuring the build to use them. That's a local cache, when discussing automatic downloading of dependencies during building.

Set up your build environment with whatever manual intervention is required so that it can run without downloading remote resources. Build as needed. There is no reason for, and many reasons against, downloading dependencies during the build process, but that doesn't necessitate duplicating those dependencies within your own source tree. As long as there are directions on how to download a specific, definitive version of the dependency, whether that is automated or not isn't really a big deal if it's done infrequently.
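
A sketch of that kind of setup step, with names, URLs, and checksum file purely hypothetical; it runs once (or whenever dependencies change), and the build itself never touches the network:

    # setup-deps.sh: fetch pinned dependencies outside the build.
    set -eu
    mkdir -p vendor
    curl -fsSL -o vendor/libfoo-1.2.3.tar.gz \
        https://downloads.example.com/libfoo/libfoo-1.2.3.tar.gz
    # Verify against checksums committed to the repo so the cached copy is trustworthy.
    sha256sum -c checksums.txt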


wow, i never realized cache invalidation was one of the ONLY two hard problems in CS


The quote is supposed to be, two hard problems: cache invalidation, naming conventions and off-by-one errors.


It's not: if you cannot even run that first build, then you actually have nothing to work on.

Also, unfrozen dependencies mean you are at the mercy of any dependency change breaking your build at any time.

With that, even if your first build runs, fetches those deps, and builds at T1, it is not guaranteed at all that the build will work at T1+n.

There is a big difference between your team working from trunk and your team being dependent on other projects' trunks.


Just because you're downloading your dependencies at runtime doesn't mean you have to have non-frozen dependencies or non-repeatable builds... that's one of the advantages of pulling dependencies out of a Git repository; specify a specific revision to build against and that code is guaranteed* to not change. Pulling dependencies from Git doesn't mean you're working against trunk.
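
A sketch of what that pinning can look like with submodules (path and SHA hypothetical): the exact commit gets recorded in your own repository, so every build sees the same code.

    # Add the dependency once, then pin it to an exact commit.
    git submodule add https://github.com/org/libfoo.git deps/libfoo
    git -C deps/libfoo checkout 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b
    git add deps/libfoo && git commit -m "Pin libfoo to a known-good revision"

    # A clean checkout later restores exactly the pinned revision.
    git submodule update --init --recursive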

Now, if you're doing this with mission-critical software, you should probably be maintaining mirrors of those dependencies locally on infrastructure you control, but, again, that's another of the things that Git makes easy.

You should never be dependent on a reference that can move, unless you're willing to accept the consequences (that includes branches in any version control system, tags if you don't have infrastructure to verify that they haven't changed, external non-version-controlled downloads, etc.).

Basically, what you should learn here is that you shouldn't build your business around a third-party service's continued availability. Especially if it's a third-party service where you're not paying for an SLA, like Github. Reproducibility of builds is a different issue, and including 100% of your dependencies in your own source repository is not the only solution to it.

* Barring a SHA-1 collision, which is highly unlikely with Git.


> It's not: if you cannot even run that first build, then you actually have nothing to work on.

Obviously you can run the first build. You wouldn't be using Github if you never got it working in the first place.

To clarify, setting up the build environment may require network access, but if the process of building requires it, there are many places where it can go wrong, both operationally and security wise.

> Also, unfrozen dependencies mean you are at the mercy of any dependency change breaking your build at any time. ...

I agree, but that's a separate discussion and doesn't really apply here. There's nothing preventing the pulling of a specifically tagged version for builds. If someone's build process that used Git for dependencies is not doing this, whether they are using Github or some internal server is irrelevant, the same problems apply.


how far down the stack do you go? do open source projects need to include their own compiler? what would compile it?


I suggested how far they need to go in context of Debian's reproducible builds posts:

https://news.ycombinator.com/item?id=10182282

That would solve readability, plenty of subversion, verifiability, much of portability, and perform anywhere from OK to good. Not going to happen but academics and proprietary software already did it to varying degrees. As post noted, traceability & verification from requirements to specs to code to object code is a requirement for high assurance systems. My methods, mostly borrowed from better researchers, are the easiest ones to use.


I don't have to bootstrap anything that my distro is already shipping. If I'm using GCC, my .spec file has a BuildRequires tag that tells rpmbuild to make sure an acceptable version is present (from my RPM mirror).

If I'm using some obscure tool that my distro doesn't package, that's when I mirror the version I'm using, and build my own RPM from source if it needs to be deployed to prod servers rather than merely run from rpmbuild.


* source code

* static libraries

* dynamic libraries

Provide compiled libs for the platforms of your choice. Preferably all three of Windows, OS X, and Linux. Users can issue pull requests if there is a platform or variant they wish to add.


Same here, but I don't care: all deps HAVE TO be in the repo, period.

In fact I go further than that: anything that a project depends on HAS TO be "saved" somehow, somewhere. Use a special commercial tool? Save it. Use some particular OS? Save the ISO. Need a particular version of a compiler/SDK? Have an installer ready, etc.

But nowadays it seems devs program temporary stuff meant to last just a few months.


If you personally run software for which reliability is important, absolutely you should maintain your own vendor repos. Open source projects are not in that position, and following your advice would lead to much harmful coupling and repetition.


That's a good point. I've been ignoring learning Git as long as I can but almost everything on my todo list heavily uses it. Or ties into it as you said. So, I'm going to have to bite the bullet and learn it.

Yet, I swore Git fans told me its decentralized design avoids single points of failure: everyone has a copy and can still work when a node is down, just not necessarily coordinate or sync in a straightforward way. This situation makes me think, either for Git or just GitHub, there's some gap between the ideal they described and how things work in practice. I mean, even CVS or Subversion repos on high availability systems didn't have 2 hours of downtime in my experience.

When I pick up Git/Github, I think I'll implement a way to constantly pull anything from Git projects into local repos and copies. Probably non-Git copies as a backup. I used to also use append-only storage for changes in potentially buggy or malicious services. Sounds like that might be a good idea, too, to prevent some issues.


I'm sorry to be rude, but, it sounds like you should go learn Git and come back to this conversation.

The decentralized design does avoid single points of failures, and everyone does have a copy. So - check, check, great. Unfortunately (maybe..) everyone has put their master repos in the same place, which somewhat counteracts the decentralization. But there is certainly no immediate coupling between the Git repository on your computer and the Github repository it's pulling from. It's not like Github being down in any way prevents you from working on code you've already checked out, unless you need to go check out more code.

(The same obviously may not be true for package managers and build scripts that are not running in isolation from your upstream repository, which is where the problems have arisen.)


"I'm sorry to be rude, but, it sounds like you should go learn Git and come back to this conversation."

It looks like it.

"The decentralized design does avoid single points of failures, and everyone does have a copy. "

So, like many decentralized systems I've used, a master node gets worked around by other nodes who communicate in another way? Or would some retarded situation be possible where...

"Unfortunately (maybe..) everyone has put their master repos in the same place, which somewhat counteracts the decentralization."

...one node going down could prevent collaboration? Oh, you answered that. That sounds better than CVS but shit by distributed systems standards. I'll still learn it anyway since everyone is using it. Probably in the next week or two.


No, it's not the same as a distributed system with master/slave nodes. The child nodes can function entirely in isolation from the parent. If you wanted to, you could treat another coworker's node as your master and download/upload to that. It's usually easier to have a tree structure where the root is your master repo, its children are your build servers or whatever, and the leaves are development machines. But that's entirely reconfigurable.

It's not surprising at all that if you make a master repo at the root of the tree, and it goes down, then you can't communicate with it. But it doesn't prohibit any communication between other nodes, or re-wiring the tree, and it definitely doesn't inherently block development work on any of the other nodes.

It just so happens, though, that people's build scripts and package managers like to refresh packages from the root and don't handle failure modes of that operation very well. That's the only place problems emerge - besides the obvious fact that if your public releases of software go through the root, and the root is down, then you can't release until it's up. But you could easily make a new root if you wanted to.


"It just so happens, though, that people's build scripts and package managers like to refresh packages from the root and don't handle failures modes of that operation very well. "

That's the critical part. So, countering this risk is apparently a manual thing if one uses off-the-shelf tooling for Git. I'll just have to remember to look at that if I do a deployment. Put it on a checklist or something.


>So, countering this risk is apparently a manual thing if one uses off-the-shelf tooling for Git.

Not so much off-the-shelf tooling for Git; it's more off-the-shelf tooling for Node/Ruby/Go/Rust/PHP.

Nothing about Node's npm really requires it to depend on a single GitHub; in fact I think you can use any Git repo. It's just that most tend to use a single Git repo, and there is no way to configure mirrors.


Thanks for the extra detail.

"and there is no way to configure mirrors."

Is that in Git itself or the project-specific tooling you're mentioning?


There is no way to configure mirrors with the project-specific tooling (AFAIK).

Git (like most other DVCSes) supports mirroring. For example Linux, hosted on Github, (https://github.com/torvalds/linux/commits/master) is also mirrored and hosted on kernel.org (https://git.kernel.org/cgit/). Or, the Apache projects (https://github.com/apache/cassandra), which are also hosted on apache.org (https://git-wip-us.apache.org/repos/asf?p=cassandra.git). Generally when commits are merged with upstream, they are mirrored to all other hosts.

The tools, however, are generally configured with only the GitHub address (or the author of the tool only publishes to GitHub), and the tools (unlike, say, Perl's CPAN) don't offer to maintain mirrors of the libraries published. So when GitHub is down, a tool like npm will give up, even though the author could have another git repo hosted elsewhere.
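
One partial workaround at the git level (mirror URL hypothetical) is to rewrite GitHub URLs to an internal mirror for every tool that shells out to git:

    # Any clone/fetch of github.com transparently hits the internal mirror instead.
    git config --global url."https://git.mirror.internal/github/".insteadOf "https://github.com/"

That only helps tooling that goes through git itself; anything hitting the GitHub API or downloading release tarballs is unaffected.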


> For example Linux, hosted on Github, (https://github.com/torvalds/linux/commits/master) is also mirrored and hosted on kernel.org (https://git.kernel.org/cgit/).

It's the opposite: Linux is hosted on kernel.org, and the mirror on github.com is just something that was created during a kernel.org outage. The canonical address is the kernel.org one.

(The Linux repository on kernel.org, by the way, is one of the oldest git repositories; IIRC, it was created when git was only a few weeks old.)


So, the protocol is definitely good enough to handle situations like this; it's just commonly applied that way, especially with many GitHub-hosted projects. Gotcha. That makes sense.


Git is very flexible and does not even require repo-to-repo communication over the wire at all; patches can be emailed among contributors and then committed and tracked locally. Branching and merging is so fast and easy in git that every participant can have a slightly different repo for a given project, incorporating shared changes as they see fit.

Github is popular because it is opinionated--it chooses to use git in certain ways, thus reducing the complexity for people who aren't git experts (i.e. most people).

The most sophisticated users of git--the Linux and git projects, probably--do not rely on github at all. As far as I know, they share code via emailed patches. Some of those developers might not even be using git at all! They just send patches upstream, and the upstream developer checks the patches into their local git repo and then preps a larger patch to be emailed farther upstream.


That's pretty wild. Sounds like the main program/protocol is very true to the UNIX philosophy of tooling. My early reads on it suggested that gave it both its power/versatility and horrific UI consequences for beginners. An opinionated UI and host like GitHub is a natural consequence.

I remember thinking in my early reading that git was like an assembly language for build systems. It really needed a front-end of some kind to smooth things over for new and casual users. Maybe not as heavyweight as Github but better than the main program. Can keep the low-level stuff in for advanced users.

Was that or is that still a common assessment or was my initial impression off?


There have been a number of attempts to build a more friendly front-end for git:

https://git-scm.com/download/gui/linux

Github (which provides a desktop app in addition to their website) is by far the most successful one, I think because they define a whole simplified and social experience, not just a client.

To me it seems like people seem to segment into two camps: those who want to do the basics (they tend to use GitHub), and those who want to use the full power of git (they tend to use the CLI).


This is a social problem, not a technical one.


It's a pebkac issue. The software is fully capable of having multiple remotes, but it's rarely used that way.


Is there an easy config for that? Suppose I want to push to eg github and bitbucket (without sharing my creds with ifttt or similar)? Is a post-receive hook on a local pseudo-master the way to go?


See, for example, here: http://stackoverflow.com/questions/14290113/git-pushing-code...

    git remote set-url --add --push origin git://original/repo.git
    git remote set-url --add --push origin git://another/repo.git
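
With both push URLs added to origin like that, a plain `git push origin` goes out to both hosts, while fetches still come from the original URL.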


Lol. Nicely put.


Git works as advertised, but when all your build processes start with a sync from the upstream master (the equivalent of "svn up"), and a lot of build scripts require that to work, then you've thrown away that advantage when building.

Everyone with a checked out repo should have been able to develop and commit, branch and merge locally fine though.


Thanks for the clarification. This is the exact sort of thing I was wondering about.


> either for Git or just Github, there's some gap between the ideal they described and how things work in practice

The hub-spoke topology is the easiest way of distributing source code to a lot of people. If the hub goes down, this is what happens. If that leads to a halt in productivity, then that is a failure in contingency planning. Git gives you many tools to distribute your workflow, but that won't save you if your workflow is centralized around Github.

Granted, sometimes you don't really have a choice whether to depend on Github, such as when working with language package managers. Perhaps that goes to show that mirroring and resiliency should be a design consideration in those tools, but it's not a shortcoming of Git itself.

> even CVS or Subversion repos on high availability systems didn't have 2 hours of downtime

It's easier than ever to have HA with a DVCS: clone the repository somewhere else and keep it in sync with commit hooks.

Large FOSS projects (should) do this by keeping a self hosted repository, and mirroring somewhere else like Github, Bitbucket, etc. Internally, an org should be able to quickly stand up a SSH or HTTP server for the purpose, or have collaborators push-pull directly from each other. Worst case? Send patches. Git apply works really well, and you might be surprised at how clever git-merge is when everyone finally syncs up.

That's what it means to be distributed: there is no real concept of a "central" node, unlike Subversion. Every local checkout has a full copy of the repository history. Any centralization is a (somewhat understandable) incidental artifact of how Git is being used.
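
A minimal sketch of the commit-hook approach, with the standby host hypothetical: a post-receive hook on the primary that propagates every accepted push to the mirror.

    #!/bin/sh
    # hooks/post-receive on the primary repository.
    # --mirror keeps all branches and tags on the standby in sync.
    git push --mirror git@standby.example.com:org/project.git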


Makes sense. I'll try to remember that for my future checklist. Thanks for the details. Btw, your site is down on my end from 2 browsers on my desktop and one on mobile. Might want to look into that as the rest are working.


> Btw, your site is down on my end

Hah, because it's been defunct for a while now. Thanks for the reminder, removed it from my profile.


Cool


> I used to also use append-only storage for changes in potentially buggy or malicious services. Sounds like that might be a good idea, too, to prevent some issues.

In a certain sense, git is "append-only". If you change a commit in history, every descendant commit will have its SHA hash changed. Naturally this will conflict with other copies of the repository.

For backups you should do a "git clone --bare", which copies the internal git structure with all data and history, but not a working copy of the actual files.


I figure it's append only at protocol level. Usually a smart idea for SCM. Is that still true when the whole datacenter goes down in mid-operation? Typically varies from implementation to implementation of the concept.


Git is to GitHub as JavaScript is to Java. Though their names are similar they are very different things.


git != github


Hence Git/Github in my comment. I already know there's a difference. I just don't know much more than that until I learn the two.


Github is to git as Sourceforge is (used to be) to subversion, but with a better UI.

And yes, there have been concerns raised about what would happen if Github took a turn like Sourceforge, which usually get brought up when information about new shady practices at Sourceforge comes up (or they get rehashed here).


Makes sense. I'm quite interested in seeing where it goes over time. I think it will depend a lot on the nature of the company. If it's VC-funded & aiming for acquisition, then there's a decent chance of Sourceforge history repeating. Otherwise, it might stick around as a beneficial ecosystem. Time will tell.


If you understand the difference between the two, you'd realize your comment makes no sense. The fact that github went down due to a power failure has nothing to do with git as a solution.

The fact that everyone uses git more or less the same as svn is the problem. Git is decentralized, but because so many people rely on github most don't ever use the decentralized aspect to it.


If you understood my comment, you'd know I don't understand the differences between the two that much, since I haven't studied them yet. I've been clear in a few comments on that. The reason I associate them here is that most projects I see don't just use Git: they use Github, too. So, I briefly wondered and got feedback about how inherent Github-style downtime was, or whether it was a configuration/deployment issue.

Several commenters helpfully described how Git can easily prevent stuff like this and that project-level stuff is why this is a liability. That's good to know as it's already a selling point to management types for a solution like it. Can just ensure the problem doesn't show up in a local deployment by a wiser configuration.


I understood your comment just fine, but the opinion you had formed was based on false assumptions, so I was trying to correct it, that's all.

Personally I try not to form strong opinions about things I haven't actually learned or understood yet.


All good haha


Am I the only one who is a little shocked that a power outage could have such a huge effect and bring them down for so long? I'm not an infrastructure guy, and I don't know anything about Github's systems, but aren't data center power outages pretty much exactly the kind of thing you plan for with multi-region failover and whatnot? Is it actually frighteningly easy for this kind of thing to happen despite following best practices? Or is it more likely that there's more to the story than what they're sharing now?


I am not at all surprised. There are 'best practices' and then there is what really happens based on business processes and needs. In reality, even the most cloudy of cloud providers will run into this problem at some point. Folks often come up with ideas of implementing something like Chaos Monkey in their data center, then realize the actual impact it will have and find it is almost impossible to get the rest of the business to agree to the concept. It isn't as easy as it sounds. I only know of two businesses that have actually implemented Chaos Monkey, one being the company that coined the term. Even regular reboots won't catch these problems, and if folks were honest, you would see 1+ year uptimes on most servers in most places. That is just based on my experiences and I am sure some of you have seen different.


The problem is most environments are very heterogeneous. I evaluated the Chaos Monkey approach for a big bank; the issue is that Netflix has whole data centres full of loads of machines doing pretty much the same thing, streaming and serving.

And the worst that can happen is a customer's stream stops and they have to restart it.

But in most big companies you have thousands of apps that are all doing very different things. Perhaps a critical app might run on 4 hosts spread across two data centres - you're not going to convince people to have chaos monkey regularly and randomly bringing down these hosts, it would cause real impact and is risky. Yeh in theory it should be able to cope but in reality the scales in most orgs are quite different.

That said github sounds a lot more like the netflix end of the scale, doing one specific thing at large scale.


While Netflix as a company is focused at doing one specific thing at large scale, they're heavily vested in microservices and do actually have "thousands of apps that are all doing very different things".

Chaos Monkey fits when people build and deploy their services with the notion that any particular instance (or dependency) could fail at any given time. It's a tough road to evolve out of a legacy, monolithic stack without much redundancy baked in.


Whether they have broken up their apps into microservices doesn't seem to matter to me. That's just a matter of how they have organised their code, whether the actual app is monolithic or microlithic doesn't seem to matter.

They have a focussed business with relatively little variation in how they make money - all their customers simply pay for a streaming service.

Most large companies, certainly banks anyway, have thousands of apps because there's also thousands of different parts of the business making money in their own unique ways that have their own unique needs.

What works for netflix therefore can't work for other businesses, because the actual business is much more heterogeneous than that of netflix, and the technology will reflect that whether it is organised in microservices or monolithically - that's totally irrelevant.


> Perhaps a critical app might run on 4 hosts spread across two data centres - you're not going to convince people to have chaos monkey regularly and randomly bringing down these hosts, it would cause real impact and is risky. Yeh in theory it should be able to cope but in reality the scales in most orgs are quite different.

The difference between theory and reality is precisely the reason Chaos Monkey and tools like it exist.

What you're essentially saying is that in theory, these systems have been designed to be resilient, but in reality, they may not be. If that's the case, then you'd better verify your resiliency, because being resilient in theory but not reality isn't going to help you when your service goes down.


That's true, but if an app, say, is running on 4 hosts doing some boutique thing for a small unit of 20 traders, then the practical reality is that they might not want Chaos Monkey bringing down 25% of the throughput randomly, and interrupting whatever actual cash money requests are in progress on a host.

It's a lot easier to promote that if it is thousands of servers doing something fairly mundane where, worst-case, it not working means a tiny tiny proportion of your customers have to restart their video stream. So what?

But for a small heterogeneous business where what's happening has a much higher cash density, the actual practicalities of randomly killing things in production, and the risk that represents, rather get in the way. Even though in theory you should be able to kill anything in production with minimal impact, you are much less inclined to take that risk when the stakes are higher.


I think you're missing the point. The point of something like chaos monkey is to force you to build a system that won't lose money by "bringing down 25% of the throughput".


My point is that no matter how well engineered your system is, whether you can actually have Chaos Monkey running in production really depends on the risk profile and scale of your business.

As soon as Chaos Monkey causes a service interruption for, say, traders, it would get turned off and whoever had such a bright idea fired. But if it causes a service interruption for a tiny proportion of people watching streaming videos - no big deal.

Its proponents just ignore this practical reality and seem politically unaware.


> In reality, even the most cloudy of cloud providers will run into this problem at some point.

Actually, wasn't this[0] what did happen several years ago when Amazon Ireland went down for days on end?[1]

[0] TL;DR: Cascading effects of power outage.

[1] http://readwrite.com/2011/08/08/amazons-ireland-services-sti... (didn't read the article, it was just high in the google search results)


Interesting. But if, let's say, a data center in London where they have a lot of boxes goes down completely, then they spin up boxes in Frankfurt and Riga to take up the load and reroute traffic. Service is disrupted for some customers for a few minutes. Some people lose some stuff completely because replication wasn't happening perfectly. But the entire site doesn't go down for everyone for two hours.

Are those kinds of failover scenarios frequently messy and risky at the scale of Github? Or is it more likely that in the context of a fast growing company, and even at a place as "cloudy" as Github, there are bound to be some serious bugs lurking in your system design?


I've experienced a brief full-scale power loss at a data center before. It is unbelievable how much goes wrong. The machines had been chugging along for years, happily doing their job, but on the next boot the hard drives were suddenly corrupted, or the power supplies broken. The impacts of that power outage were felt for at least six months.

It's one of those things where, if you're not regularly cutting power to your data center, you're not building resilience to such a thing happening. So when it does, it's not pretty. :)


> if you're not regularly cutting power to your data center, you're not building resilience to such a thing happening

Would love to read examples of who is doing this and how. Reminds me of Netflix's Chaos Monkey, only applied to electricity. :p


There's a mention of Facebook regularly doing this in the summary section of this instagram engineering post: http://engineering.instagram.com/posts/548723638608102/

EDIT: Here's more info: http://www.datacenterknowledge.com/archives/2014/09/15/faceb...


Awesome, thank you. :)


I remember reading a few years back that Yahoo once a week takes a random data center offline, just to make sure they could do that without issues. They probably didn't actually cut the power ;) But they used it as an argument against investing too much in emergency generators and such: they'll fail or cause accidents and you need the ability to fail over either way, so make it routine.


I think trying to cut power at least once is better if it's possible. The reason is that digital is just an abstraction over analog, electrical activity. Plus there's actual analog in there doing work, too. So, seeing how all the chips in there respond to an actual and instantaneous drop of the power would be an interesting test of the models they're built against.

Like an above commenter mentioned, weird activity in electrical system can make some products go haywire and even corrupt data in unexpected ways. Of course, simulated takedowns and all appropriate measures for countering common issues should've already happened before a real one. Just to be extra clear there.


Google wrote an article about disaster recovery in 2012. https://queue.acm.org/detail.cfm?id=2371516


What data center was it?

I can't remember the last time there was a power outage at a Tier I or II data center -- they're all N+1, from the cabinet PDUs to the distribution units to the UPSes to the diesel generators. Some even go so far as to connect to multiple in-feeds from different utility providers.

At my company, every piece of server, storage, and network equipment we own is connected via redundant power supplies to different circuits (except for nonessential equipment like monitors; we can simply re-plug them into the functioning circuit). I can't imagine running a datacenter any other way.


I have no doubt the people at Github have spent a lot of time thinking about multi-region failover. You never hear about the successful failovers --- only the ones which cause outages. To quote a famous US politician: "There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know."

You can't failover things you didn't predict.


Except you can predict it. Your fail-over mechanism needs to be able to detect these things:

1. Degraded performance that might be a fault justifying fail-over. A human in the loop is a must here as complex services can just act weird under load or randomly.

2. Corrupted data or packets coming in that might indicate a failure. Might automatically fail-over here.

3. No data coming in at all for 5-10 seconds, esp on a dedicated line. Fail-over automatically here as nothing sending data is already the definition of downtime and probably indicates a huge failure.

Companies should also do plenty of practice fail-overs at various layers of the stack during non-critical hours to ensure the mechanisms work. In Github's case, number 3 should've applied, and solutions as far back as the '80s would kick in automatically within seconds to minutes. Their tech or DR setup must just not be capable of that. There could be good financial reasons or something for that, but not technological.
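
Number 3 in particular is mechanical enough to script; a rough sketch of such a watchdog (health endpoint and failover command hypothetical), with numbers 1 and 2 left to monitoring and humans as noted above:

    #!/bin/sh
    # Poll the primary; after ~15 seconds of silence, promote the standby.
    FAILURES=0
    while true; do
        if curl -fsS --max-time 5 https://primary.example.com/health >/dev/null; then
            FAILURES=0
        else
            FAILURES=$((FAILURES + 1))
        fi
        # Three consecutive 5-second timeouts roughly matches the 5-10 second rule above.
        [ "$FAILURES" -ge 3 ] && { ./promote-standby.sh; break; }
        sleep 5
    done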


Heh, that quote always amuses me. People hated it, but it actually does make a lot of sense.


My experience matches exactly what Github says. Power outages can bring down even the best systems. The problem is that it is never clear what parts of the systems will continue to work in these situations, until it actually happens. Especially when you're talking about complex applications that depend on many moving pieces. The point is, the more complex your online app, the more points of failure can be exposed in these situations.


We've been mitigating this kind of thing with backups at other datacenters or colos for a while. They can be hot standby, cold standby, slightly degraded in performance, whatever. I also recommend the backup be on a different part of the overall power grid in case the failure cascades across the grid. The good colos often have connections to multiple backbones, too, which is extra redundancy.

That all assumes there's a total and catastrophic failure at the main datacenter. If not, there are local backup batteries to sustain a smoother fail-over plus shutdown. Plus, there are tricks like isolating the monitoring systems from the main systems and power supply using things like data diodes over optocouplers or infrared. At least one thing will still be working and feeding you reliable information over a wireless connection after the full failure.

NonStop and VMS setups from the late '80s did better than Github. My own setups involving a minimum of servers plus apps with loose coupling could fail over in such a situation. So, this just has to be bad architecture caused by who knows what. Examples below of OpenVMS in catastrophic situations having either no downtime or short downtime due to good architecture plus disaster planning.

Case study of active-active at World Trade Center http://h71000.www7.hp.com/openvms/brochures/commerzbank/comm...

Marketing piece where HP straight-up detonates a datacenter. Guess who was number 1 in recovery. :) https://youtu.be/bUwthF9x210?t=34s


I doubt it matters to anybody but was it really necessary to kill the fish?


Watch it until the end :)


Haha nice catch. I missed it originally thinking it would just be more marketing crap. So, they probably just moved it before detonating. Not sadistic bastards after all.


Watch 'The Prestige'.


I know... My original reply mentioned two scenarios with one having replacement fish. Then, I thought people would think I'm overly paranoid or negative. I just couldn't help wonder if they'd blow the fish for fun then avoid liability with similar looking ones. Then, I edited the comment for sake of presumption of innocence.

But, yeah, I hear you... Great movie as well. One of few that brings my favorite mad scientist into eye of mainstream audience as well. I doubt I must name him. :)

EDIT to add: I'm guessing you think the geeks were too sadistic to pass up the opportunity, eh?


> I'm guessing you think the geeks were too sadistic to pass up the opportunity, eh?

No, it just seems weird, that's all. I don't see how either interpretation would benefit HP.


Oh OK. Makes sense.


I did, but unless the whole video was a fake it doesn't really matter does it? And if it is then that does not reflect well on HP either...


I didn't think about the image angle. Yeah, you'd think a marketing person would be like, "Wait, this could lead to a PETA lawsuit and lower sales. Not to mention our segment that likes fish."


Dude, I was thinking the same thing! That was seriously f*ed up. They should've left some cool fireworks or something left over from July 4th. Or some safe-ish chemical that would make colorful smoke. All kinds of tricks you can do without killing live animals.

I mean, I've heard about things so wrong and easy it's like shooting fish in a bucket but... exploding fish in a datacenter? That's on another level.


> HP straight-up detonates a datacenter

Apparently 5 server racks in the middle of an open field is a "datacenter".


Nah, a collection of computers with high-availability setup communicating with another collection constantly over a dedicated or high-bandwidth line. In the demo, it was 5 racks in an open field. In the bank study, it was a whole bank's worth of computers in two locations. For some organizations, it's 5+ of them just to be sure.

The common trend is that the systems constantly sync critical data, can detect downtime, and automatically (or manually) fail over when it occurs. There have been OSes and ISVs offering that capability, with many proven in the field, going back decades. Certain high-tech companies just don't apply those for whatever reasons. Maybe their stacks just still don't have that feature.


I am surprised at the data center. Power failure is one of the most basic parts of being N+1 for a data center. That is why they have batteries (last a few minutes) and then diesel generators (last days if needed).


Stuff happens, and even if you test all kinds of things real failure situations always can work differently, with partial failures etc. Just takes one important subsystem hitting an unforeseen edge case, and going completely down is in many cases better than risking running in a state that destroys data or does other bad things. Same for taking your time to go back online.

The cases that work are not the ones you hear about. Best practices and testing reduce the risk of making the news, but can't guarantee success.


The only way you can build fault-resilient systems is to frequently test fault injection scenarios. Netflix is pretty mature in this regard, perhaps Github can learn from their example.

That said, it's possible that github may have considered that this particular style of outage is rare enough that they don't want to make their design tolerate it. Though if that were the case, I'd wager they'd re-evaluate the cost/benefit right around now. :)


Gotta love it when a top comment starts with "Am I the only one".


It is a bit surprising actually. It means they haven't built their app to be tolerant of single DC loss, either on purpose or because they didn't test it properly.

Purely conjecture, but I suspect since github uses mysql cluster they only write to a single dc, which would be the primary dc that failed in this case.


I'm shocked as well. You would think they would deploy in multiple availability zones at the very least.


Here's the only page I could quickly find on Github's architecture for those interested:

https://github.com/blog/530-how-we-made-github-fast

This looks like a single datacenter. I don't see anything here indicating high availability or other datacenters. You'll usually spot either an outright mention of it or certain components/setups common in it. They might have updated their stuff for redundancy since then. However, if it's same architecture, then the reason for the downtime might be intentional design where only a single datacenter has to go down.

Might be fine given how people apparently use the service. It's just good to know that this is the case so users can factor that into how they use the product and have a way of working around the expected downtime if it's critical to do so.


"Millions of people and businesses depend on GitHub"

Well, we shouldn't depend on it so much.

I shudder at the thought of what an outage of GitHub would mean for our company. This time, we were lucky, as it was during the night in Europe.

Unfortunately, I don't have the power to test this scenario in our company.


I, like others, am confused by this common sentiment. Github is the remote repo, but the version control is distributed, so everyone has a copy. I'm pretty sure I can fill a few hours or more with work needing to be done on my local repo. FYI, I'm not a professional software developer, but I would like to know.

The things that come to mind: issue trackers, messaging, not being able to see latest pull requests.

Update: Now I'm starting to understand the build dependency issue. Still, why do you need to rebuild all dependencies from GitHub repos to build the application? Can't the currently available versions work?


Continuous integration, continuous delivery. Your Jenkins jobs all point to repos on GitHub? Do you plan to fix every single URL? Some tools actually pull stuff from GitHub. If you don't have a mirror privately somewhere, where do you push your code? How can you tell you actually have the latest of everything? Time to compare with every co-worker.


It shouldn't really have much effect. One of git's major selling points is that it's a DVCS, meaning that everyone has a local copy of the repository. Perhaps some collaboration features will be down for a couple of hours (which I think is a downside to GitHub's decision not to put issues/PR history inside of git), but everyone should still be able to do work, commit to the repo, review history, and so forth. If you have people who do code, they can probably find something to work on for two hours without having the Issues/PR interface, right?


All sorts of other dependencies go down though. Packages you need for your build aren't there. CI or testing integrations don't happen. Code review is probably not happening. If you track issues in GH you can't see what's next to work on or look up requirements.

You're right in that you're (probably) not totally deadlocked. But I can't start to estimate the lost $$ in productivity that comes with a global GH outage because of all that.


Have a local repo that mirrors the master one on GitHub periodically.

Should that fail, start working on the local repo until GitHub is back, then sync back to it.


Depending on your definition of "periodically" you may lose almost as much time to syncing back as the outage would have caused without the local mirrors.


I've written scripts that do this. Any request for a repo is polled against the local repo server that makes sure it has the repo, and then quickly checks to see if the repo's out of date, caching the resulting file if the repo can be reached. If the repo can't be reached, just have the proxy deliver the old fileset. So the local repo gets updated, or at least attempts to update, with every hit against it. I had some other logic in the script to only check freshness every 10-15 minutes, so that during times when a lot of machines were pulling, they were essentially guaranteed to all get the same version.
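
A rough sketch of that update-if-stale, serve-stale-on-failure logic (paths and the freshness window hypothetical):

    #!/bin/sh
    # Run per request for a given repo; the mirror is served whether or not upstream is reachable.
    MIRROR=/srv/mirrors/org/project.git
    UPSTREAM=https://github.com/org/project.git

    # Create the mirror on the first request.
    [ -d "$MIRROR" ] || git clone --mirror "$UPSTREAM" "$MIRROR"

    # Only re-check upstream if the last successful update is older than 15 minutes.
    STAMP="$MIRROR/.last-update"
    if [ -z "$(find "$STAMP" -mmin -15 2>/dev/null)" ]; then
        # If GitHub is unreachable this fails quietly and the existing copy is served as-is.
        git --git-dir="$MIRROR" remote update --prune && touch "$STAMP"
    fi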


This is certainly one of the better ways to do it - when I see a word like periodically I assume it means daily/weekly/on some sort of calendar-based schedule, which isn't necessarily the case here.


Why is the master on GitHub anyway?

If a company can't maintain their own internal tools and self-hosting servers, why does the same company think it can run reliable services for users?

Not putting the core of your business on a remote platform is disaster mitigation 101.


Why use AWS, GCE or any other virtualization provider? I suspect for some subset of companies the answer is the same.

Relying on Github is not the problem, relying on Github to be available 24/7 is. Github provides a free master node for your eventually consistent database needs, where the database is git. The eventual portion is key here.


The key word is infrastructure.

Github should not be the master, it should be a mirror of a company master that they host on their own server.

The main problem with that is that some companies don't want the cost of the infra plus the cost of the sysadmin to set it up, etc.

The second problem is the build: even if you host your own repos, if all your dependencies are on GitHub and you don't include them in the repo, then you are bust.
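
For the "GitHub as mirror of an internal master" part, one hedged sketch of how that could work: a post-receive hook on the internal repo that best-effort mirrors every push up to GitHub, so a GitHub outage never blocks internal pushes. The GitHub URL is a placeholder, and this is only an illustration, not anyone's actual setup.

    #!/usr/bin/env python3
    # post-receive hook on the internal master repo: best-effort mirror to GitHub.
    import subprocess
    import sys

    GITHUB_MIRROR = "git@github.com:example-org/example-repo.git"  # placeholder URL

    try:
        # --mirror pushes all refs, so GitHub stays an exact copy of the internal repo.
        subprocess.run(["git", "push", "--mirror", GITHUB_MIRROR], check=True, timeout=120)
    except Exception as exc:
        # A GitHub outage should never fail the internal push; just warn and move on.
        print(f"warning: could not mirror to GitHub: {exc}", file=sys.stderr)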


It's one thing when one temporarily loses access to remote repositories for pushes. Quite bearable, because you can exchange code across your corporate network using patches and whatnot. And it's totally different when you cannot friggin build anything because package managers grab dependencies directly off of GitHub.


This is more an argument for caching or vendoring dependencies than anything else.

If the ability to make builds is critical to your org, making your build process depend on the availability of third-party services over which you have no control is going to end in tears.


This is it. Production builds have to have dependencies hosted internally, not all over the web.


Agreed. Pulling in third-party dependencies, while wonderful in its way, has gotten so easy that even "simple" applications require automated caching infrastructure. E.g. if you just fork your top-level dependencies, you won't pick up any of your recursive dependencies.

I suppose we all need package manager and git/VCS aware recursive forking/caching tools now. E.g. works with npm, gem, etc. and recursively forks your entire dependency chain.

And to think that I managed that sort of thing entirely by hand some years back. (For C/C++ libs, then, so far more manageable.)
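
A toy sketch of what such a recursive caching tool might look like for npm: walk the public registry metadata and stash each tarball locally. A real tool would resolve semver ranges instead of just grabbing each dependency's latest version; the cache path and the example package name are arbitrary.

    #!/usr/bin/env python3
    """Toy recursive npm dependency cacher (illustration only)."""
    import json
    import pathlib
    import urllib.request

    REGISTRY = "https://registry.npmjs.org"
    CACHE = pathlib.Path("npm-cache")  # assumed local cache directory

    def cache_package(name, seen):
        """Download the latest tarball for `name` and recurse into its dependencies."""
        if name in seen:
            return
        seen.add(name)
        with urllib.request.urlopen(f"{REGISTRY}/{name}") as resp:
            meta = json.load(resp)
        latest = meta["dist-tags"]["latest"]
        version = meta["versions"][latest]
        CACHE.mkdir(exist_ok=True)
        target = CACHE / f"{name.replace('/', '_')}-{latest}.tgz"
        if not target.exists():
            urllib.request.urlretrieve(version["dist"]["tarball"], str(target))
        # Real tools resolve semver ranges; here we only follow dependency names.
        for dep in version.get("dependencies", {}):
            cache_package(dep, seen)

    if __name__ == "__main__":
        cache_package("left-pad", set())  # arbitrary example package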


Not much detail here. A more thorough postmortem would give me more confidence they can recover from another similar issue. Hoping to see one soon.


Yep, I think most of these post-mortems from any company are pointless from a technical perspective. It's four paragraphs that boil down to "someone did something wrong and we'll make sure it doesn't happen again" with zero specifics.

There's no point in reading these because there's no technical information. Stuff like this is something you send to your customers because they want a root cause.


I strongly disagree that these sorts of communications are pointless. In every major service outage I've seen where the company maintained a degree of silence, it's caused major damage to their public relations and consumer trust.

I know it doesn't tell you much about exactly what happened, but the truth is they may still be sorting that out and focusing on ensuring it does not happen again. An in-depth post-mortem accompanied by a description of the fix would be great. In the meantime, admitting culpability and apologizing are the essential first steps.


I agree that a postmortem would be great, but it's good PR for companies to quickly put out statements like this to admit fault and maintain customer trust.


Give them time.


You can see the cascade effect on their status page graphs: https://status.github.com/


What is impressive is that, with the website down for 2 hours, they can still announce 97% availability for the day, even though the graph clearly shows the 2h of failures... :-/


The 97% you see on the status page is for the past 24 hours. That doesn't include any of the outage being discussed here.


Unless I'm mistaken, 97% of 24 hours = 23.28 hours, i.e. only about 43 minutes of downtime.


yes, it went down to 89% or something just after the problem.


Interesting that their exception logging didn't get turned back on until this morning, from the looks of things.


Well, if the exception logger was going off nonstop due to the outage, yet not providing any new information, it would make sense to disable it until things had returned to normal.


This post makes it sound like GitHub has its own data centers and power infrastructure, which is definitely news to me. I'd presumed colo at best.


The last news I've heard about it was back in 2009, https://github.com/blog/493-github-is-moving-to-rackspace. But I've also heard that they have some infrastructure on site (clearly not what they were talking about).


"data center" is a confusing term.

Very few companies build their own "data centers" (Apple, Google, Amazon, the NSA, actual 'data center' companies, etc.). Most companies rent cage space in a larger data center and call that their "private data center." Smaller companies will rent a few dedicated servers or colo half racks from other resellers.


Unless it's explicitly stated to be a wholly owned data center I always assume companies are talking about rack space in a bigger DC like supernap.


Github doesn't deploy their services in multiple AZs?


Maybe they do.

But this two hour failure tells me that they have never really tried a hot failover and failback scenario in order to test the resiliency of their site.


Or something happened that didn't happen in the tests. And if they suspected something might be in an inconsistent state, taking some downtime to make sure it comes back up properly clearly is the better option.


Hope we get more info about it. It would be very interesting to see how their architecture is set up.


it seriously makes me lol that people are upset, or surprised, that an internet service went down for a couple of hours. a couple of hours! get some perspective please. go for a walk, get a tasty burrito, try a new brand of hot sauce.

"why didn't they do X, Y, or Z"

the answer in every case is it's extremely expensive, or extremely hard to do, or both. you want a reason, there's the reason. maybe they'll fix it. maybe they won't. next question.

make your own backups and redundant systems. "but github is so critical!" -- even more reason to have a backup. bad shit happens in this world. even to good people. prepare or suffer the consequences.


Maybe I'm ignorant, but why do companies rely on GitHub? Why not just host it in-house? If there's a power outage in the office, then everything would be down anyway, right?


A rare two hour Github outage isn't enough to make anyone on my team want to start dicking around with internal tools.


Would it be possible for a cross between Git and Torrents? Rather than having a central server to pull/push from, instead the server would provide a list of clients. If the server goes down, the list is still available, and so people who depend on it would be able to communicate.


Why is it so hard for us to distribute our dependencies? Hash the package to a SHA and put it anywhere on the Internet. Then we just need a service that holds and updates the locations of the hashes, and we can fetch them from anywhere.
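
Roughly the idea, as a sketch: the lookup service (not shown) maps each hash to candidate URLs, and the client accepts a download only if the hash actually matches, so it doesn't matter who hosts the bytes. Names here are assumptions, not an existing API.

    import hashlib
    import urllib.request

    def fetch_by_hash(sha256_hex, mirror_urls):
        """Try each candidate URL; accept only content whose SHA-256 matches."""
        for url in mirror_urls:
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    data = resp.read()
            except OSError:
                continue  # mirror unreachable; try the next one
            if hashlib.sha256(data).hexdigest() == sha256_hex:
                return data
        raise RuntimeError("no mirror served content matching the expected hash")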


For those that have been affected by this, what parts of your process were disrupted? I've read, so far:

  * Build fails due to unreachable dependencies hosted by GitHub
  * Development process depends on PRs


Chinese DDoS? Somehow I don't buy power going out at a server farm.


Why not? Things break. Electricity is one of those magical things it's very hard to get insanely good uptime on -- frankly, it's incredibly impressive that power outages aren't more common.

And why would GitHub not disclose that it was a DDoS? They were very forthcoming when there actually _was_ a Chinese DDoS last April: http://arstechnica.com/security/2015/04/ddos-attacks-that-cr...

And in a DDoS, the service typically becomes slower and slower until it reaches the point where only like one in a hundred requests succeeds. With the GitHub outage, it died fairly instantaneously, and it was completely 100% dead. There was no timeout as the servers tried to respond -- the "no servers are available" error page loaded instantly every time.


> Chinese DDoS? Somehow I don't buy power going out at a server farm.

You should read more about server farms.


Considering the attacks within the past year, I was thinking the same thing. I hate to spread conspiracies without foundation, but I wonder if anyone has seen an assessment of the cyberkinetic capabilities of nations around the world?


It's always scary when a cloud service you rely on goes down but great to see GitHub recover. Well done!


Various date/time formats across the world are bringing me to my knees. If the 1/28 outage was _that_ rough, then 2/28 would be twice as bad and 28/28 would feel like armageddon, maybe?


I'm starting to think that people should mirror their packages to BitBucket as a rule, and that package managers should round robin/flip a coin between the two, or use whichever is available in case of outages.
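
On the client side that fallback could be as simple as trying each host in order. A minimal sketch with placeholder repo URLs (not a real package's mirrors):

    import subprocess

    # Hypothetical mirror pair for one package; any pair of git hosts would do.
    REMOTES = [
        "https://github.com/example-org/example-pkg.git",
        "https://bitbucket.org/example-org/example-pkg.git",
    ]

    def clone_with_fallback(dest):
        """Clone from the first reachable mirror and return the URL that worked."""
        for url in REMOTES:
            try:
                subprocess.run(["git", "clone", url, dest], check=True, timeout=300)
                return url
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
                continue
        raise RuntimeError("all mirrors failed")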


I'd rather have something like Netflix's Open Connect Appliances, covering all of Github, sitting in each office and a centrally located colo facility.


I'm not familiar with Open Connect Appliances, but Github as a platform is still a Single Point of Failure at least on some level. Domain or SSL issues, for example.

I think that as long as we have options to host packages on other platforms in addition, it should be seriously considered. At the very least, it would encourage a more competitive atmosphere for open source hosting services.


I recently read a blog post from Github about them operating their own datacenter http://githubengineering.com/githubs-metal-cloud/

I'm not positive, but it sounds like a fairly recent switch from a cloud provider to their own datacenter. If that's the case, I'd expect a number of outages to come in the following months.


AFAIK, they never used a cloud provider.


Github was hosted at Rackspace; here is their blog post about it: https://github.com/blog/493-github-is-moving-to-rackspace

From their blog post last month:

As we started transitioning hosts and services to our own data center, we quickly realized we'd also need an efficient process for installing and configuring operating systems on this new hardware.



