I’m pretty confident that GitHub doesn’t use foreign keys because it was built as a Rails app. And the “Rails Way” is to enforce these constraints in the model. Foreign key constraints weren’t first-class in Rails until 4.2 (if memory serves correctly).
I was once a full-time Rails dev and really loved the framework (I don’t write as many user-facing applications these days). Most of the Omakase trade-offs didn’t bother me. But I never understood the disdain for foreign keys. For 99.99% of web apps, you want them (dare I say, need them).
Even in Rails 3 I would add them by hand in the migration files. Very few applications will ever actually care about sharding. We were pulling in millions at the last Rails company I worked for and were just fine with a master and a single replica.
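In Rails 3 that meant dropping to raw SQL via `execute` in the migration. A minimal sketch of the kind of statement I mean, with hypothetical table names:

```sql
-- Hand-written FK for a Rails 3 migration (wrapped in execute()),
-- before add_foreign_key existed. Table names are made up.
ALTER TABLE comments
  ADD CONSTRAINT fk_comments_posts
  FOREIGN KEY (post_id) REFERENCES posts (id);
```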
If you get to a point where sharding is best for your company you hopefully have enough revenue coming in to fund a data transition. Your goal should be to outgrow what MySQL (or Postgres) can do for you in master/replica mode. If you do, you’ll likely be independently wealthy...
If data integrity matters at all, model-based checks (or periodic queries for orphaned data) will not suffice. Just put a foreign key in the table where it belongs and let your DB do what it does best. ACID is an amazing thing...
> I’m pretty confident that GitHub doesn’t use foreign keys because it was built as a Rails app
Maybe originally, but lack of foreign key usage is certainly not Rails specific today. Large MySQL shops generally don't use foreign keys, full stop, for the exact reasons Shlomi described in the original comment.
Facebook does not use foreign keys either. In my experience, the same thing is true at all the other large MySQL-based companies. And those companies make up a majority of the largest sites on the net, btw -- including most of the consumer-facing social networking and user-generated content products/apps out there.
This does not mean that, for example, Facebook is full of data that would violate constraints. There are other asynchronous processes that detect and handle such problems.
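As a sketch of what such a process can look like (hypothetical table names; I have no knowledge of Facebook's actual tooling), an orphan scan is just an anti-join run out of band:

```sql
-- Periodic job: find child rows whose parent no longer exists,
-- then decide what to do with them (delete, repair, quarantine).
SELECT c.id
FROM comments AS c
LEFT JOIN posts AS p ON p.id = c.post_id
WHERE p.id IS NULL;
```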
> If you get to a point where sharding is best for your company you hopefully have enough revenue coming in to fund a data transition.
Do you mean, stay with your current RDBMS of choice and shard while also removing the FKs? Or do you mean transition to some other system? (and if so, what?)
The former path is bad: sharding alone is very painful even without introducing a ton of new application-level constraint logic at the same time.
The latter path is also bad: only very recent NewSQL DBs have sharding built in, and you probably don't want to bet your existing rocket-ship startup on your ability to transition to one under pressure. Especially given the much higher latency profiles of those DBs, meaning a very thorough caching system is also required, and now you have all the fun of figuring out multi-region read-after-write cache consistency while also transitioning to an entirely new DB :)
“[S]harding alone is very painful even without introducing a ton of new application-level constraint logic at the same time.”
I typically espouse app logic to check state and foreign keys to ensure enforcement (e.g., race conditions that are very hard to close at the app level but are handled natively by most RDBMSs). Foreign key failures are just treated like any other failure mode.
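The classic interleaving I have in mind, sketched with hypothetical tables (each session is its own connection):

```sql
-- Two sessions interleaved top to bottom. Without a FK on
-- comments.post_id, both commits succeed and the row is orphaned.

-- Session A: the app-level existence check passes.
BEGIN;
SELECT 1 FROM posts WHERE id = 42;   -- returns a row

-- Session B: the parent is deleted and committed in between.
BEGIN;
DELETE FROM posts WHERE id = 42;
COMMIT;

-- Session A: the insert proceeds on stale information.
INSERT INTO comments (post_id, body) VALUES (42, '...');
COMMIT;   -- a FK would have rejected the INSERT above
```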
But honestly, I haven’t been part of a company that has really hit the upper limits that require sharding. They exist, yes. But most companies will never need to worry about it. Which is my point.
I agree that most companies won't ever need to shard, and pre-sharding (or worrying about extreme scalability challenges in general) is usually unwise premature optimization. But it does depend on the product category.
Social networking and user-generated content products (including GitHub) need to be built with scale somewhat in mind: if the product reaches hockey-stick growth, a lot of scalability work will need to be completed very quickly or the viral opportunity is lost and the company will fail. I'm not saying they should pre-shard, but it does make sense to skip FKs with future sharding in mind.
This was nearly a decade ago, but in one case I helped shard my employer's DBs (and roll out the related application-level changes) with literally only a few hours to spare before the main unsharded DB's drives filled up. If we also had to deal with removing FKs at the same time, we definitely wouldn't have made it in time, and the product would have gone read-only for days or weeks, probably killing the company. Granted, these situations are incredibly rare, but they really do happen!
It’s at its core a case-by-case decision. I think FKs are also a net negative in data-ingestion scenarios where the data set is big enough.
Trying to make sure everything is where it needs to be at any given time, everything is inserted in the right order, and the data is always consistent brings an exponential amount of complexity, when it could all be checked at the end and pruned of invalid data. And usually DB integrity will not be enough; you’ll want business-level validation that all is OK, so there will be app-level checks anyway.
This looks nice, but it’s still limited to a single commit. That’s where it’s a PITA for anything that won’t (or that we don’t want to) fit in a single commit.
In particular, splitting commits lets us ingest data in parallel (for instance, if we import stores and store owners, both could be ingested separately without caring at first whether each references a valid entity).
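A sketch of that end-of-ingestion prune, assuming hypothetical unconstrained staging tables named after the entities above:

```sql
-- After both feeds are loaded in parallel with no FKs, drop the
-- owners whose store never showed up, then promote the rest.
DELETE FROM staging_store_owners
WHERE NOT EXISTS (
  SELECT 1
  FROM staging_stores AS s
  WHERE s.id = staging_store_owners.store_id
);
```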
At least with MySQL, and probably with Postgres as well, you can temporarily turn off foreign key checks for a set of statements. So you can still get the benefits of foreign key constraints by default, but when they do more harm than good you can turn them off. With the added benefit that turning off FK constraints screams "I am doing something unusual and dangerous - this requires extra caution."
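In MySQL it's a session variable; a minimal sketch:

```sql
-- MySQL: FOREIGN_KEY_CHECKS is per-session, so other connections
-- keep full enforcement while this one does its bulk load.
SET FOREIGN_KEY_CHECKS = 0;
-- ... bulk INSERT / LOAD DATA here ...
SET FOREIGN_KEY_CHECKS = 1;
-- Caveat: rows inserted while checks were off are NOT re-validated
-- when you turn them back on.
```

Though as far as I know Postgres has no exact session toggle; the closest options are `SET session_replication_role = replica` (which disables the internal triggers that enforce FKs, and needs elevated privileges) or dropping the constraint and re-adding it with `NOT VALID` followed by `VALIDATE CONSTRAINT`.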
App level data integrity is usually critical. Ingestion into a reporting database is an entirely different construct (where I agree FK constraints are burdensome).
It's been a while, but IIRC at the time Rails got started, MySQL actually did not even support foreign key constraints. Since that was the DBMS of choice, it wasn't much of a choice.
InnoDB's foreign key support predates the existence of Rails by several years. However, InnoDB wasn't the default storage engine for MySQL at the time, so that may be a factor.
I wonder how different the state of database application development would be today if all those cheap whitelabel webhosts powered by cPanel or Plesk (where most of us got started, I imagine) opted for PostgreSQL instead of MySQL - which would have influenced the major MySQL adopters like phpNuke, phpBB, WordPress, etc.