> I miss that direct connection. The fast feedback. The lack of making grand plans.
There's no date on this article, but it feels "prior to the MongoDB-is-webscale memes" and thus slightly outdated?
But, hey, I get where they're coming from. Personally, I used to be very much schema-first, make sure the data makes sense before even thinking about coding. Carefully deciding whether to use an INT data type where a BYTE would do.
Then, it turned out that large swathes of my beautiful, perfect schemas remained unoccupied, while some clusters were heavily abused to store completely unrelated stuff.
These days, my go-to solution is SQLite with two fields (well, three, if you count the implicit ROWID, which is invaluable for paging!): ID and Data, the latter being a JSONB blob.
Then, some indexes specified by `json_extract` expressions, some clever NULL coalescing in the consuming code, resulting in a generally-better experience than before...
> These days, my go-to solution is SQLite with two fields (well, three, if you count the implicit ROWID, which is invaluable for paging!): ID and Data, the latter being a JSONB blob.
Really!? Are you building applications by chance or something else? Are you doing raw sql mostly or an ORM/ORM-like library? This surprises me because my experience dabbling in json fields for CRUD apps has been mostly trouble stemming from the lack of typechecks. SQLite's fluid type system haa been a nice middle ground for me personally. For reference my application layer is kysely/typescript.
> my experience dabbling in json fields for CRUD apps has been mostly trouble stemming from the lack of typechecks
Well, you move the type checks from the database to the app, effectively, which is not a new idea by any means (and a bad idea in many cases), but with JSON, it can actually work out nicely-ish, as long as there are no significant relationships between tables.
Practical example: I recently wrote my own SMTP server (bad idea!), mostly to be able to control spam (even worse idea! don't listen to me!). Initially, I thought I would be really interested in remote IPs, reverse DNS domains, and whatever was claimed in the (E)HLO.
So, I designed my initial database around those concepts. Turns out, after like half a million session records: I'm much more interested in things like the Azure tenant ID, the Google 'groups' ID, the HTML body tag fingerprint, and other data points.
Fortunately, my session database is just 'JSON(B) in a single table', so I was able to add those additional fields without the need for any migrations. And SQLite's `json_extract` makes adding indexes after-the-fact super-easy.
Of course, these additional fields need to be explicitly nullable, and I need to skip processing based on them if they're absent, but fortunately modern C# makes that easy as well.
And, no, no need for an ORM, except `JsonSerializer.Deserialize<T>`... (And yeah, all of this is just a horrible hack, but one that seems surprisingly resilient so far, but YMMV)
> Fortunately, my session database is just 'JSON(B) in a single table', so I was able to add those additional fields without the need for any migrations. And SQLite's `json_extract` makes adding indexes after-the-fact super-easy.
Our solution for a similar situation involving semi-structured data (in postgres) was to double it up: put all the json we send/receive with a vendor into a json field, then anything we actually need to work on gets extracted into regular table/columns. We get all the safety/performance guarantees the database would normally give us, plus historical data for debugging or to extract into a new column if we now need it. The one thing we had to monitor in code reviews was to never use the json field directly for functionality.
This is exactly what I've tried (and failed at) doing! Can I ask how you handle normalization from vendor data when it contains relationships and multilevel nesting? How do you know when to create a new child table, and which ones to create, and their relationships etc. I haven't found a good balance yet.
Basically what the other reply said - handle it the same as you would any complex data. You just don't need to handle all of the json immediately, only the parts you plan on using for the moment.
Odd-shaped miscellaneous data that you only need to retrieve is a good candidate for a JSON field. Once you're heavily using some piece of data, or if you need to index it (which means you are heavily using it), you should insert the data in the database "properly".
If some vendor is giving you a list of categories you don't care about, there's no need to make a vendor categories table and a many-to-many link table until you actually need them.
The point is that putting data properly in the database lets you use database features on it and get database performance.
> Well, you move the type checks from the database to the app, effectively, which is not a new idea by any means (and a bad idea in many cases), but with JSON, it can actually work out nicely-ish, as long as there are no significant relationships between tables.
That way you're throwing away 50% of the reason you use a relational database in the first place. Has it occurred to you that MongoDB exists?
Also I don't understand why you're afraid of migrations, especially since you're the only developer on your own SMTP server.
> My original comment started with "but it feels "prior to the MongoDB-is-webscale memes""
Which feels off by six generations or so of memes. It feels prior to "memes" existing in the first place (at least this modern iteration, pics with captions in them; not Dawkins' original verion). I'd guess it is, ironically, chronologically closer to, well, your username here.
You certainly would lose a lot of things, like a well supported path to linking with to the database engine, and a straightforward way to start to introduce relational tables as the project matures. Nothing completely insurmountable, of course, but carry a lot of extra effort for what benefit?
How does MongoDB handle someone pulling the power cord out of the server? Because that’s another reason to use something like SQLite, and it often gets used in embedded systems.
> That way you're throwing away 50% of the reason you use a relational database in the first place. Has it occurred to you that MongoDB exists?
Did you miss that he’s using sqlite? The dev experience with a sqlitedb is way better than running yet another service, especially for personal projects.
Sqlite is used just as much as an application file format as it is a relational database.
> Fortunately, my session database is just 'JSON(B) in a single table', so I was able to add those additional fields without the need for any migrations.
> And SQLite's `json_extract` makes adding indexes after-the-fact super-easy.
That's a migration.
> Of course, these additional fields need to be explicitly nullable, and I need to skip processing based on them if they're absent
That's an effect of not migrating - having to process null and absent fields instead of just null fields. After doing more of these, you'll run into the same thing that made people stop using NoSQL databases: with no schema, your code has to parse all previous versions of the data format and they probably aren't even well-documented. While an RDBMS can just set the new column to null in existing rows.
> And, no, no need for an ORM, except `JsonSerializer.Deserialize<T>`... (And yeah, all of this is just a horrible hack, but one that seems surprisingly resilient so far, but YMMV)
I do the same thing with serde_json in Rust for a desktop app sqlitedb and it works great so +1 on that technique.
In Rust you can also tell serde to ignore unknown fields and use individual view structs to deserialize part of the JSON instead of the whole thing and use string references to make it zero copy.
“Without the need for any migrations” seems like a weird one to me. Of all of the things I don’t like having to do, migrations are pretty low on my list - far below any scripting-in-prod or applying hacks or even just rudimentary business logic changes. Granted, I used to fear them a lot back when I was expected to apply them by hand rather than use CICD and testing to validate/apply them
Recently I’ve been tempted to make an SMTP server that translates emails into a web-hook. Please tell me more horror stories so that I might be convinced not do it.
because they put everything in JSON. Migration means running a script to parse and edit each JSON item instead of letting the database do database things automatically.
In my current company, we're using a similar approach: just shove everything into a JSON blob. If you need a constraint or an index, you can create a computed column (in PostgreSQL) that pulls out a field from JSON.
For the data schema, we're using Protobufs with buf validate. This works surprisingly well, you can use the same types in the backend API and on the frontend. We even have a cron job that reads all the data periodically and verifies that the JSON blobs conform to the schema. Our next intern is going to write a PostgreSQL extension to do that on commit/update :)
One real advantage of this approach is that it's easy to do stuff like "search everywhere".
Do you go to the trouble of updating individual values in objects using some kind of deep/partial updating function or do you just accept race conditions that come with updating full objects?
We use optimistic versioning, with a dedicated "version" field (that is actually always pulled out of the blob in all tables).
Classic fine-grained schemas are not that much different. A lot of high-level ORM frameworks simply save all the objects' fields on update, without doing fine-grained diffs.
In addition, our frontend apps also support offline mode. They can get all the relevant objects, and then operate on them locally. So our API was designed from the start to deal with conflicts.
> We use optimistic versioning, with a dedicated "version" field (that is actually always pulled out of the blob in all tables).
All well and good but you do need to handle failures elegantly in code. The nice thing about flat DB tables and SQL is you don't really have to care if your goal is to update a single column authoritatively. The state of the other values in the table are, often, immaterial. It gets even more complicated reconciling deeply nested conflicts in open schemas.
Not knocking your approach, it's just a trade-off I guess.
Yeah, makes sense. To write safe code though at that point don't you have to enforce rigid schemas?
An example would be updating Kubernetes resources, the admission controller will verify the correctness of the change you are trying to make and reject the patch if it's non-comformant. Nested values have... value... in terms of contextualizing and isolating/localizing data leaves but at the end of the day aren't you still dealing with strict schemas and, when those schemas change you have to reconcile schema migrations?
> To write safe code though at that point don't you have to enforce rigid schemas?
Certainly. But you have to do that with SQLite anyway, because (by default, without strict mode [1] enabled) it won't stop you from putting the wrong type of data into a column.
I've worked on a system were a guy "migrated" a database from NoSQL to SQL as a proof-of-concept. Except he really didn't migrate anything. He basically created tables with "ID" (UUID) and "data" (json) columns everywhere. No indexes.
The actual "data" is a mess: different data types for the same JSON field. Imagine storing "price" as both a float and a string, depending on whatever buggy code happened to do the inserts.
It worked enough for a prototype and that was enough for management to believe the project was a success. I can't wait until we actually try and deploy it.
Have you had the pleasure of blowing young minds by revealing that production-grade databases come with fully fledged authnz systems that you can just...use right out of the box?
Databases have pretty robust access controls to limit (a sql user's) access to tables, schemas, etc. Basic controls like being able to read but not write, and more advanced situations like being able to access data through a view or stored procedure without having direct access to the underlying tables.
Those features aren't used often in modern app development where one app owns the database and any external access is routed through an API. They were much more commonly used in old school apps enterprise apps where many different teams and apps would all directly access a single db.
This is perfectly fine when you are driving some app that has a per-user experience that allows you to wrap up most of their experience in some blobs.
However I would still advise people to use a third normal form - they help you, constraints help you, and often other sets of tooling have poor support for constraints on JSON. Scanning and updating every value because you need to update some subset sucks.
You first point is super valid though - understanding the domain is very useful and you can get easily 10x the performance by designing with proper types involved, but importantly don't just build out the model before devs and customers have a use for anything, this is a classic mistake in my eyes (and then skipping cleanup when that is basically unused.)
If you want to figure out your data model in depth beforehand there's nothing wrong with that... but you will still make tons of mistakes mistakes, lack of planning will require last minute fixes, and the evolution of the product will have your original planning gather dust.
> Scanning and updating every value because you need to update some subset sucks.
Mirrors my experience exactly. Querying json can get complex to get info from the db. SQLite is kind of forgiving because sequences of queries (I mean query, modify in appliation code that fully supports json ie js, then query again) are less painful meaning it's less moprtant to do everytning in the database for performance reasons. But if you're trying to do everything in 1 query, I think you pay for it at application-writing time over and over.
It includes a reference to Backbone and Knockout JS, which were released in 2010, so presumably it was around that era. The database, though, was probably much older...
I think you can only judge that by knowing the context, like the domain and the experience of the designer/dev within that domain.
I looked at a DB once and thought "why are you spending effort to create these datatypes that use less storage, I'm used to just using an int and moving on."
Then I looked at the volumes of transactions they were dealing with and I understood why.
Deep down, the optimizer in me wants this to be true, but I'm having trouble seeing how this difference manifests in these days of super powerful devices and high bandwidth.
I guess I just answered my own question though. Supposing there's a system which is slow and connected with very slow connectivity and still sending lots of data around, I guess there's your answer. An embedded system on the Mars Rover or something.
I actually love your approach and haven’t thought of that before.
My problem with relational databases often stems from the fact that remodeling data types and schemas (which you often do as you build an application, whether or not you thought of a great schema beforehand) often comes with a lot of migration effort.
Pairing your approach with a „version“ field where you can check which version of a schema this rows data is saved with would actually allow you to be incredibly flexible with saving your data while also being able to be (somewhat) sure that your fields schema matches what you’re expecting.
Having to write and perform migrations for every small schema change is a bore, but it means your software doesn't have to worry about handling different versions of data. Going "schemaless" with version numbers means moving code from "write-and-forget" migrations to the main codebase, where it will live forever.
I think not doing database migrations only makes sense when you can make do without version numbers (or if you can't do atomic migrations due to performance constraints, but that's only a problem for a very small number of projects).
You’re correct there. I mostly work on CMSes with page builder functionality, which often bake the content schema into the database columns, which makes changing that schema (for new frontend features or reworking old ones) difficult and often prone to losing content, especially in dev environments.
Best case is obviously that you never have to version your changes, but I‘d prefer making a new schema and writing an adapter function in the codebase depending on the schemas version to spending a lot of time migrating old content. That might just be due to me not being too comfortable with SQL and databases generally.
> Not having to write and perform migrations for every small schema change is a bore, but it means your software doesn't have to worry about handling different versions of data.
Same here. If your entities are modelled mostly correctly you really don't have to worry about migrations that much. It's a bit of a red herring and convenient "problem" pushed by the NoSQL camp.
On a relatively neat and well modelled DB, large migrations are usually when relationships change. E.g. One to many becomes a many to many.
Really the biggest hurdle is managing the change control to ensure it aligns with you application. But that's a big problem with NoSQL DB deployments too.
At this point I don't even want to hear what kind of crazy magic and "weird default and fallback" behavior the schema less NoSQL crowd employs. My pessimistic take is they just expose the DB onto GraphQL and make it front ends problem.
I agree. Migrations have never been a problem at my company and this codebase is 9 years old. Just regular old postgres with a few JSONB columns of things that don't need to be relational.
Same for the database that was about 8-9 years old at my last company. Migrations are fine. It's the relationship-changing that is painful.
>> If your entities are modelled mostly correctly you really don't have to worry about migrations that much
I'm gonna take a wild guess here that you have never worked in unfamiliar domains (like lets say deep cargo shiping or subpremium loans) where your so called subject matter experts provided by client werent the sharpest people you could hope for and actually did not understand what they where doing for most of the time?
Because I on the other hand am very familiar with such projects and doing schema overhaul third time in a row for production system is bread and butter for me.
Schemaless systems is the only reason I'm still developer and not lumberjack.
I think they are referring to the fact that software development as a field has matured a lot and there are established practices and experienced developers all over who have been in those situations, so generally, these days, you don't see such code bases anymore.
That is how I read it.
Another possible reason you don't see those code bases anymore is the fact that such teams/companies don't have a competitive comp, so there are mostly junior devs or people who can't get a job at a more competent team that get hired in those places
"The rowid is implicit and autoassigned, but we want developer-friendly IDs." maybe
But of course, the obvious solution is to have one table with just ROWID and data, and another table with the friendly IDs! If you time the insertions really well, then the ROWIDs in both tables with match and voilà.
There's no date on this article, but it feels "prior to the MongoDB-is-webscale memes" and thus slightly outdated?
But, hey, I get where they're coming from. Personally, I used to be very much schema-first, make sure the data makes sense before even thinking about coding. Carefully deciding whether to use an INT data type where a BYTE would do.
Then, it turned out that large swathes of my beautiful, perfect schemas remained unoccupied, while some clusters were heavily abused to store completely unrelated stuff.
These days, my go-to solution is SQLite with two fields (well, three, if you count the implicit ROWID, which is invaluable for paging!): ID and Data, the latter being a JSONB blob.
Then, some indexes specified by `json_extract` expressions, some clever NULL coalescing in the consuming code, resulting in a generally-better experience than before...