
There's a simple solution: don't use mmap(). There's a reason that databases use O_DIRECT to read into their own in-memory cache. If it was Good Enough for Oracle in the 1990s, it's probably Good Enough for you.
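
A minimal sketch of the idea on Linux (the file name here is a placeholder, and I'm assuming a 4KB logical block size -- O_DIRECT requires the buffer, offset, and length to be aligned to it):

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* O_DIRECT bypasses the page cache; the application manages its own cache. */
        int fd = open("datafile", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* The buffer must be aligned to the logical block size (4KB assumed). */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096)) { fprintf(stderr, "allocation failed\n"); return 1; }

        /* The offset and length must also be multiples of the block size. */
        ssize_t n = pread(fd, buf, 4096, 0);
        if (n < 0) perror("pread");
        else printf("read %zd bytes into our own cache\n", n);

        free(buf);
        close(fd);
        return 0;
    }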

mmap() is one of those things that looks like it's an easy solution when you start writing an application, but that's only because you don't know the complexity time bomb of what you're undertaking.

The entire point of the various ways of performing asynchronous disk I/O using APIs like io_uring is to manage when and where blocking of tasks for I/O occurs. When you know where blocking I/O gets done, you can make it part of your main event loop.
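
As a rough illustration with liburing (file name and sizes are placeholders): the read gets queued without blocking, the kernel does the I/O in the background, and the completion is reaped wherever the event loop decides to wait.

    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        if (io_uring_queue_init(32, &ring, 0) < 0) return 1;

        int fd = open("datafile", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        static char buf[4096];

        /* Queue the read: this is just bookkeeping, the thread never blocks here. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);

        /* ... service sockets, timers, etc. here while the read is in flight ... */

        /* Reap the completion at a point of the event loop's choosing. */
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read completed with result %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }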

If you don't know when or where blocking occurs (be it on I/O or mutexes or other such things), you're forced to make up for it by increasing the size of your thread pool. But larger thread pools come with a penalty: task switches are expensive! Scheduling is expensive! AVX-512 registers alone are 2KB of state per task, and if a thread hasn't run for a while, you're probably taking misses in your L1 and L2 caches. That's pure overhead baked into the thread pool architecture that you can entirely avoid by using an event driven architecture.

All the high performance systems I've worked on use event driven architectures -- from various network protocol implementations (protocols like BGP on JunOS, the HA functionality) to high speed (persistent and non-persistent) messaging (at Solace). It just makes everything easier when you're able to keep threads hot and locked to a single core. Bonus: when the system is at maximum load, you stay at pretty much the same number of requests per second rather than degrading as the number of runnable threads grows and wastes your CPU resources needlessly when you need them most.

It's hard to believe that the event queue architecture I first encountered on an Amiga in the late 1980s when I was just a kid is still worth knowing today.



You're right. O_DIRECT is the endgame, but that's a full engine rewrite for us.

We're trying to stabilize the current architecture first. The complexity of hidden page fault blocking is definitely what's killing us, but we have to live with mmap for now.


I am curious -- what is the application and the language it's written in?

There are insanely dirty hacks you could do to start controlling the fallout of the page faults (like playing games with userfaultfd), but they're unmaintainable in the long term: they introduce a fragility that produces unexpected complexity at the worst possible times (bugs). Rewriting / refactoring is not that hard once one understands the pattern, and I've done it quite a few times. Depending on the language, there may be other options. Doing an mlock() on the memory being used could help, but then it's absolutely necessary to carefully limit how much memory is pinned by such mappings.
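
If you do go the mlock() route, this is the rough shape of it (a sketch only -- the file name and the 64MB hot-region size are made up, and it needs CAP_IPC_LOCK or a big enough RLIMIT_MEMLOCK):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("index.db", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Map and pin only the hot region, not the whole file. */
        size_t hot = 64UL << 20;
        void *p = mmap(NULL, hot, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* mlock() faults the pages in and pins them, so later accesses to this
           range cannot block on disk I/O. This is exactly the memory that has
           to be accounted for and capped carefully. */
        if (mlock(p, hot)) perror("mlock");

        /* ... hot-path accesses to p go here ... */

        munlock(p, hot);
        munmap(p, hot);
        close(fd);
        return 0;
    }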

Having been a kernel developer for a long time makes it a lot easier to spot what will work well for applications versus what can be considered glass jaws.


There is a database that uses `mmap()` - RavenDB. Their memory accounting is utter horror - they somehow use Committed_AS from /proc/meminfo in their calculations. Their recommendation to avoid OOMs is to have swap twice the size of RAM. Their Jepsen test results are pure comedy.


LMDB uses mmap() as well, but it only supports one process holding the database open at a time. It's also not intended for working sets larger than available RAM.


Wrong, LMDB fully supports multiprocess concurrency as well as DBs multiple orders of magnitude larger than RAM. Wherever you got your info from is dead wrong.

Among embedded key/value stores, only LMDB and BerkeleyDB support multiprocess access. RocksDB, LevelDB, etc. are all single process.


My mistake. Doesn’t it have a global lock though?

Also, even if LMDB supports databases larger than RAM, that doesn’t mean it’s a good idea to have a working set that exceeds that size. Unless you’re claiming it’s scan-resistant?


It has a single writer transaction mutex, yes. But it's a process-shared mutex, so it will serialize write transactions across an arbitrary number of processes. And of course, read transactions are completely lockfree/waitfree across arbitrarily many processes.

As for working set size, that is always merely the height of the B+tree. Scans won't change that. It will always be far more efficient than any other DB under the same conditions.
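
To make the locking model concrete, here's roughly what it looks like with the C API (a sketch with error handling omitted; /tmp/testdb is just a placeholder directory that must already exist). Read transactions opened with MDB_RDONLY take no locks at all, while write transactions from any number of processes serialize on the shared writer mutex inside mdb_txn_begin().

    #include <lmdb.h>
    #include <stdio.h>

    int main(void)
    {
        MDB_env *env;
        MDB_dbi dbi;
        MDB_txn *txn;
        MDB_val key = { 5, "hello" }, val = { 5, "world" }, data;

        mdb_env_create(&env);
        mdb_env_open(env, "/tmp/testdb", 0, 0664);

        /* Write transaction: serializes on the process-shared writer mutex. */
        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);
        mdb_put(txn, dbi, &key, &val, 0);
        mdb_txn_commit(txn);

        /* Read transaction: lockfree; any number may run concurrently, in this
           process or in any other process sharing the environment. */
        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        if (mdb_get(txn, dbi, &key, &data) == 0)
            printf("%.*s\n", (int)data.mv_size, (char *)data.mv_data);
        mdb_txn_abort(txn);

        mdb_env_close(env);
        return 0;
    }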


> As for working set size, that is always merely the height of the B+tree.

This statement makes no sense to me. Are you using a different definition of "working set" than the rest of us? A working set size is application and access pattern dependent.

> It will always be far more efficient than any other DB under the same conditions

That depends on how broadly or narrowly one defines "same conditions" :-)


Identical hardware, same RAM size, same data volume.


That’s a bold claim. Are you saying that LMDB outperforms every other database on the same hardware, regardless of access pattern? And if so, is there proof of this?



You don't have to take my word for it. Plenty of other developers know. https://www.youtube.com/watch?v=CfiQ0h4bGWM


Since the first question of my two-part inquiry was not explicitly answered in the affirmative: To be absolutely clear, you are claiming, in writing, that LMDB outperforms every other database there is, regardless of access pattern, using the same hardware?


Not every.

LMDB is optimized for read-heavy workloads. I make no particular claims about write-heavy workloads.

Because it's so efficient, it can retain more useful data in memory than other DBs for a given RAM size. For DBs much larger than RAM it will get more useful work done with the available RAM than other DBs. You can examine the benchmark reports linked above; they provide not just the data but also the analysis of why the results are as they are.


voip.ms was pretty much offline for a couple of weeks while under a lengthy DDoS attack. They were only able to restore service by putting all their servers behind Cloudflare proxies to mitigate the ongoing DDoS.


I regularly have cell phone calls drop during tower handoffs, and codec errors that result in a blast of static upon answering a call. I can't remember a single time I had a phone call fail on the old PSTN built out of DMS10s and DMS100s locally (well, until we lost all trunks due to a fibre issue a couple of weeks ago on November 10th -- the outage started at ~3:20am, the incumbent didn't notice until ~9:30am, and it wasn't fixed until 17:38). One time when I was a teenager in the '90s, a friend and I had a 14 hour call using landlines.

The modern tech stack is disappointing in its lack of reliability. Complexity is the root of all evil.


It doesn't help that the big tech advertising platforms are making money off pushing ads peddling investment scams and malware. So long as there is a financial incentive to make money in ways that harm "users", things will only get worse. Governments have failed consumers.


Not sure how they're allowed to generate a profit or distribute dividends given the cost of the wildfires started by their complete and total failure to maintain equipment to minimum safety standards.


It's easier to write documentation than to completely rewrite a subsystem.


Circuit switching is not harder to do, it's simply less efficient. In the PSTN and ISDN world, circuits consumed bandwidth whether or not they were actively in use. There was no statistical multiplexing as a result.

Circuit switching packets means carrying metadata about the circuit rather than simply using the destination MAC or IP address to figure out routing along the way. ATM took this to an extreme with nearly 10% protocol overhead (48 bytes of payload in a 53 byte cell) and 22 bytes of wasted space in the last ATM cell for a 1500 byte ethernet packet. That inefficiency is what really hurt. Sadly the ATM legacy lives on in GPON and XGSPON -- EPON / 10GEPON are far better protocols. As a result, GPON and XGSPON require gobs of memory per port for frame reassembly (128 ONUs x 8 priorities x 9KB for jumbo frames = 9MB per port worst case), whereas EPON / 10GEPON do not.

MPLS also has certain issues that are solved by using the IPv6 next header feature, which avoids having to push / pop headers in the transport network (modifying the size of a packet has implications for buffering and the associated QoS issues, making the hardware more complex). MPLS labels made sense at the time of their introduction in the early 2000s, when transport network hardware was able to utilize a small table to look up the next hop of a frame instead of doing a full route lookup. The hardware constraints of those early days requiring small SRAMs have effectively gone away, since modern ASICs have billions of transistors which make on-chip route tables sufficient for many use-cases.


> Circuit switching is not harder to do, it's simply less efficient

I did specify more expensive. Even with ASICs it’s more expensive to scale up.

> ATM took this to an extreme with nearly 10% protocol overhead (48 bytes of payload in a 53 byte cell) and 22 bytes of wasted space in the last ATM cell for a 1500 byte ethernet packet. That inefficiency is what really hurt. Sadly the ATM legacy lives on in GPON and XGSPON -- EPON / 10GEPON are far better protocols. As a result, GPON and XGSPON require gobs of memory per port for frame reassembly (128 ONUs x 8 priorities x 9KB for jumbo frames = 9MB per port worst case), whereas EPON / 10GEPON do not.

ATM was a technology designed and pushed by the traditional voice telecoms long before everything converged on IP. The small cell size was chosen for lower jitter/latency for voice, QoS (also to prioritize voice), and more fine-grained multiplexing of voice traffic, with other data being secondary. ATM was originally designed in the late 1980s, before Ethernet won out and when bulk data transfer was a novelty; the traditional telecoms wanted something that mapped onto their existing circuits.

ATM’s overhead for IP was arguably too much, but it wasn’t what killed ATM. What killed it was that scaling the switching of such small cells up to and past gigabit speeds was too expensive, made worse by the fact that Ethernet became a commodity due to scale.

> Sadly the ATM legacy lives on in GPON and XGSPON -- EPON / 10GEPON are far better protocols. As a result, GPON and XGSPON require gobs of memory per port for frame reassembly (128 ONUs x 8 priorities x 9KB for jumbo frames = 9MB per port worst case), whereas EPON / 10GEPON do not.

This isn’t quite the same argument. I agree that 10GEPON is “better” and I wish it were THE standard (especially for pure ISPs), but that ignores the fact that most ISPs using XGSPON are multiplexing VoIP, TV, emergency, and other traditional networks. If anything, it’s what ATM should have been, where they could have had a different priority for small-celled VoIP and larger packets for other services. I say this as a person who thinks all of this is dumb and consumers should just be given an internet connection at this point; I hope 10GEPON wins out in the end, and it is certainly already cheaper. I much more hate having to use PPPoE than some memory overhead for the ports reassembling GEM frames, though.

As for MPLS, well yeah it was certainly faster than IP lookups, but the circuits also often result in sub-optimal routing as the MPLS “tunnels” don’t always reflect otherwise ideal physical paths. IIRC, about half of the IPv6 internet’s core is actually routed over MPLS tunnels and it can be a large reason IPv6 routing can often have higher TTLs than IPv4 (because the paths often aren’t as efficient). That being said, we’ll have to see if segment routing takes off, and what approaches stick.


Maybe people shouldn't be building data centers in deserts. In the city of Toronto, the Deep Lake Water Cooling system uses cold water drawn from Lake Ontario for the drinking water supply to cool a number of buildings. Most notably 151 Front Street West, which houses the data centers routing most of the Internet in Ontario.


I know one company that strove for five sixes.


MANET is one of the protocols I was involved in implementing for a certain network protocol suite back around 2012. Mesh routing protocols only work for the most limited of use cases. They don't know about the capacity of the underlying wireless network and basically fall apart when things are congested or there are radios with poor reception. QoS is implemented far better in modern cell phone networks, and if the routing protocol doesn't take QoS into account, it's gonna suck.


Interesting! If someone with a math background (but not a CS background) wanted to dig deep into learning about mesh networking protocol theory - do you have any recommendations for learning resources or places to get started?

I've long imagined that a content centric mesh network approach would be a better starting point than what we've built up currently, but it seems like such a deep and mysterious subject and I have no idea where to even begin to get started.


I never followed where things went after the contract was complete. Suffice it to say that we only cared about getting the protocol working, as the company was a contract engineering firm doing work for a product that was ultimately for military use. Actually testing it in the real world and improving behaviour was out of scope. We only tested it in a simulated network to make sure that the protocol correctly handled various cases (like certain wireless nodes not being able to see each other due to obstructions).

I had other friends back in the early 2000s working on WiMax, and the hardest part of their work was getting QoS right. More recently (still 10+ years ago), another friend implemented a TCP proxy for a major cell phone provider in the US that used a more wireless-friendly congestion control algorithm on the wireless side, as regular TCP breaks down when latency increases due to reception issues (which gets interpreted as congestion and triggers retransmits). Since the cellular base stations ensured that the wireless network was effectively lossless (albeit with periods of much higher latency), performance for end users increased substantially when the bulk of the TCP retransmits were suppressed.

There's a huge gap between making wireless work vs making it work well. For me, 5G is a step backwards as all the tricks used to push for higher data rates (like larger QAM constellations) make everything worse in rural areas with poor reception: there just isn't a good enough SNR 99% of the time for the new shiny, and the increased power usage does nothing other than drain my phone's battery faster than it did with older LTE. But that is where all the money for research is today.

Wireless is complicated.


It highly depends on what you plan to run your mesh on. Meshes that run on wired networks (be it copper or fiber) are vastly different from meshes that run over radio, and more different still depending on which part of the radio spectrum is used.

If you're after a fast, low-latency connection, then look into existing meshes; almost every popular mesh has some sort of paper describing how it works.

