If you've got big offline edits (or you're merging multiple large sets of edits), even existing CRDTs will generally handle that more efficiently than OT will. OT algorithms usually take O(n * m) time to merge n edits from one peer with m edits from another. A CRDT like diamond-types is O((n + m) * log(s)), where s is the current size of the document. In practice it's super fast.
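To make the O(n * m) shape concrete, here's a toy sketch (insert-only ops, no tie-breaking, not any real library's API): every remote op has to be transformed against every concurrent local op.

```typescript
// Toy OT merge with insert-only ops. Merging m remote ops against n local ops
// does n * m transform calls -- that's where the quadratic cost comes from.
type Insert = { pos: number; text: string };

// Shift `a` right if the concurrent op `b` inserted at or before its position.
function transform(a: Insert, b: Insert): Insert {
  return b.pos <= a.pos ? { ...a, pos: a.pos + b.text.length } : a;
}

function mergeRemote(local: Insert[], remote: Insert[]): Insert[] {
  // Each of the m remote ops is transformed against all n local ops.
  return remote.map(r => local.reduce((op, l) => transform(op, l), r));
}
```

The log(s) term on the CRDT side is the per-op cost of locating where each edit lands in the current document via a tree lookup, rather than scanning the other peer's entire set of edits.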
As for holding deleted states and richer information per unit, it's not so bad in absolute terms. 1-2 MB of data in memory for a 17-page document is honestly fine. But there are also a few different techniques that exist to solve this in CRDTs:
1. Yjs supports "garbage collection" APIs. Essentially you say "anything deleted earlier than this point is irrelevant now" and the data structures flatten all runs of items which were deleted before then. So storage stays proportional to the size of the not-deleted content. (There's a small Yjs sketch of this after the list.)
2. Sync9 has an algorithm called "antimatter" which Mike still hasn't written up (poke poke, Mike!). Antimatter actively tracks the set of all peers on the network. When a version is known to have been witnessed by every peer, all the extra information about it is safely discarded. You can also configure it to assume that any peer which has been offline long enough is gone forever. (A rough conceptual sketch of the bookkeeping follows the list.)
3. Personally I want a library to have an API method for taking all the old data and just saving it to disk somewhere. The idea would be to reintroduce the devops simplicity of OT, where you can just archive old history once you know it probably won't ever be referenced again. Keep the last week or two hot, and delete or archive history at will. If you combined this with a "rename" operation, you could reduce the "hot" dataset to basically nothing. This would also make the implementation much simpler, because we wouldn't need all these performance tricks to make a CRDT like diamond-types fast if the dataset stayed tiny anyway. (A hypothetical API sketch also follows the list.)
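For point 1, here's roughly what that looks like in Yjs, as I understand its API (the `gc` flag on `Y.Doc`; exactly when and how aggressively deleted runs get flattened is up to the library):

```typescript
import * as Y from 'yjs'

const doc = new Y.Doc({ gc: true }) // gc is on by default; shown for clarity
const text = doc.getText('body')

text.insert(0, 'hello brave new world')
text.delete(5, 6) // delete ' brave'

// With gc enabled, the run of deleted items can be collapsed into a small
// marker, so the encoded document tracks the surviving content rather than
// every historical keystroke.
const update = Y.encodeStateAsUpdate(doc)
console.log(text.toString(), update.byteLength)
```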
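For point 2, antimatter itself isn't written up anywhere, so this is just the prose above restated as a conceptual sketch of the bookkeeping, not the real algorithm:

```typescript
// Track the highest version each known peer has acknowledged. Anything at or
// below the minimum has been witnessed by everyone and can be pruned.
type PeerId = string

class PruneTracker {
  private acked = new Map<PeerId, number>()

  constructor(peers: PeerId[]) {
    for (const p of peers) this.acked.set(p, -1)
  }

  acknowledge(peer: PeerId, version: number): void {
    this.acked.set(peer, Math.max(this.acked.get(peer) ?? -1, version))
  }

  // Everything up to this version can have its tombstones / concurrency
  // metadata discarded.
  prunableUpTo(): number {
    return Math.min(...this.acked.values())
  }

  // Optionally: treat a peer that's been offline too long as gone forever.
  dropPeer(peer: PeerId): void {
    this.acked.delete(peer)
  }
}
```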
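And for point 3, the API I'm imagining is something like this. Purely hypothetical, names and all; no existing library exposes it:

```typescript
// Hypothetical interface -- a sketch of the "archive cold history" idea, not
// a real library API. Version / ArchiveBlob stand in for whatever encoding
// the library actually uses.
type Version = Uint8Array
type ArchiveBlob = Uint8Array

interface ArchivableDoc {
  // Move every operation at or before `version` into an opaque blob the
  // caller can stash on disk or in object storage, leaving only the hot
  // tail of history in memory.
  archiveBefore(version: Version): Promise<ArchiveBlob>

  // Only needed if some ancient offline peer shows up wanting to merge
  // against pre-archive history.
  rehydrate(blob: ArchiveBlob): Promise<void>
}
```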