Honestly, the most amazing thing I did with logs was learn how to do subtraction. Any time you have multiple instances of a thing and only some of them are bad, you can easily find the problem (if anyone bothered to log it) by performing bad - good.
The way you do this is by aggregating logs by fingerprints. Removing everything but punctuation is a generic approach to fingerprinting, but it is not exactly human friendly. For Java, log4j can include the class name in your logging pattern, and that plus log level is usually pretty specific.
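A minimal sketch of the "everything but punctuation" fingerprint, assuming log lines come in as plain strings (the regex is one of several reasonable ways to do it):

```python
import re

def fingerprint(line: str) -> str:
    # Delete word characters and whitespace, keeping only punctuation.
    # Variable parts (ids, numbers, timestamps) collapse away, so lines
    # produced by the same log statement map to the same template.
    return re.sub(r"[\w\s]+", "", line)

# "Error: foo timeout after 532 ms" and "Error: bar timeout after 9 ms"
# both fingerprint to ":" and aggregate into one bucket.
```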
Once you have a fingerprint, the rest is just counting and division. Over a specific time window, count the number of log events for every fingerprint, for both good and bad systems. Then score every fingerprint as (1 + # of bad events) / (1 + # of good events), and everything at the top is most strongly bad. And the more often it's logged, the further up it will be. No more lecturing people about "correct" interpretations of ERROR vs INFO vs DEBUG. No more "this ERROR is always logged, even during normal operations".
Isn’t this basically what structured binary logs are? Instead of writing a string like `Error: foo timeout after ${elapsed_amt} ms` to the log, you write a 4-byte error code and a 4-byte integer for elapsed_amt. I know there are libraries like C++’s nanolog that do this for you, under the hood.
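A toy illustration of the idea, in Python for brevity (the error code constant is hypothetical; real libraries like nanolog assign IDs to format strings automatically and do this far more efficiently):

```python
import struct

# Hypothetical numeric code standing in for the "foo timeout" message.
ERR_FOO_TIMEOUT = 1

def write_binary_record(buf: bytearray, code: int, elapsed_ms: int) -> None:
    # 4-byte error code + 4-byte elapsed time, little-endian:
    # 8 bytes total, versus ~30 bytes for the formatted string.
    buf.extend(struct.pack("<II", code, elapsed_ms))

buf = bytearray()
write_binary_record(buf, ERR_FOO_TIMEOUT, 532)
```

Note that the error code itself is already a fingerprint, so the counting trick above works on these records with no regex at all.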
Maybe this is a silly question, but is there much value in a 4-byte binary code compared to a human-readable log with human-readable codes? Maybe size, but logfmt in particular is not much less compact than binary data.