Honestly, the most amazing thing I did with logs was learn how to do subtraction. Any time you have multiple instances of a thing and only some of them are bad, you can easily find the problem (if anyone bothered to log it) by performing bad - good.
The way you do this is by aggregating logs by fingerprints. Removing everything but punctuation is a generic approach to fingerprinting, but it is not exactly human friendly. For Java, log4j can include the class name in your logging pattern, and that plus log level is usually pretty specific.
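A minimal sketch of the "everything but punctuation" fingerprint, assuming log lines come in as plain strings (the regex is one of several reasonable ways to do it):

```python
import re

def fingerprint(line: str) -> str:
    # Delete word characters and whitespace, keeping only punctuation.
    # Variable parts (ids, numbers, timestamps) collapse away, so lines
    # produced by the same log statement map to the same template.
    return re.sub(r"[\w\s]+", "", line)

# "Error: foo timeout after 532 ms" and "Error: bar timeout after 9 ms"
# both fingerprint to ":" and aggregate into one bucket.
```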
Once you have a fingerprint, the rest is just counting and division. Over a specific time window, count the number of log events for every fingerprint, for both good and bad systems. Then score every fingerprint as (1 + # of bad events) / (1 + # of good events), and everything at the top is most strongly bad. And the more often it's logged, the further up it will be. No more lecturing people about "correct" interpretations of ERROR vs INFO vs DEBUG. No more "this ERROR is always logged, even during normal operations".
Isn’t this basically what structured binary logs are? Instead of writing a string like `Error: foo timeout after ${elapsed_amt} ms` to the log, you write a 4-byte error code and a 4-byte integer for elapsed_amt. I know there are libraries like C++’s nanolog that do this for you, under the hood.
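A toy illustration of the idea, in Python for brevity (the error code constant is hypothetical; real libraries like nanolog assign IDs to format strings automatically and do this far more efficiently):

```python
import struct

# Hypothetical numeric code standing in for the "foo timeout" message.
ERR_FOO_TIMEOUT = 1

def write_binary_record(buf: bytearray, code: int, elapsed_ms: int) -> None:
    # 4-byte error code + 4-byte elapsed time, little-endian:
    # 8 bytes total, versus ~30 bytes for the formatted string.
    buf.extend(struct.pack("<II", code, elapsed_ms))

buf = bytearray()
write_binary_record(buf, ERR_FOO_TIMEOUT, 532)
```

Note that the error code itself is already a fingerprint, so the counting trick above works on these records with no regex at all.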
Maybe this is a silly question, but is there much value in a 4-byte binary code compared to a human-readable log with human-readable codes? Maybe size, but logfmt in particular is not much less compact than binary data.