Hi everyone,
We are currently shipping our application logs directly to Elasticsearch, and we are trying to find a way to analyze our error patterns more effectively. Specifically, we need to:
- Identify the unique error types (grouping identical kinds of errors together).
- Get the exact occurrence count for each unique error type.
The Problem: If we do a standard terms aggregation on our raw message or text fields, every log line looks completely unique. This is because our log strings contain highly dynamic variables that change on every request, such as:
- User/Tenant IDs (e.g., UUIDs)
- Dynamic database/row numeric IDs
- IP addresses and hostnames
- Hex strings or timestamps embedded in the message
Because of this high cardinality, our dashboards are flooded with thousands of individual buckets instead of showing us the top 5 or 10 structural errors that are actually breaking our application.
Our Goal: We want a clean way to strip away or ignore these dynamic variables so Elasticsearch can recognize that "Connection timed out to database-123" and "Connection timed out to database-456" are actually the exact same error, and count them together as 2 occurrences.
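To make the goal concrete, here is a minimal Python sketch of the behaviour we are after. It runs outside Elasticsearch and is only an illustration: the placeholder tokens, the `RULES` list, and the `normalize` / `count_patterns` helpers are hypothetical names of our own, not an Elastic feature, and a hand-maintained regex list like this is exactly what we would prefer not to maintain.

```python
import re
from collections import Counter

# Hypothetical masking rules (our assumption, not an Elastic API).
# Order matters: specific patterns run before the generic number rule.
RULES = [
    (re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
                r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"), "<UUID>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def normalize(message: str) -> str:
    """Replace dynamic tokens with fixed placeholders."""
    for pattern, placeholder in RULES:
        message = pattern.sub(placeholder, message)
    return message


def count_patterns(messages):
    """Count occurrences of each normalized message."""
    return Counter(normalize(m) for m in messages)


if __name__ == "__main__":
    logs = [
        "Connection timed out to database-123",
        "Connection timed out to database-456",
    ]
    for pattern, count in count_patterns(logs).most_common():
        print(count, pattern)
```

With the two example messages above, both collapse to the single key "Connection timed out to database-<NUM>" with a count of 2, which is the kind of bucket we would like to see in a dashboard.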
My questions for the community:
- What is the standard industry practice or architecture for grouping unstructured logs by error patterns inside the Elastic Stack?
- Should this deduplication/normalization be handled during ingestion (via an Ingest Pipeline or Logstash), or is there a way to do this at query time/inside Kibana?
- Are there any native features (like Machine Learning or out-of-the-box processors) that handle this automatically without requiring us to manually maintain a massive list of regex patterns?