Schema-less records support

Hi,

I use elasticsearch to store logs collected by fluentd within Kubernetes.

One of the pain points I keep hitting time and time again is elasticsearch rejecting logs because the field types change between log messages.

For example, one app logs {"level":40, "levelName": "INFO"} and another one logs {"level": "INFO" }. In this case elasticsearch will reject the second of these log messages for not matching the schema it inferred from the first.

I don't like it when I lose log messages, since I might need those logs for something.

In my ideal universe Elasticsearch could operate in a schema-less manner like MongoDB does - basically all fields would be of the same variant type by default and there would be no conflicts. However, that doesn't seem to be the direction Elasticsearch is going in.

I do wonder if there's some hack I could use to get a similar effect, though. Perhaps a mapping that automatically renames fields to avoid type conflicts, like if it's a number then add "number" to the name or something along those lines.

Has anyone found a good solution for this problem?

Dobes,
If you have not explicitly blocked dynamic mapping, Elasticsearch will create a new field the first time it sees one in a document, so it does not require a schema defined upfront the way an RDBMS does.
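For example (the index name here is just made up), if you index this document first, Elasticsearch infers the mapping on its own:

PUT fluentd-demo/_doc/1
{ "level": 40, "levelName": "INFO" }

GET fluentd-demo/_mapping

The mapping will show "level" as long, which is why a later document with "level": "INFO" in the same index is rejected with a mapping conflict.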

But it needs to understand the data type for a good reason. Integer 2 is less than integer 10, but the string "2" is greater than the string "10". Indexing with the wrong data type can cause problems with queries, and encoding and storage differ between data types too.

If it's okay for your use case to store everything as a string, you can define a dynamic mapping for the index that maps all new fields to keyword or text, and you will not lose any data. See Dynamic Mapping.
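As a rough sketch (the template name and index pattern are placeholders, and this assumes a version with composable index templates, 7.8 or later; the older _template API works similarly), dynamic templates can map newly seen numeric and boolean fields to keyword, so their values are indexed as strings instead of causing conflicts:

PUT _index_template/logs-as-strings
{
  "index_patterns": ["fluentd-*"],
  "template": {
    "mappings": {
      "dynamic_templates": [
        { "longs_as_keywords": { "match_mapping_type": "long", "mapping": { "type": "keyword" } } },
        { "doubles_as_keywords": { "match_mapping_type": "double", "mapping": { "type": "keyword" } } },
        { "booleans_as_keywords": { "match_mapping_type": "boolean", "mapping": { "type": "keyword" } } }
      ]
    }
  }
}

New string fields already default to text with a keyword sub-field, so with this in place scalar values of any type end up searchable as strings. It does not help with the string-vs-object case, though.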

You can also use an ingest pipeline to rename fields, for example changing the field "level" to "level_int" or "level_string" depending on its type, and then use dynamic mapping to set the type based on the field name pattern. But inspecting every field of every document can be very expensive.
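For a single known field, that could look roughly like this (the pipeline and template names, the *_int / *_string suffixes, and the index pattern are just my own illustration):

PUT _ingest/pipeline/split-level-by-type
{
  "processors": [
    {
      "script": {
        "source": "if (ctx.level instanceof String) { ctx.level_string = ctx.remove('level'); } else if (ctx.level != null) { ctx.level_int = ctx.remove('level'); }"
      }
    }
  ]
}

PUT _index_template/split-by-name
{
  "index_patterns": ["fluentd-*"],
  "template": {
    "mappings": {
      "dynamic_templates": [
        { "named_ints": { "match": "*_int", "mapping": { "type": "long" } } },
        { "named_strings": { "match": "*_string", "mapping": { "type": "keyword" } } }
      ]
    }
  }
}

Generalizing the script to walk every field of every document (including nested objects) is where it gets expensive.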

Hi,

The problem of sorting strings and integers together is already solved in MongoDB (just as an example): it sorts by type first, then by value. Elasticsearch could likewise just pick an arbitrary but consistent cross-type sort order, as MongoDB did.

I do not want to convert everything to a string. We log status codes, metrics, and other numeric values. Even if it were only about strings, booleans, and numbers that might be OK, but there are also objects to consider: sometimes a value comes in as just a string and sometimes as an object with fields of its own.

I have thought about the renaming approach but it seems like a bit of a big project, and it makes the search experience less pleasant due to having to use those suffixes all the time. But in the end it might be the only workable approach.

I agree that this is not easy, and it is expensive.

But I feel the MongoDB approach may be too lenient on bad data in certain use cases. For example, suppose a bug in the upstream code shifted all fields to the left and generated {"level": "INFO" } instead of {"level":40, "levelName": "INFO"}. In this case the MongoDB logic would silently load bad data for all fields. Once I figure out the issue, how can I identify the bad documents?

With Elasticsearch, if a change to logging causes a type conflict, those logs are all thrown away, which is worse than the MongoDB behavior, where you at least still have the "bad" documents.

For me, it's more important to keep all the logs. The truth is, for most fields in a log I don't even care that much about searching them. Perhaps if Elasticsearch would still accept a document with a type conflict but just mark the conflicting fields as unsearchable, or rename only the invalid fields, that would solve it.

Elasticsearch rejects bad data, but with a status code (400 in this case) and a reason. For example, the bulk API will insert all valid documents in the batch and reject the invalid ones. From the response I can find the invalid documents and insert them into Kafka, or into a separate Elasticsearch index that has a single reason field and dynamic mapping turned off. Those can be investigated and re-ingested. There is no data loss.
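The holding index itself is simple to set up (the name rejected-logs is just an example); with "dynamic": false the rejected documents are kept as-is in _source, and only the reason field is actually indexed, so nothing in them can conflict:

PUT rejected-logs
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "reason": { "type": "text" }
    }
  }
}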

Logstash handles this with its dead-letter queue: https://www.elastic.co/guide/en/logstash/current/dead-letter-queues.html
