ES for logs -- field conflict hell 🔥

In a large enterprise with diverse log sources, if you want to provide the benefits of structured logging you could turn on dynamic mapping. But those diverse sources may not be able to coordinate and share schema, and it may be challenging for you to split your logs into different indices by source...

What you end up with is conflicts left and right and lots of dropped documents.

Anyone else run into this situation?

What have you done to deal with this?

I have learned this the hard way, and hence I put only structured data in my index.

We have structure in our index.

The problem is that documents have conflicting fields.

As far as I know there is no perfect solution to this, that is, if you want to avoid conflicts completely while having fully dynamic mappings and not enforcing any control over the data sent in.

So far we have "solved" this by a combination of static and dynamic. Most indices (logs) have a range of static mappings that are non-negotiable, then what we do is allow some fields to be dynamic, for example we have had the fields kubernetes.* to be dynamic so any combination of subfields can be used.
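As a minimal sketch (not our actual config; the template name, index pattern, and static fields are illustrative), that combination might be expressed in an index template like this:

PUT _index_template/logs-static-plus-dynamic
{
  "index_patterns": ["logs-myapp-*"],
  "template": {
    "mappings": {
      "dynamic": false,
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" },
        "log": { "properties": { "level": { "type": "keyword" } } },
        "kubernetes": { "type": "object", "dynamic": true }
      }
    }
  }
}

Here the root ignores unmapped top-level fields (dynamic: false; strict would reject them instead), while anything under kubernetes.* still gets mapped dynamically.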

Our general purpose for logs is security and inspection through security rules; as such, there are limits to what kind of dynamics we can allow (it would be impossible to write security rules against fully dynamic indices without any control over the possible naming schemes).

Thanks for your input, Joel.

Our use case is application diagnostics. And with effectively thousands of different "services" logging, it's hard to control what they log.

Power in structure: We want to enable them to have the power of structured logs (and the power of analyzing logs from all the various services together, so they're all in the same index), but now we have to manage the conflicts.

Sharing schema: We're encouraging use of ECS, and working on building out custom fields atop ECS to help the services share a schema, but development and adoption are slow.

One way: collapse to text: Right now we're aiming to resolve conflicts by converting conflicting documents to serialized JSON and putting that into message. That will keep the documents from just getting dropped, which has been a big pain for us. It also keeps the data mostly searchable. But it doesn't preserve structure on the conflicting logs.

Given the hard constraint of indices having one schema, I don't think there is a perfect way to handle these conflicts.

What kind of mapping conflicts are you having?

If you set index.mapping.ignore_malformed to true in your index templates, you can make Elasticsearch ignore only the conflicting field instead of rejecting the entire document, but this does not work for all kinds of conflicts; it depends on the data type.

Unfortunately the most common mapping conflict is when a field is an object but the mapping expects a concrete value, or vice versa. This kind of mapping conflict will make Elasticsearch reject the entire document even with ignore_malformed set to true.
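For reference, enabling it index-wide looks something like this (the template name and pattern are just examples):

PUT _index_template/logs-ignore-malformed
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.mapping.ignore_malformed": true
    }
  }
}

Fields dropped this way are recorded in the document's _ignored metadata field, so you can find affected documents with an exists query on _ignored.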

Another option that can help is to use dynamic templates to tell Elasticsearch how to map fields based on their value type (string or numeric, for example) or based on some pattern in the field name.
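For instance, a dynamic template keyed on a field-name pattern might look like this (names illustrative):

PUT _index_template/logs-dynamic-patterns
{
  "index_patterns": ["logs-*"],
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "labels_as_keyword": {
            "path_match": "labels.*",
            "mapping": { "type": "keyword" }
          }
        }
      ]
    }
  }
}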


In addition to @leandrojmp's excellent suggestions....

In general, if you use 8.x or newer releases, use the proper data stream naming convention:

logs-<datastream.dataset>-<datastream.namespace>

(for example, logs-nginx.access-production). Many of the issues above then have a better outcome.

AND @rsk0 @leandrojmp @elasticforme

I cannot comment on the time frame, but I understand real design work is going on behind the scenes to solve the mapper parsing rejections. I cannot comment more at this time, but I believe it is being actively worked on and will be a significant plus when released.

I had an additional comment. You write "and the power of analyzing logs from all the various services together, so they're all in the same index", but have you looked at combining sources through data views instead?

I mean, if you combine similar sources into the same indices you minimize the risk of conflicts, and then you create data views that combine all the relevant indices through wildcards, so that when searching you only search one data view.

If you set index.mapping.ignore_malformed to true in your index templates, you can make Elasticsearch ignore only the conflicting field instead of rejecting the entire document, but this does not work for all kinds of conflicts; it depends on the data type.

Our conflicts are either concrete v. concrete or concrete v. object in roughly even amounts. I would imagine that the frequency of these two major categories of conflict would be heavily dependent on the nature of the data folks are working with, but you're right that for us concrete v. object is very common.

So, yeah, I think you're right that ignore_malformed wouldn't serve us well since it would fail to deal with half our issues, let alone the fact that ignore_malformed tosses out the data in the field when we'd rather preserve it if at all possible.

Another option that can help is to use dynamic templates to tell Elasticsearch how to map fields based on their value type (string or numeric, for example) or based on some pattern in the field name.

I think dynamic templates could be useful in some scenarios, but if we've got customer A sending in {"status":200} and customer B sending in {"status":"OK"}, I'm not clear on how dynamic templates could help here.

In general, if you use 8.x or newer releases, use the proper data stream naming convention:
logs-<datastream.dataset>-<datastream.namespace>
Many of the issues above then have a better outcome.

Ah, I think this is related to the idea of placing documents into distinct indices to avoid conflicts. Since we don't have information on the incoming documents to identify their schema (unless we look at the actual fields used), I don't think we could make use of the dataset.

I cannot comment on the time frame, but I understand real design work is going on behind the scenes to solve the mapper parsing rejections. I cannot comment more at this time, but I believe it is being actively worked on and will be a significant plus when released.

👀

Well, that's exciting!

Part of why I encourage my company to stick with Elastic as a platform is the proven history of feature improvements and the expectation that various pain points we have will keep getting ironed out. Field conflicts have been a thorn in our side for a while, and it's causing grumblings about the choice of data store.

I have no idea how common our scenario of logs + mixed schema + conflicts is among Elastic users, but thinking about what led us here I have to imagine there must be a sizable contingent of users in the same boat, and so Elastic must care about our use case and be interested in making it work better. It would have been good to get a sense of how much Elastic cares here, especially as internal detractors have been growing more vocal. Maybe a product manager could chime in.

But it does indeed sound from what you're saying that this particular issue of conflicts is getting more love. I'm absolutely delighted to hear it. ❤️

Meanwhile, we're now implementing (via the Logstash DLQ) a subset of what I expect Elastic is working on.

Our Solution, Phase 1

As a first pass we'll simply be taking any docs with conflicts and stuffing their entire JSON source into message. This keeps docs from getting dropped, and the contents remain searchable. We lose structure, of course, so we can't point at fields specifically, or use them according to their types (numerics, keywords).

An alternative here is to convert the entire doc to a flattened type and shove it into, say, _document_flattened, which should make the individual fields available for searching or querying distinctly, which is nice, and preserves a range of operations. However, our customers wouldn't be able to use those fields in their original locations and wouldn't be able to use all query types / aggregations.
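A rough sketch of that alternative (the index pattern and field name are just placeholders):

PUT _index_template/logs-conflict-fallback
{
  "index_patterns": ["logs-myapp-*"],
  "template": {
    "mappings": {
      "properties": {
        "_document_flattened": { "type": "flattened" }
      }
    }
  }
}

Subfields of a flattened field remain queryable as keywords, e.g.:

GET logs-myapp-*/_search
{
  "query": { "match": { "_document_flattened.status": "OK" } }
}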

Our Solution, Phase 2

This will be a more "articulated" form of reprocessing where we preserve all the original structure except the conflicts, and move the conflicting fields either to new fields (e.g., a conflicting {"status":"OK"} might be shunted to status_text) or append them to a common dumping field for conflicts.
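As a sketch, that shunting could be done with an ingest pipeline's rename processor (the condition and field names here are illustrative):

PUT _ingest/pipeline/shunt-status-conflicts
{
  "processors": [
    {
      "rename": {
        "if": "ctx.status instanceof String",
        "field": "status",
        "target_field": "status_text",
        "ignore_missing": true
      }
    }
  ]
}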

I would love to put in a feature request for automatic handling of conflicts via index-level config (settings.index.mapping.conflicts.rename?) ... But only if something like this isn't already in the works.

BTW @rsk0 I pinged our internal engineering to read through this / your solutions ... as I said we are working hard to make this much easier. Great Feedback!


have you looked at combining sources through data views instead

We thought about it, but there was a lot of complexity around how to make it work, stemming from how data views generate errors when viewing across conflicting indices in Discover.

But this is the main goal of this setting: to ignore just the conflicting field and not the entire document, so you can check why the field was ignored and fix the mapping.

It really depends on each case and what you want to solve; you may create a dynamic template to map every numeric dynamic field as a keyword and also map every string dynamic field as a keyword.
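For the {"status":200} vs. {"status":"OK"} example, something like this (names illustrative) would make both variants land in the same keyword field:

PUT _index_template/logs-dynamic-keyword
{
  "index_patterns": ["logs-*"],
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "longs_as_keyword": {
            "match_mapping_type": "long",
            "mapping": { "type": "keyword" }
          }
        },
        {
          "strings_as_keyword": {
            "match_mapping_type": "string",
            "mapping": { "type": "keyword" }
          }
        }
      ]
    }
  }
}

Whichever document arrives first, status is dynamically mapped as keyword, and the other variant indexes cleanly as the string "200" or "OK". The tradeoff is losing numeric range queries and aggregations on that field.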

I would say that this kind of mapping conflict error is pretty common; if you search the forum you will find a lot of topics about it. But in my opinion the root cause is a misconception about how Elasticsearch works.

Elasticsearch is not schema-less; it is schema-on-write, with some features for schema-on-read as well (like runtime fields), but having a schema (the mappings) is a requirement.

There are some ways to avoid the issues of conflicting mappings, but in the end the only solution is to fix the mappings and reduce the number of dynamic fields.

This is a lot of work, but using ECS as a reference helps a lot. You can fix this at the source, by talking with the internal dev teams to adopt a common logging schema, or during ingestion, by parsing the source message and renaming the fields.

Thanks for raising this. Avoiding document rejections is something we deeply care about and have invested quite a bit in recently, with some initiatives still ongoing.

  • Setting index.mapping.ignore_malformed: true is a good start. Our default index template for logs-*-* sets this by default now.
  • Mapping explosions due to having too many fields are often also an issue with dynamic mappings. We've worked on a feature that ignores fields beyond the limit instead of rejecting documents. The index.mapping.total_fields.ignore_dynamic_beyond_limit flag will be released in 8.13 and is also set by default for logs data streams.
  • As mentioned in this thread, object/scalar conflicts aren't handled by ignore_malformed. Instead of ignoring these conflicts, we've improved the support for subobjects: false. You can set this at the root of your mapping and Elasticsearch will be able to ingest a document like this: { "foo": "bar", "foo.bar": "baz" }. The best part is that both fields are indexed successfully and no data is dropped or ignored. We've added support for auto-flattening documents and mappings. This means that you can still send documents that use nested objects as opposed to dotted field names, such as { "foo": { "bar": "baz" } }. Also, you don't need to change your mappings to use dotted field names (this is coming in 8.14). The tradeoff is that the nested field type is not supported when setting subobjects to false. (See the sketch after this list.)
  • We're also working on a failure store that stores documents that failed because of mapping issues (which should be more rare now) or if the ingest pipeline fails. You can't fully search these failed documents as they're stored as blobs but you can analyze the reason for why they've failed and you can look at the full document source.
  • Another small enhancement for indexing logs to logs-*-* is that we assign a default timestamp if none was provided.
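A minimal sketch of the subobjects behavior described above (index name illustrative):

PUT logs-mixed-notation
{
  "mappings": { "subobjects": false }
}

# Dotted field names and a leaf sharing the "foo" prefix both index fine:
POST logs-mixed-notation/_doc
{ "foo": "bar", "foo.bar": "baz" }

# Nested-object notation is auto-flattened to the dotted form:
POST logs-mixed-notation/_doc
{ "foo": { "bar": "baz" } }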

We're also working on UIs to make it easier to detect and remediate ignored fields and rejected documents (those that are in the failure store).

Let me know if this resonates and whether you have any other suggestions or feedback.


"Fix the mappings" is a bit concerning, but maybe we're still on the same page...

If you want to provide the benefits of structured logs but can't dictate all incoming schema, you will always have the possibility of conflicts.

In a large enough organization, you can't dictate all schema. You're going to be collecting logs from numerous services developed internally and from infrastructure services based on software whose structure is outside the deploying teams' control. And in the name of flexibility, you can't dictate all schema by saying "You can't have logs for your service until you conform." That breaks the ability to rapidly onboard new services or new information logged by existing services. So I think many customers are going to be in a situation where unexpected schema is a part of their lives, or at least where pre-emptively coding up schema in ES is practically impossible or very burdensome, and I imagine that dynamic mapping was created for exactly such situations.

If by "fix the mappings" you mean do what you can to get customers who have control over their schemas to work together to unify, I agree that that's an important part of the overall solution, and it does "reduce the number of dynamic fields". But that's not a complete solution because there will always be a contingent that will conflict.

It makes me happy -- Felix -- that Elastic are working on features to ease coping with conflicts and rejections. Much appreciated. I like the additional flexibility and options these features bring to the tool chest. (I don't mean to disrespect the effort and improvement by not responding to the features individually. They really sound great.)

(I'm going to have to put some effort into wrapping my head around object/scalar conflicts and how dot names and subobjects work; the promise of not dropping data is very appealing.)

Summarizing: Using Structure And Minimizing Data Loss

I think the whole process might look like this:

  • enable dynamic mapping so you can receive and use structure without needing to know it ahead of time
  • create and grow a common schema for sources to adopt, improving cross-source analysis and reducing field proliferation and conflicts
  • conserve fields
    • encourage conservation techniques like using the flattened type and "index": false
    • block (or transform/flatten) documents that have abusive field counts/depth
  • deal with conflicts, increasingly better over time
    • first, stop dropping documents
      • maybe drop fields using ignore_malformed and/or ignore_dynamic_beyond_limit
      • maybe avoid scalar/object conflicts using subobjects: false
      • maybe transform entire documents when they have conflicting fields (by serializing to JSON into message -- retaining data instead of outright dropping it, maintaining searchability if not full functionality for all types)
    • maybe relocate conflicting fields (e.g., if "status": "OK" conflicts with an existing integer type, create a status_text instead -- @felixbarny, what do you think of an ES feature to make this easy?)
  • observe conflicts via _ignored and/or the coming "failure store" feature and try to resolve them upstream via schema dev/adoption or maybe ETL transformation

Oh, also,

  • conserve fields
    • consider using multiple indices / streams to cope with large quantities of fields

For example, if you find your logs are eating up over a thousand fields, perhaps split them across multiple series/streams:

  • logs-all_logs-a.2024-03-29 or logs-datastream-all_logs-a gets 1000 fields
  • logs-all_logs-b.2024-03-29 or logs-datastream-all_logs-b gets the overage

Then query across both sets, a and b, using a single data view / index pattern.
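For example, a single wildcard search (or a Kibana data view with the same pattern) would cover both series:

GET logs-all_logs-*/_search
{
  "query": { "match": { "message": "timeout" } }
}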

I think this should mitigate the "too many fields" problem as mentioned in the mapping limit settings doc, but I'm not entirely sure and would love to have an elastician chime in.

[edit: You would also want to minimize as much as possible field conflicts between the series to avoid Kibana balking.]

[edit: Oh, apparently this has already been suggested by @warkolm! Thanks, Mark!]


This is a good summary and aligns with how I'm thinking about it. I acknowledge that we need to get better at documenting and explaining these things, especially the improvements that we've added recently.

If you don't know the structure of a JSON document at all, or if you already know that a particular source produces very unpredictable JSON documents, it can be a good idea to map them to a field of the flattened field type.

Probably you're aware of it but just to be clear, when fields get ignored, the values aren't lost. You can still look at them by retrieving the _source. However, you won't be able to do aggregations on the ignored field values.

We had similar discussions in the past about being able to accept multiple field types for the same field. We didn't prioritize that for now, however, as it wouldn't be trivial to do. Your suggestion regarding renaming sounds easier to achieve, but it may also lead to broken dashboards, as it breaks assumptions about how a field is named.

So for now, the main focus was to not drop the document, retain the value of the ignored field, and improve monitoring and remediation for ignored fields. I suppose it's fine to just retain the value for conflicting fields in a lot of cases. For others, it's important that they're aggregatable, as dashboards or even alerts rely on it. To fix that, you'll often need to either fix the problem in the application that emits the log, or introduce some kind of custom processing (for example, using ES ingest pipelines) to fix, translate, or coerce the value from a particular shipper. Basically what you wrote down in your next point.

Yes, this is a good addition. You can use different datasets or namespaces to "sandbox" a particular service or set of services. You also don't want to create too many data streams, as that can be a burden on Elasticsearch. But separating related applications into their own data streams is a good practice and limits the blast radius one service can have on others. You can either do that data stream separation directly in the shipper or use the reroute processor to centrally manage the routing conditions.
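A sketch of the reroute processor approach (the condition and dataset names are illustrative):

PUT _ingest/pipeline/logs-routing
{
  "processors": [
    {
      "reroute": {
        "if": "ctx.service?.name == 'billing'",
        "dataset": "billing",
        "namespace": "production"
      }
    }
  ]
}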


In a large enterprise with diverse log sources

In our experience it is best to have logs with a similar structure or source type share an index / data stream. So if you have many instances of X different applications/services, then you'd have X different indices / data streams. This provides a number of benefits:

  • Each index / data stream can have its own set of field mappings (static + dynamic). Type conflicts, in our experience, often come from different applications/services using the same field name but different type.

  • Each index, having its own mappings, ends up with fewer total field mappings, avoiding the field-mapping explosion hazards of fewer, shared indices. That also keeps mapping overhead down.

  • Each index / data stream can utilize common or specialized Index Templates that enable different retentions (via ILM), mappings, and Ingest Pipeline configuration.

  • Using the data stream naming convention like logs-{dataset}-{namespace}, it is easy to define a Data View that spans the full hierarchy of log sources (e.g., logs-*) in order to simplify search and analysis.

Other pro-tips:

  • When you have direct control of the log format then use standard Elastic Common Schema (ECS) field names wherever possible.

  • Create explicit, static mappings for known, expected fields.

  • When you do not have direct control of the log format, use any of the following to normalize the data into a useful structure (see the pipeline sketch below). Conditioning data is a common part of any analysis use case.
    • Filebeat / Elastic Agent processors
    • Logstash filters
    • Ingest pipelines
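For instance, a small ingest pipeline that renames source-specific fields to their ECS equivalents (the source field names here are hypothetical):

PUT _ingest/pipeline/normalize-to-ecs
{
  "processors": [
    {
      "rename": {
        "field": "remote_ip",
        "target_field": "client.ip",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "severity",
        "target_field": "log.level",
        "ignore_missing": true
      }
    }
  ]
}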

As others have contributed, various tools and options are available to help address issues like this. As much as we'd all prefer for things to effortlessly align well, that is rather rare at this point in time. As more structured logging becomes ingrained and embraced throughout tech stacks, I am hopeful that there would be some organic direction toward common logging structures/fields like ECS describes. Given Elastic's cooperation with OpenTelemetry maybe that is a way to a better future.