Comparing and Unifying fields across documents

Hello!

I have an Elasticsearch index where a lot of request calls to our company's API are being logged. All of the documents have "send" and "response" objects, and the content of these objects changes based on the API endpoint being accessed.

The issue I am facing comes from the way the engineers have designed the backend systems. Some API calls name the field "abc.xyz.HomeAnalysis", some call it "abc.xyz.homeanalysis", and others call it "abc.xyz.home_analysis".

Essentially, all three could be bundled as "abc.xyz.HomeAnalysis". Currently this is leading to an explosion in the number of indexed fields (1000+), which could be cut roughly in half just by merging fields with different names into one unified field.

P.S. Any one document within the index will only have ONE of the possible naming variants described above.

What is the best way to handle something like this?

One way I thought of was defining Painless scripts in the ingest pipeline that will do that for me. But that seems like a very manual process, and I was hoping there was a better way to handle this in Elasticsearch.
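
For illustration, a rough sketch of what such a script processor could look like. The pipeline name and the canonical-name map are just examples, and it assumes the variants only ever differ by case and underscores:

PUT _ingest/pipeline/normalize-application-fields
{
  "processors": [
    {
      "script": {
        "description": "Collapse field-name variants (case/underscore) into canonical names",
        "lang": "painless",
        "source": """
          // canonical names keyed by their lowercased, underscore-free form
          Map canonical = ['hasec': 'hasEc', 'homeanalysis': 'HomeAnalysis'];
          def app = ctx.send?.actionModels?.Body?.Application;
          if (app != null) {
            // copy the key set first so the map can be mutated while iterating
            for (String key : new ArrayList(app.keySet())) {
              String target = canonical.get(key.replace('_', '').toLowerCase());
              if (target != null && !target.equals(key)) {
                app[target] = app.remove(key);
              }
            }
          }
        """
      }
    }
  ]
}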

Sample documents. The field I am talking about: send.actionModels.Body.Application.hasEc --> this field has a different name in each of the three documents.

Document1

{
  "createdAt": "2023-05-08T19:24:08.3473356Z",
  "ipGeo": {
    "continent_name": "North America",
    "region_iso_code": "US-IA",
    "city_name": "Des Moines",
    "country_iso_code": "US",
    "country_name": "United States",
    "region_name": "Iowa",
    "location": {
      "lon": -93.6124,
      "lat": 41.6021
    }
  },
  "controllerName": "RaterV3",
  "send": {
    "actionModels": {
      "Body": {
        "Application": {
          "lastName": "Gruss",
          "constructionType": "wood",
          "city": "Port St. Joe",
          "deliveryMethod": "electronic",
          "hasEc": "false",
          "contentCoverage": "80000"
        }
      }
    }
  },
  "actionName": "PostQuote",
  "isIpBlock": false,
  "elapsedMS": 381,
  "responseCode": 200
}

Document2

{
  "createdAt": "2023-05-08T19:24:08.3473356Z",
  "ipGeo": {
    "continent_name": "North America",
    "region_iso_code": "US-IA",
    "city_name": "Des Moines",
    "country_iso_code": "US",
    "country_name": "United States",
    "region_name": "Iowa",
    "location": {
      "lon": -93.6124,
      "lat": 41.6021
    }
  },
  "controllerName": "RaterV3",
  "send": {
    "actionModels": {
      "Body": {
        "Application": {
          "lastName": "Gruss",
          "constructionType": "wood",
          "city": "Port St. Joe",
          "deliveryMethod": "electronic",
          "has_Ec": "false",
          "contentCoverage": "80000"
        }
      }
    }
  },
  "actionName": "PostQuote",
  "isIpBlock": false,
  "elapsedMS": 381,
  "responseCode": 200
}

Document3

{
  "createdAt": "2023-05-08T19:24:08.3473356Z",
  "ipGeo": {
    "continent_name": "North America",
    "region_iso_code": "US-IA",
    "city_name": "Des Moines",
    "country_iso_code": "US",
    "country_name": "United States",
    "region_name": "Iowa",
    "location": {
      "lon": -93.6124,
      "lat": 41.6021
    }
  },
  "controllerName": "RaterV3",
  "send": {
    "actionModels": {
      "Body": {
        "Application": {
          "lastName": "Gruss",
          "constructionType": "wood",
          "city": "Port St. Joe",
          "deliveryMethod": "electronic",
          "hasec": "false",
          "contentCoverage": "80000"
        }
      }
    }
  },
  "actionName": "PostQuote",
  "isIpBlock": false,
  "elapsedMS": 381,
  "responseCode": 200
}

One way I can think of is using an ingest pipeline rename processor. As the data comes in, it can rename fields to a target field name that will be referenced in the index. So if you choose abc.xyz.home_analysis as the standard name, then when Elasticsearch comes across abc.xyz.homeanalysis or abc.xyz.HomeAnalysis, both can be renamed to abc.xyz.home_analysis in your mapping.
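
For example, a pipeline along these lines, using the hasEc field from your sample documents (the pipeline name is arbitrary):

PUT _ingest/pipeline/normalize-field-names
{
  "description": "Rename field-name variants to one canonical field",
  "processors": [
    {
      "rename": {
        "field": "send.actionModels.Body.Application.has_Ec",
        "target_field": "send.actionModels.Body.Application.hasEc",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "send.actionModels.Body.Application.hasec",
        "target_field": "send.actionModels.Body.Application.hasEc",
        "ignore_missing": true
      }
    }
  ]
}

Since each document only contains one of the variants, ignore_missing: true lets each rename processor silently skip documents where its source field is absent.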

Here's more info about it in the docs.


Thank you for your response, @Alexis_Roberson. Yeah, that is currently how I am doing it: I build the ingest pipeline based on sample data (assuming it represents the whole population).

I create a "Rename" processor for each field with this issue. But that is leading to more than ~300 processors in total within the ingestion pipeline. Now while that works for a small sample data I am playing with, I am very concerned it might load my ingest nodes heavily once I do it in production.

So I am trying to see if there is a better way to achieve this within the Elastic realm.

I don't think there is another way; you would need to rename every field to normalize the names between the different sources.

Whether this will increase the load on your ingest nodes is impossible to know upfront; you would need to test it.
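
For correctness (as opposed to load), the simulate API is a quick way to check the pipeline against sample documents before wiring it into production. A minimal example, assuming the pipeline is named normalize-field-names as in the earlier sketch:

POST _ingest/pipeline/normalize-field-names/_simulate
{
  "docs": [
    {
      "_source": {
        "send": {
          "actionModels": {
            "Body": {
              "Application": {
                "has_Ec": "false"
              }
            }
          }
        }
      }
    }
  ]
}

The response shows each document as it would look after the pipeline runs, so you can confirm the variant was renamed to hasEc.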

The best solution is to fix the names of the fields before sending them to Elasticsearch, which would mean changing your systems to use the same name for each field.


Thank you, @leandrojmp.

I will stress test this to see how it affects ingest performance.

As far as changes to the system go, I wish that were a possibility; it would be so much easier. Unfortunately, I do not have any control over that team, and all of the log ingestion happens after the fact at this point, so I guess I will work with what I've got.

Cheers!
