Comparing and Unifying fields across documents

Hello!

I have an Elasticsearch index where a lot of request calls to our company's API are being logged. All of the documents have "send" and "response" objects, and the content of these objects changes based on the API endpoint being accessed.

The issue I am facing comes from the way the engineers have designed the backend systems. Some API calls name the field "abc.xyz.HomeAnalysis", some call it "abc.xyz.homeanalysis", and others call it "abc.xyz.home_analysis".

Essentially, all three could be bundled as "abc.xyz.HomeAnalysis". Currently this is leading to an explosion in the number of indexed fields (1000+), which could be cut roughly in half just by merging fields with different names into one unified field.

P.S. Any one document within the index will only have ONE of the possible naming variants described above.

What is the best way to handle something like this?

One way I thought of was defining Painless scripts in the ingest pipeline that will do that for me. But that seems like a very manual process, and I was hoping there was a better way to handle this in Elasticsearch.
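
For illustration, a rough sketch of what such a script processor could look like. The pipeline name and the canonical-name map are just examples, and it assumes the variants only ever differ by case and underscores:

PUT _ingest/pipeline/normalize-application-fields
{
  "processors": [
    {
      "script": {
        "description": "Collapse field-name variants (case/underscore) into canonical names",
        "lang": "painless",
        "source": """
          // canonical names keyed by their lowercased, underscore-free form
          Map canonical = ['hasec': 'hasEc', 'homeanalysis': 'HomeAnalysis'];
          def app = ctx.send?.actionModels?.Body?.Application;
          if (app != null) {
            // copy the key set first so the map can be mutated while iterating
            for (String key : new ArrayList(app.keySet())) {
              String target = canonical.get(key.replace('_', '').toLowerCase());
              if (target != null && !target.equals(key)) {
                app[target] = app.remove(key);
              }
            }
          }
        """
      }
    }
  ]
}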

Sample documents. The field I am talking about: send.actionModels.Body.Application.hasEc --> this field has a different name in each of the three documents.

Document1

{
  "createdAt": "2023-05-08T19:24:08.3473356Z",
  "ipGeo": {
    "continent_name": "North America",
    "region_iso_code": "US-IA",
    "city_name": "Des Moines",
    "country_iso_code": "US",
    "country_name": "United States",
    "region_name": "Iowa",
    "location": {
      "lon": -93.6124,
      "lat": 41.6021
    }
  },
  "controllerName": "RaterV3",
  "send": {
    "actionModels": {
      "Body": {
        "Application": {
          "lastName": "Gruss",
          "constructionType": "wood",
          "city": "Port St. Joe",
          "deliveryMethod": "electronic",
          "hasEc": "false",
          "contentCoverage": "80000"
        }
      }
    }
  },
  "actionName": "PostQuote",
  "isIpBlock": false,
  "elapsedMS": 381,
  "responseCode": 200
}

Document2

{
  "createdAt": "2023-05-08T19:24:08.3473356Z",
  "ipGeo": {
    "continent_name": "North America",
    "region_iso_code": "US-IA",
    "city_name": "Des Moines",
    "country_iso_code": "US",
    "country_name": "United States",
    "region_name": "Iowa",
    "location": {
      "lon": -93.6124,
      "lat": 41.6021
    }
  },
  "controllerName": "RaterV3",
  "send": {
    "actionModels": {
      "Body": {
        "Application": {
          "lastName": "Gruss",
          "constructionType": "wood",
          "city": "Port St. Joe",
          "deliveryMethod": "electronic",
          "has_Ec": "false",
          "contentCoverage": "80000"
        }
      }
    }
  },
  "actionName": "PostQuote",
  "isIpBlock": false,
  "elapsedMS": 381,
  "responseCode": 200
}

Document3

{
  "createdAt": "2023-05-08T19:24:08.3473356Z",
  "ipGeo": {
    "continent_name": "North America",
    "region_iso_code": "US-IA",
    "city_name": "Des Moines",
    "country_iso_code": "US",
    "country_name": "United States",
    "region_name": "Iowa",
    "location": {
      "lon": -93.6124,
      "lat": 41.6021
    }
  },
  "controllerName": "RaterV3",
  "send": {
    "actionModels": {
      "Body": {
        "Application": {
          "lastName": "Gruss",
          "constructionType": "wood",
          "city": "Port St. Joe",
          "deliveryMethod": "electronic",
          "hasec": "false",
          "contentCoverage": "80000"
        }
      }
    }
  },
  "actionName": "PostQuote",
  "isIpBlock": false,
  "elapsedMS": 381,
  "responseCode": 200
}

One way I can think of is using an ingest pipeline rename processor. As the data comes in, it can rename fields to a target field name that will be referenced in the index. So if you choose abc.xyz.home_analysis as the standard name, then when Elasticsearch comes across abc.xyz.homeanalysis or abc.xyz.HomeAnalysis, both can be renamed to abc.xyz.home_analysis in your mapping.
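
For example, a pipeline along these lines, using the hasEc field from your sample documents (the pipeline name is arbitrary):

PUT _ingest/pipeline/normalize-field-names
{
  "description": "Rename field-name variants to one canonical field",
  "processors": [
    {
      "rename": {
        "field": "send.actionModels.Body.Application.has_Ec",
        "target_field": "send.actionModels.Body.Application.hasEc",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "send.actionModels.Body.Application.hasec",
        "target_field": "send.actionModels.Body.Application.hasEc",
        "ignore_missing": true
      }
    }
  ]
}

Since each document only contains one of the variants, ignore_missing: true lets each rename processor silently skip documents where its source field is absent.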

Here's more info about it in the docs.


Thank you for your response, @Alexis_Roberson. Yeah, that is currently how I am doing it: I build the ingest pipeline based on sample data (assuming it represents the whole population).

I create a "Rename" processor for each field with this issue. But that is leading to more than ~300 processors in total within the ingestion pipeline. Now while that works for a small sample data I am playing with, I am very concerned it might load my ingest nodes heavily once I do it in production.

So I am trying to see if there is a better way to achieve this within the Elastic realm.

I don't think there is another way; you would need to rename every field to normalize the names between the different sources.

Whether this will increase the load on your ingest nodes is impossible to know upfront; you would need to test it.
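
For correctness (as opposed to load), the simulate API is a quick way to check the pipeline against sample documents before wiring it into production. A minimal example, assuming the pipeline is named normalize-field-names as in the earlier sketch:

POST _ingest/pipeline/normalize-field-names/_simulate
{
  "docs": [
    {
      "_source": {
        "send": {
          "actionModels": {
            "Body": {
              "Application": {
                "has_Ec": "false"
              }
            }
          }
        }
      }
    }
  ]
}

The response shows each document as it would look after the pipeline runs, so you can confirm the variant was renamed to hasEc.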

The best solution is to fix the names of the fields before sending them to Elasticsearch, which would mean changing your systems to use the same name for each field.


Thank you, @leandrojmp.

I will stress test this to see how it affects ingest performance.

As far as changes to the system go, I wish that were a possibility; it would be so much easier. Unfortunately, I do not have any control over that team, and all of the log ingestion happens after the fact at this point, so I guess I will work with what I've got.

Cheers!
