Convert string to array for new and old data in an index

Our current audit log entries contain a field called Ipaddress, which holds an IP address as a string. Some services send this value as a single IP address, while others send it as an array of multiple comma-separated IP addresses. Handling and querying these IP addresses in Elasticsearch is currently inefficient and error-prone because the field is a single string.
Our audit schema defines this field as a string, and we need to change its data type. This change will let us efficiently manage and search multiple proxy IP addresses within an array, enabling better analysis of traffic across multiple proxies.
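
For illustration, the two conflicting document shapes might look like this (field values hypothetical):

```
// shape A: a single string (values are made up for illustration)
{ "Ipaddress": "10.0.0.1" }

// shape B: an array of addresses
{ "Ipaddress": ["10.0.0.1", "10.0.0.2"] }
```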

We need a few clarifications:

  1. Is it better to convert the Ipaddress string into an array or into the ip datatype?
  2. We already have data in our audit-log index where some documents store this field as a string and others as an array, and querying them makes the Elasticsearch API return a 500 due to a serialization error. Which is the better approach: changing the string type to array/ip in the schema (code), converting the Ipaddress string to the new data type via Logstash, or both?
  3. For the existing logs stored as strings in Elasticsearch, will a reindex be required?
  4. Logs with SourceIp already in array format should remain unchanged.

We also have another, similar use case: since we receive logs from various services, some pass the value for a field as a boolean (true or false), while others pass it as a string ("True" or "False"). This type mismatch leads to issues in Elasticsearch, preventing it from storing such logs due to inconsistent data types. In our schema this field is defined as a string.

So for both of the above scenarios, does this mean updating our Logstash pipeline to convert values to the appropriate data type, and/or should this be supported with schema changes? And what is the best way to normalize the existing data stored in the audit index?

Thanks

Hi @Moni_Hazarika,

Here are some suggestions you can try:

  1. It is generally recommended to convert your IP fields to the ip or ip_range datatype so that you can perform range queries; see the mapping sketch below.
  2. For the existing logs, you can reindex into an index with the proper data type (see the reindex sketch below). You can also try runtime fields.
  3. Yes, if you create a new index, a reindex will be needed, or, as suggested, you can try runtime fields.
  4. It is better to change all IP fields to the ip data type. In Elasticsearch any field can hold one or more values, so the same ip mapping also stores arrays.
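
To make these suggestions concrete, here are some sketches in Console syntax (index names, pipeline names, derived field names such as Ipaddress_ip, and IP values are placeholders; adjust them to your setup). First, a new index that maps Ipaddress as ip. Elasticsearch has no dedicated array type: any field can hold one or more values, so this mapping accepts both a single address and an array:

```
// index name "audit-log-v2" is a placeholder
PUT audit-log-v2
{
  "mappings": {
    "properties": {
      "Ipaddress": { "type": "ip" }
    }
  }
}
```

Next, a reindex that routes the old documents through an ingest pipeline. The split processor only fires when the value is a string containing a comma, so documents that already hold an array pass through unchanged (your point 4):

```
// pipeline and index names are placeholders
PUT _ingest/pipeline/normalize-ipaddress
{
  "processors": [
    {
      "split": {
        "field": "Ipaddress",
        "separator": ",\\s*",
        "if": "ctx.Ipaddress instanceof String && ctx.Ipaddress.contains(',')"
      }
    }
  ]
}

POST _reindex
{
  "source": { "index": "audit-log" },
  "dest": { "index": "audit-log-v2", "pipeline": "normalize-ipaddress" }
}
```

And if you want to query the mixed data without reindexing first, a runtime field of type ip can normalize at search time. It reads _source per document, so it is slower than an indexed field, and every emitted value must be a valid IP or the script will error:

```
// "Ipaddress_ip" is a hypothetical name for the derived runtime field
PUT audit-log/_mapping
{
  "runtime": {
    "Ipaddress_ip": {
      "type": "ip",
      "script": {
        "source": "def v = params._source.Ipaddress; if (v instanceof String) { for (ip in v.splitOnToken(',')) { emit(ip.trim()); } } else if (v != null) { for (ip in v) { emit(ip); } }"
      }
    }
  }
}
```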

Regarding the boolean values: you can either use the lowercase processor or normalize the value in your Logstash pipeline; a sketch follows below. The idea is to maintain a consistent format across the data.
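
A sketch of such an ingest pipeline, where IsSuccess is a hypothetical field name standing in for your actual boolean field:

```
// "IsSuccess" is a placeholder; substitute your actual field name
PUT _ingest/pipeline/normalize-boolean
{
  "processors": [
    {
      "convert": {
        "field": "IsSuccess",
        "type": "boolean",
        "if": "ctx.IsSuccess instanceof String"
      }
    }
  ]
}
```

The convert processor's boolean conversion ignores case, so both "True" and "False" become real booleans. If you would rather keep the field as a string, the lowercase processor alone gives you a consistent "true"/"false"; in Logstash, the mutate filter can do the equivalent conversion. Either way, pick one representation and enforce it at ingest time.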