Drop user IP from logs after x days

Hi,
I'm new to Elasticsearch and am running version 9.0.2, deployed with the Elastic operator 3.0.0 on my Kubernetes cluster. I ingest access logs from multiple web proxies into Elasticsearch. These logs contain the IP of the user that connected to my site. I want to keep these logs in their original form for a certain amount of time, then drop the IP and keep the "anonymized" logs. I am struggling with this a little bit and it feels like I am missing some fundamentals. Here is the approach I came up with so far:

  1. Logs are initially ingested (from fluent-bit) to indices in the format apache-access-<server>-<date>. So for example
apache-access-my-server-2025.06.17
  2. I have an index template that applies an ILM policy to these indices. The policy is very simple and only defines the hot phase rollover conditions, nothing else. The relevant part of the index settings:
...
      "index": {
        "lifecycle": {
          "name": "apache-access-logs",
          "rollover_alias": "keep-long-term-apache-access"
        }
...
  3. I defined an ingest pipeline that drops the field containing the user's IP. Testing the pipeline with a document from my existing indices works fine:
[
  {
    "remove": {
      "field": "host"
    }
  }
]
  4. I have an index template that matches the rollover_alias from step 2 and sets the pipeline from step 3 as the default pipeline.
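
For clarity, steps 3 and 4 roughly look like this in Dev Tools (the pipeline and template names are placeholders, and everything except the relevant settings is left out):

PUT _ingest/pipeline/drop-user-ip
{
  "description": "Remove the field holding the client IP",
  "processors": [
    {
      "remove": {
        "field": "host",
        "ignore_missing": true
      }
    }
  ]
}

PUT _index_template/keep-long-term-apache-access
{
  "index_patterns": ["keep-long-term-apache-access-*"],
  "template": {
    "settings": {
      "index.default_pipeline": "drop-user-ip"
    }
  }
}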

Now this currently fails with

"index.lifecycle.rollover_alias [keep-long-term-apache-access] does not point to index [apache-access-server-2025.06.17]"

but beyond the fix for this specific error, it makes me question whether my approach is even correct.

So, while not technically incorrect ... you are trying to use a very old approach.

You should use a data stream, not an index... then you would not have this problem.

And BTW the proper naming would be

logs-apache.access-default
<type>-<dataset>-<namespace>
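
As a rough sketch, assuming you keep your existing ILM policy and write your documents to logs-apache.access-default, the index template would look something like this (the template name is just an example, and the priority just needs to be above the built-in logs-*-* template so this one wins):

PUT _index_template/logs-apache.access
{
  "index_patterns": ["logs-apache.access-*"],
  "data_stream": {},
  "priority": 200,
  "template": {
    "settings": {
      "index.lifecycle.name": "apache-access-logs"
    }
  }
}

With a data stream, ILM rolls the backing indices over on its own and there is no rollover_alias to keep in sync.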

AND if these are Apache access logs you could probably parse them automatically ...

But get the data flowing and rolling over first, then maybe we can help you.

Thanks for your response!
So what I'm taking from this is:

  • "Manually" dating indices at the point of ingestion is no longer / was never done. Instead, use a data stream that rolls over to new underlying indices automatically based on ILM policy criteria.

I will try to rework my setup to do that, but I am still left wondering about one of my initial questions that I maybe did not express clearly:
Can I run an ingest pipeline not on initial ingestion, but only on rollover at a later point in time? If so, is this the recommended way of changing documents after ingestion, or is there a different/better method?

On a side note: my logs were already flowing into Elasticsearch just fine and are already parsed. One document looks like this:

{
  "_index": "apache-access-server-2025.06.17",
  "_id": "***",
  "_version": 1,
  "_source": {
    "@timestamp": "2025-06-17T23:59:56.000Z",
    "geoip.location": {
      "lat": ***,
      "lon": ***
    },
    "user": "-",
    "code": "400",
    "size": "4358",
    "referer": "-",
    "log_file": "/var/log/apache2/***.log",
    "host": "127.0.0.1",
    "target_host": "example.org",
    "path": "/",
    "geoip.country_name": "Norway",
    "geoip.isocode": "NO",
    "geoip.city_name": "Oslo",
    "agent": "-",
    "method": "GET"
  },
  "fields": {
    "referer": [
      "-"
    ],
    "agent": [
      "-"
    ],
    "code": [
      "400"
    ],
    "geoip.country_name.keyword": [
      "Norway"
    ],
    "user.keyword": [
      "-"
    ],
    "geoip.isocode": [
      "NO"
    ],
    "path": [
      "/"
    ],
    "agent.keyword": [
      "-"
    ],
    "geoip.location": [
      {
        "coordinates": [
          ***,
          ***
        ],
        "type": "Point"
      }
    ],
    "log_file.keyword": [
      "/var/log/apache2/***.log"
    ],
    "log_file": [
      "/var/log/apache2/***.log"
    ],
    "host": [
      "127.0.0.1"
    ],
    "geoip.city_name.keyword": [
      "Oslo"
    ],
    "geoip.country_name": [
      "Norway"
    ],
    "method.keyword": [
      "GET"
    ],
    "referer.keyword": [
      "-"
    ],
    "code.keyword": [
      "400"
    ],
    "method": [
      "GET"
    ],
    "geoip.city_name": [
      "Oslo"
    ],
    "target_host": [
      "example.org"
    ],
    "target_host.keyword": [
      "example.org"
    ],
    "@timestamp": [
      "2025-06-17T23:59:56.000Z"
    ],
    "size": [
      4358
    ],
    "user": [
      "-"
    ],
    "geoip.isocode.keyword": [
      "NO"
    ],
    "path.keyword": [
      "/"
    ]
  }
}

I accidentally deleted my earlier reply...

Rollover doesn't reindex the data and it doesn't change the structure of the index. The data only gets read/written again if the rollover causes the index to move to a different node. Ingest pipelines aren't applied at rollover.

Downsampling might be an option. I haven't done it yet, but I think you would map the summary-level fields with time_series_dimension: true and the statistical fields with time_series_metric values, and not include the IP address; that data would be summarized out.
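
Untested, but I think the mapping side would look roughly like this (the dimension/metric choices and field names are just guesses based on your sample document, and the numeric fields would need numeric mappings):

PUT _index_template/apache-access-tsds
{
  "index_patterns": ["apache-access-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.routing_path": ["target_host", "code"]
    },
    "mappings": {
      "properties": {
        "target_host": { "type": "keyword", "time_series_dimension": true },
        "code":        { "type": "keyword", "time_series_dimension": true },
        "size":        { "type": "long",    "time_series_metric": "gauge" }
      }
    }
  }
}

The host field (the client IP) would deliberately not be declared as a dimension or a metric.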

Downsampling does work with ILM, so the process would be automatic with your rollover.
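
On the ILM side, I believe it would just be an extra action in the policy, something like this (the interval and timings are made-up examples, the policy name is reused from your earlier posts):

PUT _ilm/policy/apache-access-logs
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_primary_shard_size": "50gb"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "downsample": {
            "fixed_interval": "1h"
          }
        }
      }
    }
  }
}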

Thanks for the suggestion @rugenl
I checked out the downsampling documentation here, and at first glance it does not seem to remove fields. Specifically this part:
"All other fields that are neither dimensions nor metrics (that is, label fields), are created in the target downsample index with the same mapping that they had in the source index."

I thought about maybe marking the IP field as a time_series_metric and trying to get rid of it that way, but this all seems way too hacky; it is clearly not the intended use of downsampling. I have instead opted to just drop the last two octets of the IP at ingestion, which is "good enough for now" for both my statistics and data privacy needs.
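
In case it helps anyone else: one way to do the octet-dropping is a gsub processor in the ingest pipeline, along these lines (IPv4 only; host is the field name from my documents, the pipeline name is made up):

PUT _ingest/pipeline/truncate-client-ip
{
  "description": "Replace the last two IPv4 octets with 0",
  "processors": [
    {
      "gsub": {
        "field": "host",
        "pattern": "^(\\d{1,3}\\.\\d{1,3})\\.\\d{1,3}\\.\\d{1,3}$",
        "replacement": "$1.0.0",
        "ignore_missing": true
      }
    }
  ]
}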

If anyone has any other ideas I would be happy to try them out, but for now this will do. Thanks everyone for the input!

Ideas? Don't store the IP at all. Encrypt it at ingest, store the encrypted version, and keep the decryption key (say, a new key every day) for N days.
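
If full encryption is more than you want to build, a weaker but built-in alternative is the fingerprint processor with a salt. That's a keyed hash rather than encryption, so it isn't reversible at all, but it keeps the field correlatable. A rough sketch (field and pipeline names are just examples):

PUT _ingest/pipeline/pseudonymize-client-ip
{
  "description": "Replace the client IP with a salted hash",
  "processors": [
    {
      "fingerprint": {
        "fields": ["host"],
        "target_field": "host_hash",
        "method": "SHA-256",
        "salt": "rotate-this-regularly",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "host",
        "ignore_missing": true
      }
    }
  ]
}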

Edit: my guess is you are trying to avoid updating all your documents to remove a single field, as that’s pretty expensive for the value it adds (or the non-compliance it removes)?

In passing, I'm a little surprised you've considered indexing just a.b from a.b.c.d. Obviously it's your data and you know your use case, but you seem to me to have lost a lot of important detail there. E.g. if you get hit by a DDoS/DoS, you can't pinpoint where it came from.


The encryption idea would probably work well. If this becomes a more pressing issue I might revisit it. Currently we do still keep the logs on the machines themselves for the N days I would have kept them fully readable in Elasticsearch, so for the admins' debugging this normally suffices.