How to take advantage of compression for a Time Series Data Stream

Hello,

Every hour I send a bulk request with several documents into a data stream; each update sets the same @timestamp on every document in the bulk (the idea is to track a data source over time).
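
Roughly, each hourly request looks like this (a simplified sketch; the data stream name, field names and values are just placeholders):

POST /my-metrics-stream/_bulk
{ "create": {} }
{ "filename": "file_a", "metadata1": "m1", "@timestamp": "2023-10-11T13:00:00.000Z" }
{ "create": {} }
{ "filename": "file_b", "metadata1": "m1", "@timestamp": "2023-10-11T13:00:00.000Z" }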

The problem is that, comparing each update with the previous one, there are many duplicated documents. To optimize the size of the data stream I thought it was a good idea to use a TSDS (time series data stream), since it looks like it can achieve approximately 70% compression.

When I run it, I don't see any compression going on: the size of the data stream is the same as when I don't use time series. I also observe that two updates (both adding the same documents) take double the size of one update, so it is not compressing.

Does someone know how I can compress the duplicated fields of documents that differ only in their timestamp?

Thanks and best regards.

What do your documents look like? Is your data logs or traces?

Also, can you share the template of your TSDS?

The documents are time-stamped metrics (I guess not the most common use of ES?).

Each document contains metadata related to a file and describes properties linked to that file
(the file is stored on a remote hard drive that is updated continuously).

An example of a document:

{
  "filename": "path_filename_with_metadata",
  "metadata1": "m1",
  "metadata2": "m2",
  "metadata3": "m3",
  "@timestamp": "2023-10-11T13:47:05.000Z"
}

The filename field is unique; it will not appear more than once in a single update.
That's the field I set with "time_series_dimension": true in the mappings of the index_template.

This is the index_template:

{
  "template": {
    "settings": {
      "index": {
        "lifecycle": {
          "name": "policy"
        },
        "mode": "time_series",
        "codec": "best_compression",
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_hot"
            }
          }
        },
        "routing_path": [
          "file_name"
        ],
        "time_series": {
          "end_time": "2023-10-11T14:03:06.000Z",
          "start_time": "2023-10-11T10:03:06.000Z"
        }
      }
    },
    "mappings": {
      "properties": {
        "file_name": {
          "type": "keyword",
          "time_series_dimension": true
        }
      }
    },
    "aliases": {}
  }
}
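
For completeness, this is roughly how I register the template (the template name and index pattern are placeholders; the lifecycle and start/end time settings are omitted here for brevity):

PUT _index_template/my-tsds-template
{
  "index_patterns": ["my-metrics-stream*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.codec": "best_compression",
      "index.routing_path": ["file_name"]
    },
    "mappings": {
      "properties": {
        "file_name": {
          "type": "keyword",
          "time_series_dimension": true
        }
      }
    }
  }
}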

Thanks.

Any reason why you didn't set time_series_dimension for the metadataX fields?

I don't think you will see much difference in compression compared with a data stream that has just one field as time_series_dimension, especially if you were already using the best_compression codec in the normal data stream.
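
If you do want to try it, the mappings section would look something like this (just a sketch adapted from your template, not tested):

"mappings": {
  "properties": {
    "file_name": {
      "type": "keyword",
      "time_series_dimension": true
    },
    "metadata1": {
      "type": "keyword",
      "time_series_dimension": true
    },
    "metadata2": {
      "type": "keyword",
      "time_series_dimension": true
    },
    "metadata3": {
      "type": "keyword",
      "time_series_dimension": true
    }
  }
}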

I have added more fields as time_series_dimension.

I still don't see any compression: when I send the same bulk 3 times, the data stream takes 3x the size of the first bulk update.
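
(For reference, I compare the sizes with something like the following; the data stream name is a placeholder:)

GET /_data_stream/my-metrics-stream/_stats?human=true
GET _cat/indices/.ds-my-metrics-stream-*?v&h=index,docs.count,pri.store.size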

Let's say that the TSDS has this content; just as an example, both updates contain the same documents:

(First update)
doc1 with @timestamp=1
doc2 with @timestamp=1
...
docn with @timestamp=1

(Second update)
doc1 with @timestamp=2
doc2 with @timestamp=2
...
docn with @timestamp=2

Consider that the docs are the same in both updates and the only difference is the timestamp field (we can say that in the data stream each doc is duplicated once, apart from the timestamp field), and that all fields in the docs are dimensions.

In this case, should I see any compression going on?
(I mean the compression that takes advantage of fields duplicated across more than one bulk update; of course it is already compressing a lot after the first bulk update, since the total size on disk is way less than the sum of the individual docs.)
