Field _size on data stream

Hi, I'd like to set the _size field on a data stream associated with an integration. How can I do this? The data stream has a custom index template associated with it:

GET _index_template/logs-thales_udp.log-custom

{
  "index_templates": [
    {
      "name": "logs-thales_udp.log-custom",
      "index_template": {
        "index_patterns": [
          "logs-thales_udp.log-*"
        ],
        "template": {
          "settings": {
            "index": {
              "lifecycle": {
                "name": "thales-ilmpolicy-custom"
              }
            }
          },
          "mappings": {
            "_meta": {
              "package": {
                "name": "udp"
              },
              "managed_by": "fleet",
              "managed": false
            }
          }
        },
        "composed_of": [
          "logs@mappings",
          "logs@settings",
          "logs-thales_udp.log@package",
          "logs@custom",
          "logs-thales_udp.log@custom",
          "ecs@mappings",
          ".fleet_globals-1",
          ".fleet_agent_id_verification-1"
        ],
        "priority": 501,
        "_meta": {
          "package": {
            "name": "udp"
          },
          "managed_by": "fleet",
          "managed": false
        },
        "data_stream": {
          "hidden": false,
          "allow_custom_routing": false
        },
        "ignore_missing_component_templates": [
          "logs@custom",
          "logs-thales_udp.log@custom"
        ]
      }
    }
  ]
}

I installed the mapper-size plugin.

If I try to set the _size field with:

PUT _index_template/logs-thales_udp.log-custom
{
  "mappings": {
    "_size": {
      "enabled": true
    }
  }
}

I get an error:

{
  "error": {
    "root_cause": [
      {
        "type": "x_content_parse_exception",
        "reason": "[2:3] [index_template] unknown field [mappings]"
      }
    ],
    "type": "x_content_parse_exception",
    "reason": "[2:3] [index_template] unknown field [mappings]"
  },
  "status": 400
}

Hello @Cristina_Marletta_Li

If the mappings are placed inside a template block, together with index_patterns, the request does not return an error:

PUT _index_template/logs-thales_udp.log-custom
{
  "index_patterns": ["logs-thales_udp.log-*"],
  "template": {
    "mappings": {
      "_size": {
        "enabled": true
      }
    }
  }
}

Thanks!!

Thank you Tortoise.

I've now successfully modified the index template (using the composable template ...@custom). I installed the mapper-size plugin on the master, hot, and cold nodes. I rolled over, but I don't see the _size field on the new indexes.

Hello @Cristina_Marletta_Li

Could you please check the mapping of the new index generated after the rollover?

GET <new-index/datastream-name>/_mapping

If not, could you check the mapping for the latest index?

GET logs-thales_udp.log/_mapping

Thanks!!

Hello @Tortoise ,

In the mapping section, I can see:

"_size": {
  "enabled": true
}

I found the value of _size in an event just by looking at its JSON format. It's not in the Table representation of the event. Strange!

Hello @Cristina_Marletta_Li

If you want to view it in the Table view, it needs to be added as part of the below:

Related documentation:

Thanks!!

Thank you @Tortoise,

now there's only one thing missing.

If I query the index via the API, e.g.:

GET logs-thales_udp.log-default/_search
{
  "query": {
    "match": {
      "_id": "xxxxxx"
    }
  }
}

I don't see the _size field in the response.

Now I see the _size field in the metadata though.

Try

GET logs-thales_udp.log-default/_search
{
  "fields": ["*"],
  "query": {
    "match": {
     "_id": "xxxxxx"
    }
  }
}

All ok! I have the _size field!

Now I wonder: what meaning should I give to the _size field since I am interested in the storage occupation of a set of events?

I thought that _size was the number of bytes that make up the event's _source in its JSON format, but that does not seem to be the case.

Can you help me understand this field?

Thank you

From the docs here

The mapper-size plugin provides the _size metadata field which, when enabled, indexes the size in bytes of the original _source field.

That is not the total number of bytes used by the document once it is indexed (inverted index, doc values, synthetic source, etc.). The size on disk cannot be calculated before the document is written, because it is not yet known, so there is no way to store it in a field of the document... because documents are immutable! :slight_smile:

So I'm curious now. What are you trying to figure out?

Are you looking for the average size per doc?
That is simply primary size on disk / document count.
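A minimal sketch of that calculation, assuming the per-index numbers come from something like GET _cat/indices/logs-thales_udp.log-default?format=json&bytes=b&h=index,docs.count,pri.store.size. The rows below are hypothetical (index names and values are made up), and the function name is only illustrative:

```python
# Hypothetical rows shaped like the JSON output of _cat/indices
# (values arrive as strings; sizes are in bytes because of bytes=b).
sample_rows = [
    {"index": "backing-index-000001", "docs.count": "1000", "pri.store.size": "500000"},
    {"index": "backing-index-000002", "docs.count": "3000", "pri.store.size": "1300000"},
]


def average_doc_bytes(rows):
    """Average on-disk bytes per document: primary store size / doc count."""
    total_docs = sum(int(row["docs.count"]) for row in rows)
    total_bytes = sum(int(row["pri.store.size"]) for row in rows)
    if total_docs == 0:
        return 0.0  # avoid dividing by zero on empty indices
    return total_bytes / total_docs


print(average_doc_bytes(sample_rows))
```

This is an average over whatever is on disk at the moment, so the number moves as segments merge.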

Do you want to know what fields are taking up the most space?

Use the _disk_usage API
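As a rough sketch of how the output could be used: the request would be POST <index>/_disk_usage?run_expensive_tasks=true, and the code below assumes that the per-index part of the response has a "fields" object whose entries carry "total_in_bytes". Check that shape against a real response; the sample values and the function name here are made up for illustration.

```python
def rank_fields_by_bytes(index_usage, top=5):
    """Return (field, bytes) pairs sorted by descending size.

    index_usage is assumed to be the per-index object of a _disk_usage
    response, i.e. a dict with a "fields" mapping.
    """
    fields = index_usage.get("fields", {})
    ranked = sorted(
        ((name, stats.get("total_in_bytes", 0)) for name, stats in fields.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top]


# Hypothetical, heavily trimmed per-index usage object.
sample_usage = {
    "fields": {
        "message": {"total_in_bytes": 9000},
        "_source": {"total_in_bytes": 20000},
        "@timestamp": {"total_in_bytes": 3000},
    }
}

for name, size in rank_fields_by_bytes(sample_usage, top=2):
    print(name, size)
```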

And of course, the size on disk is typically reduced after merging happens, so the average doc size usually shrinks after merging compared to when the documents were first written.

In fairness, that's not what the doc says. It says it's "the size in bytes of the original _source field".

@Cristina_Marletta_Li

A quick test showed that an (empty) doc with no fields/values has _size == 3, which is maybe {} plus a null? Add a keyword field {"x":""}, and _size jumps to 9. Add spaces around the : and _size is now 11. Set {"x":"1234567890"} and it's 19. So it seems to check out to me, as I'm not going to quibble about a byte here or there!?
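For comparison, here is a small sketch that counts the UTF-8 bytes of those same JSON bodies exactly as typed. The reported _size values are the ones quoted above; the variable and function names are only illustrative. Every reported value comes out exactly one byte larger than the raw body, which would be consistent with a trailing newline being sent with the request, but that explanation is a guess and was not verified.

```python
# JSON bodies from the quick test, written exactly as they would be sent.
samples = ['{}', '{"x":""}', '{"x" : ""}', '{"x":"1234567890"}']

# _size values reported in the thread for those bodies, in the same order.
reported_sizes = [3, 9, 11, 19]


def source_bytes(body):
    """Number of bytes in the body when encoded as UTF-8."""
    return len(body.encode("utf-8"))


for body, reported in zip(samples, reported_sizes):
    raw = source_bytes(body)
    print(f"{body!r}: raw={raw} reported={reported} diff={reported - raw}")
```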

Total agreement, not sure where we are misaligned :slight_smile:

The main takeaway is that _size doesn't represent the total storage needed to index the document to disk. It only reflects the size of the _source field's JSON before indexing, pretty sure uncompressed, as you demonstrated.

The actual size on disk (for non-Synthetic Source) is typically the compressed size of the _source field plus the data structures necessary for indexing based on the mappings. Therefore, _size and the actual storage required on disk for a document are not the same.

I just wanted to clarify this. If we aren't focused on the actual size on disk, then perhaps we're all set.

With Synthetic Source, the _source field isn't stored on disk at all, which creates a much more significant difference.

The only reliable way to see the actual storage on disk is by using the _disk_usage API I mentioned earlier.

Historically, _size was used to understand "the size of what I'm indexing" and was often confused with the size on disk. However, it was a helpful indicator for identifying which documents were "expensive" or "heavy."

So, getting back to my earlier question: what are we trying to solve here? :slightly_smiling_face:

AFAIK we're not, which is good !!

I was just trying to understand the very narrow point the OP had made:

Plus or minus the odd byte, for at least the small samples I used to validate, the _size field seems to me to match what it said it would be in the docs. @Cristina_Marletta_Li might wish to share her experience / evidence to the contrary ?

@Cristina_Marletta_Li can answer that for herself. But sometimes people just want to understand something at quite a low level. Or indeed verify that "actual" matches "documented". It might even be "asking the wrong question", but ... no harm in asking, and right now I don't know the answer to the "what are we trying to solve" question either!
