I have a very basic question. When I index a document, for example
POST /customer/_doc/1/_update?pretty
{
"doc": { "name": "Jane Doe" }
}
is the key also indexed? To be specific, the string "name" in the above example.
Not really. What is your actual question?
I am sorry the question is confusing. My question is: are the key strings also indexed, and are they part of the inverted index? That is, is the term "name", from the example I posted in the question, also part of the inverted index?
To be more precise:
POST /customer/_doc/1/_update?pretty
{
"doc":
{"key1": "value1"},
{"key2": "value2"},
{"key3": "value3"},
{"key4": "value4"},
{"key5": "value5"}
}
do the strings "key1", "key2", "key3", "key4", "key5" add to the disk space if the _source is disabled?
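For clarity, by "_source is disabled" I mean a mapping along these lines (a sketch only; the index and type names are from the earlier example):

```json
PUT /customer
{
  "mappings": {
    "doc": {
      "_source": { "enabled": false }
    }
  }
}
```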
So your question is really "how can I reduce disk space?"
I'd say that, in a certain way, they are. I think that the exists query relies on that.
But I don't think it's a problem. So what problem would you like to solve?
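Field names are tracked internally (in the _field_names field), which is what the exists query relies on. A minimal sketch of such a query, reusing the index and field names from your example:

```json
GET /customer/_search
{
  "query": {
    "exists": { "field": "name" }
  }
}
```

This matches documents that have any indexed value for name, which only works because the field name itself is recorded.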
I am looking for all possible options to reduce disk space. I have already done the things mentioned in that page. Now I am trying to understand further internal details.
So, coming back to the question: if I index a million docs like the following
{
"doc":
{"key1": "value1"},
{"key2": "value2"},
{"key3": "value3"},
{"key4": "value4"},
{"key5": "value5"}
}
then I am wondering what percentage of my disk space is used by the key strings in the JSON.
Is it a problem for you or just a conceptual question?
I mean that disk space is cheap. Overengineering tends to be expensive.
But if you want to see exactly what you have in an index, you can use Luke to read the shard contents. See
It is, in a way, a problem, since I am not able to get a satisfactory reduction in disk space and I am handling TBs of data.
Also, Luke is a great insight; thank you very much for pointing me to it.
That's the right question to ask IMO.
What do you have as source documents in terms of volume?
What do you have as volume after having indexed everything and run the forcemerge API?
What is the mapping you are using?
What is the output of:
GET /_cat/nodes?v
GET /_cat/indices?v
GET /_cat/shards?v
The source volume is 2.46 TB,
and once it's indexed the store size is 1.1 TB (1 set of primary shards and 1 replica).
My Mapping is
"mappings" : {
"doc" : {
"_size" : {
"enabled" : true
},
"properties" : {
"@timestamp" : {
"type" : "date"
},
"@version" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"hostname" : {
"type" : "text",
"analyzer" : "pattern"
},
"log" : {
"properties" : {
"flags" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"log_level" : {
"type" : "text",
"analyzer" : "pattern"
},
"message" : {
"type" : "text",
"analyzer" : "pattern"
},
"package" : {
"type" : "text",
"analyzer" : "pattern"
},
"source" : {
"type" : "text",
"analyzer" : "pattern"
},
"tags" : {
"type" : "text",
"analyzer" : "pattern"
},
"thread" : {
"type" : "text",
"analyzer" : "pattern"
}
}
}
}
I don't know why, but I didn't see any change in space after the force merge; maybe I did something wrong. I thought of taking that up next, but for the sake of completeness I am pasting the output here.
I had run
POST /index_name/_forcemerge
It ran and produced the following output:
{
"_shards" : {
"total" : 32,
"successful" : 32,
"failed" : 0
}
}
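For completeness, I understand from the docs that the more aggressive variant, which merges each shard down to a single segment, would be:

```
POST /index_name/_forcemerge?max_num_segments=1
```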
The output for GET /_cat/nodes?v is
10.240.0.26 5 98 9 0.01 0.09 0.08 i - coord4
10.240.0.132 48 98 30 2.10 2.19 2.42 mdi - node3
10.240.0.27 6 98 4 0.05 0.11 0.09 i - coord3
10.240.0.29 4 98 12 0.46 0.45 0.36 i - coord2
10.240.0.134 63 98 29 2.71 2.68 2.64 di - node5
10.240.0.28 11 98 23 0.57 0.40 0.36 i - coord1
10.240.0.12 21 98 25 1.71 2.11 2.25 mdi * node2
10.240.0.136 44 98 25 2.84 2.59 2.58 di - node7
10.240.0.135 34 96 28 3.21 2.81 2.73 di - node6
10.240.0.133 40 98 26 2.62 2.60 2.68 di - node4
10.240.0.11 72 97 34 2.72 2.48 2.39 mdi - node1
10.240.0.30 5 98 18 0.01 0.05 0.05 mi - master
The output for GET /_cat/indices?v
green open index_name APP-cEzLR-i4vrhoFyL20Q 16 1 6797306863 0 1.1tb 605.2gb
The output of GET /_cat/shards?v is
index_name 14 p STARTED 424166345 37.8gb 10.240.0.133 node4
index_name 14 r STARTED 424166414 37.7gb 10.240.0.135 node6
index_name 4 p STARTED 424171600 37.8gb 10.240.0.132 node3
index_name 4 r STARTED 424171083 37.7gb 10.240.0.136 node7
index_name 12 p STARTED 424184503 37.7gb 10.240.0.136 node7
index_name 12 r STARTED 424184511 37.7gb 10.240.0.11 node1
index_name 8 r STARTED 424192628 37.7gb 10.240.0.133 node4
index_name 8 p STARTED 424192671 37.9gb 10.240.0.135 node6
index_name 13 r STARTED 424165421 37.7gb 10.240.0.132 node3
index_name 13 p STARTED 424164957 37.8gb 10.240.0.11 node1
index_name 10 r STARTED 424182004 37.7gb 10.240.0.132 node3
index_name 10 p STARTED 424181156 37.8gb 10.240.0.12 node2
index_name 5 p STARTED 424184124 37.7gb 10.240.0.136 node7
index_name 5 r STARTED 424184124 37.6gb 10.240.0.11 node1
index_name 2 p STARTED 424158150 37.5gb 10.240.0.134 node5
index_name 2 r STARTED 424157772 37.4gb 10.240.0.132 node3
index_name 6 r STARTED 424162765 37.6gb 10.240.0.134 node5
index_name 6 p STARTED 424161952 37.7gb 10.240.0.11 node1
index_name 1 p STARTED 424148057 37.9gb 10.240.0.135 node6
index_name 1 r STARTED 424148503 37.8gb 10.240.0.12 node2
index_name 7 p STARTED 424234489 37.9gb 10.240.0.133 node4
index_name 7 r STARTED 424233599 39gb 10.240.0.12 node2
index_name 9 r STARTED 424193478 37.8gb 10.240.0.135 node6
index_name 9 p STARTED 424193779 37.7gb 10.240.0.134 node5
index_name 15 p STARTED 424190986 37.6gb 10.240.0.135 node6
index_name 15 r STARTED 424191265 37.6gb 10.240.0.134 node5
index_name 3 r STARTED 424224518 37.7gb 10.240.0.133 node4
index_name 3 p STARTED 424223699 37.8gb 10.240.0.12 node2
index_name 11 p STARTED 424207929 37.7gb 10.240.0.132 node3
index_name 11 r STARTED 424207417 37.7gb 10.240.0.136 node7
index_name 0 p STARTED 424158289 37.7gb 10.240.0.133 node4
index_name 0 r STARTED 424158641 37.7gb 10.240.0.134 node5
I do have a few other indices as well, but this is the one that is of major concern.
If I'm not mistaken, you have 2.46 TB of source data.
This becomes 605 GB (primaries only), so it's only 25% of the original size.
What makes you unhappy with a 75% compression ratio?
You can still optimize your mapping a bit, though. Remove the following fields:
- _size
- @version, unless you need it
- flags.keyword or flags, unless you are doing both full-text search on it AND aggregations
A lot of fields have a pattern analyzer. What is it for? Are you sure it needs to be applied to all the fields?
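As a sketch, the relevant parts of the mapping after those removals could look like this (field names taken from your mapping; adjust to your needs):

```json
"mappings" : {
  "doc" : {
    "properties" : {
      "@timestamp" : { "type" : "date" },
      "log" : {
        "properties" : {
          "flags" : { "type" : "keyword", "ignore_above" : 256 }
        }
      }
    }
  }
}
```

Here _size and @version are gone, and flags is a single keyword field instead of a text field plus a keyword subfield.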
Primary alone gives 25%, but in production, even with one replica, it's almost at 50%. And when considering other scenarios for production readiness, such as DR and HA, it shoots up even more. Hence the concern.
For the mapping: _size was kept intentionally, but the others will be removed. The pattern analyzer is also intentional, since the standard analyzer was not producing some desired search results: a.b was one token, whereas a1.b was 2 tokens.
The pattern analyzer is also intentional, since the standard analyzer was not producing some desired search results: a.b was one token, whereas a1.b was 2 tokens.
As long as you're sure it is emitting fewer tokens, there is probably nothing to gain here. Unless some fields are not meant to be queried with full-text search but only for exact values, in which case that's a waste of space IMHO.
If you really want to reduce disk space, you can consider removing doc_values and using fielddata instead. But that will use a lot of memory.
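To illustrate the trade-off (a sketch reusing a field from your mapping): a text field stores no doc_values, and enabling fielddata on it lets you aggregate on it using JVM heap instead of disk:

```json
"log_level" : {
  "type" : "text",
  "analyzer" : "pattern",
  "fielddata" : true
}
```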
Primary alone gives 25%, but in production, even with one replica, it's almost at 50%.
Yeah. But the size of the original replicated data is then 4.92 TB, which you can compare to 1.2 TB, still 25% of the source data.
You can't decently compare apples and oranges. Otherwise you could also tell me that if you have 3 replicas, it's the same size as the original data, which is a bit of an unfair comparison IMO.
And when considering other scenarios for production readiness such as DR and HA it shoots up even more. Hence the concern.
Sure. You can't really win on all aspects. If you need redundancy, then you need to copy the data.
If you need to save disk space, you have to give up on speed, memory or redundancy.
Always a price to pay.
I'm not saying that you should not try to optimize as much as you can, but I really think that the price of storage is cheaper than the price of over-engineering (and all the side effects you are not thinking about).
Then, depending on your use case, i.e. if you have time-based data (a timestamp + data), you should consider using:
- Time based indices
- Index Lifecycle Management (or curator)
- Rollover API
The rollover API can be a huge win in terms of disk space for such a use case. So big that you wouldn't even think about disabling the _source, which I'd never do.
But again, it depends. On your use case mainly.
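For reference, the Rollover API works against a write alias; a minimal sketch with placeholder names and thresholds:

```json
PUT /logs-000001
{
  "aliases" : { "logs_write" : {} }
}

POST /logs_write/_rollover
{
  "conditions" : {
    "max_age" : "7d",
    "max_size" : "50gb"
  }
}
```

When a condition is met, a new index (logs-000002) is created and the alias switches to it.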
Except for the log level, which I need to change to keyword, the remaining fields need to be queried. For example, the hostname field has values such as google.com, and with the standard analyzer I can't search for terms like google alone unless I use a wildcard search. That's the reason for using the pattern analyzer.
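The difference is easy to see with the _analyze API; for example:

```json
POST /_analyze
{
  "analyzer" : "pattern",
  "text" : "google.com"
}
```

The pattern analyzer's default pattern (\W+) splits on the dot, so this yields the tokens google and com, whereas the standard analyzer keeps google.com as a single token.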
Sure, I will look into this. Thank you for the insight.
As you mentioned, mine are time-based indices and my source volume is 2.46 TB. When I mentioned the replica size, what I meant was that in my setup I have 1 set of primary shards and the replica count is set to 1; thus the eventual store size is 1.1 TB. I know it's not a very fair comparison, but my eventual ratio of source to store is 50%, and that is without the additional replicas that I might have to add for HA/DR.
I completely agree on this. That is the reason why I am trying to figure out the best bet that I can get.
I am already using time-based indices; the index name that I posted didn't reflect that, for privacy reasons.
I have Curator to clear indices that are older than my retention period.
I have not considered using the rollover API since I have already estimated my index size and set the number of primary shards accordingly, at 30-50 GB per shard according to the Elastic documentation (this might not be fully reflected in the output of the queries that I previously posted, since this is a test setup).
If your reasons are anything apart from what I mentioned, then please point me to a resource that describes them and I will surely go ahead with it.
Oops, I was tired.
I meant the Rollup API, not the Rollover API. Sorry.
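A rollup job is created roughly like this (a sketch with placeholder names; the exact options depend on your Elasticsearch version):

```json
PUT _rollup/job/logs_rollup
{
  "index_pattern" : "index_name-*",
  "rollup_index" : "logs_rolled",
  "cron" : "0 0 2 * * ?",
  "page_size" : 1000,
  "groups" : {
    "date_histogram" : {
      "field" : "@timestamp",
      "interval" : "1h"
    }
  }
}
```

It periodically pre-aggregates old data into the much smaller rollup index, after which the raw indices can be deleted.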
Haha. No problem at all. You are already helping a lot. Please don't apologize.
I will look into this; it seems like it will come in handy, since it would reduce the disk usage of the older data.