I have a very basic question. When I index a document, for example
POST /customer/_doc/1/_update?pretty
{
"doc": { "name": "Jane Doe" }
}
is the key also indexed? To be specific, the string "name" in the above example.
Not really. What is your actual question?
I am sorry the question is confusing. My question is: are the key strings also indexed, and are they part of the inverted index? That is, is the term "name", from the example I posted in the question, also part of the inverted index?
To be more precise:
POST /customer/_doc/1/_update?pretty
{
"doc":
{"key1": "value1"},
{"key2": "value2"},
{"key3": "value3"},
{"key4": "value4"},
{"key5": "value5"}
}
do the strings "key1", "key2", "key3", "key4", "key5" add to the disk space if the _source is disabled?
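For clarity, by "_source is disabled" I mean a mapping along these lines (a sketch only; the index and type names are from the earlier example):

```json
PUT /customer
{
  "mappings": {
    "doc": {
      "_source": { "enabled": false }
    }
  }
}
```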
So your question is really "how can I reduce disk space?"
I'd say that, in a certain way, they are. I think that the exists query relies on that.
But I don't think it's a problem. So what problem would you like to solve?
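Field names are tracked internally (in the _field_names field), which is what the exists query relies on. A minimal sketch of such a query, reusing the index and field names from your example:

```json
GET /customer/_search
{
  "query": {
    "exists": { "field": "name" }
  }
}
```

This matches documents that have any indexed value for name, which only works because the field name itself is recorded.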
I am looking for all possible options to reduce disk space. I have already done the things mentioned in that page. Now I am trying to understand further internal details.
So, coming back to the question: if I index a million docs like the following
{
"doc":
{"key1": "value1"},
{"key2": "value2"},
{"key3": "value3"},
{"key4": "value4"},
{"key5": "value5"}
}
then I am wondering what percentage of my disk space is used by the key strings in the JSON.
Is it a problem for you or just a conceptual question?
I mean that disk space is cheap. Overengineering tends to be expensive.
But if you want to see exactly what you have in an index, you can use Luke to read the shard contents. See
It is, in a way, a problem, since I am not able to get a satisfactory reduction in disk space and I am handling TBs of data.
Also, Luke is a great insight; thank you very much for pointing me to it.
That's the right question to ask IMO.
What do you have as source documents in terms of volume?
What do you have as volume after having indexed everything and run the forcemerge API?
What is the mapping you are using?
What is the output of:
GET /_cat/nodes?v
GET /_cat/indices?v
GET /_cat/shards?v
The source volume is 2.46 TB,
and once it's indexed the store size is 1.1 TB (1 set of primary shards and 1 replica).
My Mapping is
"mappings" : {
"doc" : {
"_size" : {
"enabled" : true
},
"properties" : {
"@timestamp" : {
"type" : "date"
},
"@version" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"hostname" : {
"type" : "text",
"analyzer" : "pattern"
},
"log" : {
"properties" : {
"flags" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"log_level" : {
"type" : "text",
"analyzer" : "pattern"
},
"message" : {
"type" : "text",
"analyzer" : "pattern"
},
"package" : {
"type" : "text",
"analyzer" : "pattern"
},
"source" : {
"type" : "text",
"analyzer" : "pattern"
},
"tags" : {
"type" : "text",
"analyzer" : "pattern"
},
"thread" : {
"type" : "text",
"analyzer" : "pattern"
}
}
}
}
I don't know why, but I didn't see any change in space after the force merge; maybe I did something wrong. I thought of taking that up next, but for the sake of completeness I am pasting the output here.
I had run
POST /index_name/_forcemerge
It ran and produced the following output:
{
"_shards" : {
"total" : 32,
"successful" : 32,
"failed" : 0
}
}
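For completeness, I understand from the docs that the more aggressive variant, which merges each shard down to a single segment, would be:

```
POST /index_name/_forcemerge?max_num_segments=1
```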
The output for GET /_cat/nodes?v is
10.240.0.26 5 98 9 0.01 0.09 0.08 i - coord4
10.240.0.132 48 98 30 2.10 2.19 2.42 mdi - node3
10.240.0.27 6 98 4 0.05 0.11 0.09 i - coord3
10.240.0.29 4 98 12 0.46 0.45 0.36 i - coord2
10.240.0.134 63 98 29 2.71 2.68 2.64 di - node5
10.240.0.28 11 98 23 0.57 0.40 0.36 i - coord1
10.240.0.12 21 98 25 1.71 2.11 2.25 mdi * node2
10.240.0.136 44 98 25 2.84 2.59 2.58 di - node7
10.240.0.135 34 96 28 3.21 2.81 2.73 di - node6
10.240.0.133 40 98 26 2.62 2.60 2.68 di - node4
10.240.0.11 72 97 34 2.72 2.48 2.39 mdi - node1
10.240.0.30 5 98 18 0.01 0.05 0.05 mi - master
The output for GET /_cat/indices?v
green open index_name APP-cEzLR-i4vrhoFyL20Q 16 1 6797306863 0 1.1tb 605.2gb
The output of GET /_cat/shards?v is
index_name 14 p STARTED 424166345 37.8gb 10.240.0.133 node4
index_name 14 r STARTED 424166414 37.7gb 10.240.0.135 node6
index_name 4 p STARTED 424171600 37.8gb 10.240.0.132 node3
index_name 4 r STARTED 424171083 37.7gb 10.240.0.136 node7
index_name 12 p STARTED 424184503 37.7gb 10.240.0.136 node7
index_name 12 r STARTED 424184511 37.7gb 10.240.0.11 node1
index_name 8 r STARTED 424192628 37.7gb 10.240.0.133 node4
index_name 8 p STARTED 424192671 37.9gb 10.240.0.135 node6
index_name 13 r STARTED 424165421 37.7gb 10.240.0.132 node3
index_name 13 p STARTED 424164957 37.8gb 10.240.0.11 node1
index_name 10 r STARTED 424182004 37.7gb 10.240.0.132 node3
index_name 10 p STARTED 424181156 37.8gb 10.240.0.12 node2
index_name 5 p STARTED 424184124 37.7gb 10.240.0.136 node7
index_name 5 r STARTED 424184124 37.6gb 10.240.0.11 node1
index_name 2 p STARTED 424158150 37.5gb 10.240.0.134 node5
index_name 2 r STARTED 424157772 37.4gb 10.240.0.132 node3
index_name 6 r STARTED 424162765 37.6gb 10.240.0.134 node5
index_name 6 p STARTED 424161952 37.7gb 10.240.0.11 node1
index_name 1 p STARTED 424148057 37.9gb 10.240.0.135 node6
index_name 1 r STARTED 424148503 37.8gb 10.240.0.12 node2
index_name 7 p STARTED 424234489 37.9gb 10.240.0.133 node4
index_name 7 r STARTED 424233599 39gb 10.240.0.12 node2
index_name 9 r STARTED 424193478 37.8gb 10.240.0.135 node6
index_name 9 p STARTED 424193779 37.7gb 10.240.0.134 node5
index_name 15 p STARTED 424190986 37.6gb 10.240.0.135 node6
index_name 15 r STARTED 424191265 37.6gb 10.240.0.134 node5
index_name 3 r STARTED 424224518 37.7gb 10.240.0.133 node4
index_name 3 p STARTED 424223699 37.8gb 10.240.0.12 node2
index_name 11 p STARTED 424207929 37.7gb 10.240.0.132 node3
index_name 11 r STARTED 424207417 37.7gb 10.240.0.136 node7
index_name 0 p STARTED 424158289 37.7gb 10.240.0.133 node4
index_name 0 r STARTED 424158641 37.7gb 10.240.0.134 node5
I do have a few other indices as well, but this is the one that is of major concern.
If I'm not mistaken, you have 2.46 TB of source data.
This becomes 605 GB (primaries only), so it's only 25% of the original size.
What makes you unhappy with a 75% compression ratio?
You can still optimize your mapping a bit, though. Remove the following fields:
- _size
- @version, unless you need it
- flags.keyword or flags, unless you are doing both full-text search on it AND aggregations
A lot of fields have a pattern analyzer. What is it for? Are you sure it needs to be applied to all the fields?
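As a sketch, the relevant parts of the mapping after those removals could look like this (field names taken from your mapping; adjust to your needs):

```json
"mappings" : {
  "doc" : {
    "properties" : {
      "@timestamp" : { "type" : "date" },
      "log" : {
        "properties" : {
          "flags" : { "type" : "keyword", "ignore_above" : 256 }
        }
      }
    }
  }
}
```

Here _size and @version are gone, and flags is a single keyword field instead of a text field plus a keyword subfield.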
Primary alone gives 25%, but in production, even with one replica, it's almost at 50%. And when considering other scenarios for production readiness, such as DR and HA, it shoots up even more. Hence the concern.
For the mapping: _size was kept intentionally, but the others will be removed. The pattern analyzer is also intentional, since the standard analyzer was not producing some desired search results: a.b was one token, whereas a1.b was 2 tokens.
The pattern analyzer is also intentional, since the standard analyzer was not producing some desired search results: a.b was one token, whereas a1.b was 2 tokens.
As long as you're sure it is emitting fewer tokens, there is probably nothing to gain here. Unless some fields are not meant to be queried with full-text search but only for exact values, in which case that's a waste of space IMHO.
If you really want to reduce disk space, you can consider removing doc_values and using fielddata instead. But that will use a lot of memory.
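To illustrate the trade-off (a sketch reusing a field from your mapping): a text field stores no doc_values, and enabling fielddata on it lets you aggregate on it using JVM heap instead of disk:

```json
"log_level" : {
  "type" : "text",
  "analyzer" : "pattern",
  "fielddata" : true
}
```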
Primary alone gives 25%, but in production, even with one replica, it's almost at 50%.
Yeah. But the size of the original replicated data is then 4.92 TB, which you can compare to 1.2 TB, still 25% of the source data.
You can't decently compare apples and oranges. Otherwise you could also tell me that if you have 3 replicas, it's the same size as the original data, which is a bit of an unfair comparison IMO.
And when considering other scenarios for production readiness such as DR and HA it shoots up even more. Hence the concern.
Sure. You can't really win on all aspects. If you need redundancy, then you need to copy the data.
If you need to save disk space, you have to give up on speed, memory or redundancy.
Always a price to pay.
I'm not saying that you should not try to optimize as much as you can, but I really think that the price of storage is cheaper than the price of over-engineering (and all the side effects you are not thinking about).
Then, depending on your use case, i.e. if you have time-based data (a timestamp + data), you should consider using:
- Time based indices
- Index Lifecycle Management (or curator)
- Rollover API
The rollover API can be a huge win in terms of disk space for such a use case. So big that you wouldn't even think about disabling the _source, which I'd never do.
But again, it depends. On your use case mainly.
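For reference, the Rollover API works against a write alias; a minimal sketch with placeholder names and thresholds:

```json
PUT /logs-000001
{
  "aliases" : { "logs_write" : {} }
}

POST /logs_write/_rollover
{
  "conditions" : {
    "max_age" : "7d",
    "max_size" : "50gb"
  }
}
```

When a condition is met, a new index (logs-000002) is created and the alias switches to it.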
Except for the log level, which I need to change to keyword, the remaining fields need to be queried. For example, the hostname field has values such as google.com, and with the standard analyzer I can't search for terms like google alone unless I use a wildcard search. That's the reason for using the pattern analyzer.
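The difference is easy to see with the _analyze API; for example:

```json
POST /_analyze
{
  "analyzer" : "pattern",
  "text" : "google.com"
}
```

The pattern analyzer's default pattern (\W+) splits on the dot, so this yields the tokens google and com, whereas the standard analyzer keeps google.com as a single token.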
Sure, I will look into this. Thank you for the insight.
As you mentioned, mine are time-based indices and my source volume is 2.46 TB. When I mentioned the replica size, what I meant was that in my setup I have 1 set of primary shards and the replica count is set to 1; thus the eventual store size is 1.1 TB. I know it's not a very fair comparison, but my eventual ratio of source to store is 50%, and that is without the additional replicas that I might have to add for HA/DR.
I completely agree on this. That is the reason why I am trying to figure out the best bet that I can get.
I am already using time-based indices; the index name that I posted didn't reflect that, for privacy reasons.
I have Curator to clear indices that are older than my retention period.
I have not considered using the rollover API since I have already estimated my index size and set the number of primary shards accordingly, at 30-50 GB per shard according to the Elastic documentation (this might not be fully reflected in the output of the queries that I previously posted, since this is a test setup).
If your reasons are anything apart from what I mentioned, then please point me to a resource that describes them and I will surely go ahead with it.
Oops, I was tired.
I meant the Rollup API, not the Rollover API. Sorry.
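A rollup job is created roughly like this (a sketch with placeholder names; the exact options depend on your Elasticsearch version):

```json
PUT _rollup/job/logs_rollup
{
  "index_pattern" : "index_name-*",
  "rollup_index" : "logs_rolled",
  "cron" : "0 0 2 * * ?",
  "page_size" : 1000,
  "groups" : {
    "date_histogram" : {
      "field" : "@timestamp",
      "interval" : "1h"
    }
  }
}
```

It periodically pre-aggregates old data into the much smaller rollup index, after which the raw indices can be deleted.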
Haha. No problem at all. You are already helping a lot. Please don't apologize.
I will look into this; it seems like it will come in handy, since it would reduce the disk usage of the older data.