Should there be disk savings for storing fields as IP rather than String?

aeio · January 16, 2017, 5:05am

Hello,
I hypothesized that the 4 octets of an IPv4 address would be smaller to store than its up to 15 UTF-8 (?) characters. I seem to be wrong. I tested this by populating 2 indexes with the same randomly generated IP address documents. Index 1 has a mapping with the appropriate "ip" type. Index 2 has no explicit mapping.
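
To make the hypothesis concrete, here is a minimal sketch comparing the raw size of an IPv4 address as 4 packed octets with its size as UTF-8 text. The helper name `sizes` is made up for illustration, and this only measures the raw values, not what Elasticsearch actually writes to disk.

```python
import ipaddress

def sizes(ip_text):
    """Return (binary_bytes, text_bytes) for a dotted-quad IPv4 string."""
    packed = ipaddress.IPv4Address(ip_text).packed  # the 4 raw octets
    return len(packed), len(ip_text.encode("utf-8"))

print(sizes("255.255.255.255"))  # (4, 15)
print(sizes("1.2.3.4"))          # (4, 7)
```

The text form ranges from 7 to 15 bytes, so 15 is the worst case rather than the typical size.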

❯ python -c "import json,random; print '\n'.join([json.dumps({'index': {}}) + '\n' + json.dumps({'ip': '.'.join('%s' % random.randint(0,255) for i in range(4))}) for x in range(10000000/4)])" > ips-for-bulk-insert.json
❯ ES_HEAP_SIZE=5g /opt/elasticsearch-2.4.2/bin/elasticsearch --node.master=true --node.data=true --path.data=./delete-me &
❯ curl -s -XPOST localhost:9200/myindex1/ --data '{ "mappings": { "mytype": { "properties": { "ip": { "type": "ip" } } } } }'
❯ curl -s -XPOST localhost:9200/myindex1/mytype/_bulk --data-binary @ips-for-bulk-insert.json > /dev/null 2>&1
❯ curl -s -XPOST localhost:9200/myindex2/mytype/_bulk --data-binary @ips-for-bulk-insert.json > /dev/null 2>&1
❯ curl -s -XPOST "localhost:9200/_forcemerge?max_num_segments=1"
❯ du -h ips-for-bulk-insert.json
91M ips-for-bulk-insert.json
❯ du -sh ./delete-me/elasticsearch/nodes/0/indices/myindex1
151M ./delete-me/elasticsearch/nodes/0/indices/myindex1
❯ du -sh ./delete-me/elasticsearch/nodes/0/indices/myindex2
152M ./delete-me/elasticsearch/nodes/0/indices/myindex2
❯ curl localhost:9200/_mapping
{"myindex2":{"mappings":{"mytype":{"properties":{"ip":{"type":"string"}}}}},"myindex1":{"mappings":{"mytype":{"properties":{"ip":{"type":"ip"}}}}}}
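
The generator one-liner above is Python 2 (`print` statement, integer `/`). For anyone repeating the experiment on Python 3, a rough equivalent might look like the sketch below. The function names and the `seed` parameter are assumptions added for illustration; the output format (an empty `index` action line followed by an `{"ip": ...}` document line) matches the original command.

```python
import json
import random

def bulk_lines(count, seed=None):
    """Yield bulk-API lines: an empty index action, then one {"ip": ...} document."""
    rng = random.Random(seed)  # seed is optional, only for repeatable runs
    for _ in range(count):
        ip = ".".join(str(rng.randint(0, 255)) for _ in range(4))
        yield json.dumps({"index": {}})
        yield json.dumps({"ip": ip})

def write_bulk_file(path, count, seed=None):
    """Write the bulk body to path, one JSON object per line, newline-terminated."""
    with open(path, "w") as f:
        for line in bulk_lines(count, seed):
            f.write(line + "\n")

# Usage matching the original experiment (2,500,000 documents):
# write_bulk_file("ips-for-bulk-insert.json", 10000000 // 4)
```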

So, I'm a little confused by this result. Is there some kind of overhead I'm not accounting for? Some mistakes in my experiment? Is string compression just that good?

One thing I noticed was, before I forced merging, the IP index was actually bigger than the String index.

Any insights would be appreciated. Thanks.

tanguy · January 16, 2017, 11:44am

Hi,

These look very similar... can you please explain what confused you in your results?

Note that it's perfectly fine for the index files on the filesystem to have different sizes before the forced merge: a field mapped with the "ip" type is indexed differently (as a numeric, to allow fast range queries) from a field mapped with the string type.
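
As a rough illustration of why a numeric representation helps with range queries, the sketch below converts dotted-quad addresses to integers and compares them. It shows the general idea only, not Elasticsearch's internal encoding; `ip_to_int` and `in_range` are hypothetical helper names.

```python
import ipaddress

def ip_to_int(ip_text):
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip_text))

def in_range(ip_text, low, high):
    """True if ip_text lies between low and high (inclusive), compared numerically."""
    return ip_to_int(low) <= ip_to_int(ip_text) <= ip_to_int(high)

print(ip_to_int("10.0.0.1"))                         # 167772161
print(in_range("10.0.0.57", "10.0.0.0", "10.0.0.255"))  # True
print("9.0.0.1" < "10.0.0.1")                        # False: string order is lexicographic
print(ip_to_int("9.0.0.1") < ip_to_int("10.0.0.1"))  # True: numeric order is correct
```

Comparing the strings directly gives the wrong order, which is why range queries on a plain string field do not behave like range queries on an "ip" field.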

tanguy · January 16, 2017, 11:46am

Also, you might be interested in this blog article:

aeio · January 16, 2017, 5:50pm

I was hoping that myindex1, which has the proper IP mapping, would be much smaller than myindex2, which has no mapping and defaults to String. We're planning to scale our system up and I was hoping for some space savings by using proper types.

Thank you for the article link. It was a very interesting read. Just to be clear: The experiment was done with ES 2.4.2, and all the IPs were IPv4. So the increased overhead of storing everything as IPv6 shouldn't have affected us, if I read correctly.
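
For context on the IPv6 overhead mentioned above, here is a small sketch of what representing an IPv4 address as a 16-byte IPv6 value looks like. It assumes the standard IPv4-mapped form (`::ffff:a.b.c.d`); that this is the layout used on disk is an assumption here, and the helper name `as_ipv6_bytes` is made up.

```python
import ipaddress

def as_ipv6_bytes(ip_text):
    """Return the 16-byte IPv4-mapped IPv6 form (::ffff:a.b.c.d) of an IPv4 string."""
    v4 = ipaddress.IPv4Address(ip_text)
    # 10 zero bytes, then 0xffff, then the 4 IPv4 octets
    return b"\x00" * 10 + b"\xff\xff" + v4.packed

raw = as_ipv6_bytes("192.168.1.10")
print(len(raw))  # 16
print(ipaddress.IPv6Address(raw).ipv4_mapped)  # 192.168.1.10
```

The first 12 bytes are identical for every IPv4 address, so they are highly compressible.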

aeio · January 16, 2017, 7:00pm

I just did a follow up test using ES 5.1.2, the same data set, and the following mapping:
❯ curl localhost:9200/_mapping
{"ip-index":{"mappings":{"mytype":{"properties":{"ip":{"type":"ip"}}}}},"string-index":{"mappings":{"mytype":{"properties":{"ip":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}}}}}

Note that I only explicitly set ip-index's mapping; string-index's mapping was auto-generated.

The results were a size of 143MB for the ip-index and 217MB for the string-index. So it looks like there are savings; I just need to upgrade to ES5.
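
To put the two experiments side by side, the relative savings can be worked out from the sizes reported in this thread. This is a back-of-the-envelope sketch; `savings_percent` is a hypothetical helper. Keep in mind that the auto-generated ES 5 mapping indexes the value twice (as `text` and as a `keyword` sub-field), so the second comparison is not strictly like for like.

```python
def savings_percent(typed_mb, string_mb):
    """Percentage by which the typed index is smaller than the string index."""
    return round(100.0 * (string_mb - typed_mb) / string_mb, 1)

print(savings_percent(151, 152))  # ES 2.4.2: 0.7
print(savings_percent(143, 217))  # ES 5.1.2: 34.1
```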

Thanks for pointing me in the right direction.
