Should there be disk savings for storing fields as IP rather than String?


#1

Hello,
I hypothesized that the 4 octets of an IP address would be smaller to store than its 15 UTF8 (?) characters. I seem to be wrong. I tested this by populating 2 indexes with the same set of randomly generated IP address documents. Index 1 has a mapping with the appropriate "ip" type. Index 2 has no explicit mapping.
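
To make the size intuition concrete, here is a minimal sketch comparing the raw byte lengths of one address in both forms. It only measures the raw values, not what Elasticsearch or Lucene actually writes to disk, and the helper name `raw_sizes` is made up for illustration.

```python
import ipaddress


def raw_sizes(ip_text):
    """Return (packed_bytes, utf8_bytes) for a dotted-quad IPv4 string."""
    packed = ipaddress.IPv4Address(ip_text).packed  # the 4 raw octets
    text = ip_text.encode("utf-8")                  # 7 to 15 bytes as text
    return len(packed), len(text)


print(raw_sizes("255.255.255.255"))
print(raw_sizes("1.2.3.4"))
```

The text form is only 15 bytes in the worst case; short addresses such as `1.2.3.4` take 7.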

❯ python -c "import json,random; print '\n'.join([json.dumps({'index': {}}) + '\n' + json.dumps({'ip': '.'.join('%s' % random.randint(0,255) for i in range(4))}) for x in range(10000000/4)])" > ips-for-bulk-insert.json
❯ ES_HEAP_SIZE=5g /opt/elasticsearch-2.4.2/bin/elasticsearch --node.master=true --node.data=true --path.data=./delete-me &
❯ curl -s -XPOST localhost:9200/myindex1/ --data '{ "mappings": { "mytype": { "properties": { "ip": { "type": "ip" } } } } }'
❯ curl -s -XPOST localhost:9200/myindex1/mytype/_bulk --data-binary @ips-for-bulk-insert.json > /dev/null 2>&1
❯ curl -s -XPOST localhost:9200/myindex2/mytype/_bulk --data-binary @ips-for-bulk-insert.json > /dev/null 2>&1
❯ curl -s -XPOST "localhost:9200/_forcemerge?max_num_segments=1"
❯ du -h ips-for-bulk-insert.json
91M ips-for-bulk-insert.json
❯ du -sh ./delete-me/elasticsearch/nodes/0/indices/myindex1
151M ./delete-me/elasticsearch/nodes/0/indices/myindex1
❯ du -sh ./delete-me/elasticsearch/nodes/0/indices/myindex2
152M ./delete-me/elasticsearch/nodes/0/indices/myindex2
❯ curl localhost:9200/_mapping
{"myindex2":{"mappings":{"mytype":{"properties":{"ip":{"type":"string"}}}}},"myindex1":{"mappings":{"mytype":{"properties":{"ip":{"type":"ip"}}}}}}
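
The data generator at the top of this transcript uses Python 2 syntax (the `print` statement and integer `/`). A rough Python 3 equivalent is sketched below; the function names `bulk_lines` and `write_bulk_file` and the `seed` parameter are hypothetical additions, not part of the original experiment.

```python
import json
import random


def bulk_lines(count, seed=None):
    """Yield bulk-API lines: an action line, then a document with a random IPv4."""
    rng = random.Random(seed)
    for _ in range(count):
        ip = ".".join(str(rng.randint(0, 255)) for _ in range(4))
        yield json.dumps({"index": {}})
        yield json.dumps({"ip": ip})


def write_bulk_file(path, count):
    """Write newline-terminated bulk lines to path."""
    with open(path, "w") as out:
        for line in bulk_lines(count):
            out.write(line + "\n")
```

Calling `write_bulk_file("ips-for-bulk-insert.json", 10000000 // 4)` would produce the same number of documents as the one-liner.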

So, I'm a little confused by this result. Is there some kind of overhead I'm not accounting for? Some mistakes in my experiment? Is string compression just that good?

One thing I noticed was, before I forced merging, the IP index was actually bigger than the String index.

Any insights would be appreciated. Thanks.


(Tanguy) #2

Hi,

This looks very similar... can you please explain what confused you in your results?

Note that it's perfectly fine for the index files on the filesystem to have different sizes before the forced merge: a field mapped with the "ip" type is indexed differently from a string field (it is indexed as a numeric value to allow fast range queries).
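
As an illustration of why a numeric representation suits range queries, here is a small sketch: an IPv4 address maps to a single integer, so a range test is two integer comparisons. This only shows the idea and is not Elasticsearch's internal code; `ip_to_int` and `in_range` are hypothetical helpers.

```python
import ipaddress


def ip_to_int(ip_text):
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip_text))


def in_range(ip_text, low, high):
    """Check low <= ip <= high using integer comparison."""
    return ip_to_int(low) <= ip_to_int(ip_text) <= ip_to_int(high)


print(ip_to_int("10.0.0.1"))
print(in_range("10.0.3.7", "10.0.0.0", "10.0.255.255"))
# Plain string comparison orders these two addresses the wrong way round:
print("10.0.9.1" < "10.0.10.1", ip_to_int("10.0.9.1") < ip_to_int("10.0.10.1"))
```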


(Tanguy) #3

Also, you might be interested in this blog article:


#4

I was hoping that myindex1, which has the proper IP mapping, would be much smaller than myindex2, which has no mapping and defaults to String. We're planning to scale our system up and I was hoping for some space savings by using proper types.

Thank you for the article link. It was a very interesting read. Just to be clear: The experiment was done with ES 2.4.2, and all the IPs were IPv4. So the increased overhead of storing everything as IPv6 shouldn't have affected us, if I read correctly.
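
For reference, a sketch of what "storing everything as IPv6" could mean for an IPv4 value, assuming the usual IPv4-mapped IPv6 form (`::ffff:a.b.c.d`). Whether that is exactly the encoding the article describes is an assumption here, and `mapped_ipv6_bytes` is a hypothetical helper.

```python
import ipaddress


def mapped_ipv6_bytes(ipv4_text):
    """Return the 16-byte IPv4-mapped IPv6 form of a dotted-quad address."""
    v4 = ipaddress.IPv4Address(ipv4_text)
    # 10 zero bytes, then 0xffff, then the 4 IPv4 octets
    return b"\x00" * 10 + b"\xff\xff" + v4.packed


raw = mapped_ipv6_bytes("192.0.2.1")
print(len(raw), ipaddress.IPv6Address(raw))
```

Under that assumption each value is 16 bytes before any compression, versus 4 bytes for a bare IPv4 address.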


#5

I just did a follow up test using ES 5.1.2, the same data set, and the following mapping:
❯ curl localhost:9200/_mapping
{"ip-index":{"mappings":{"mytype":{"properties":{"ip":{"type":"ip"}}}}},"string-index":{"mappings":{"mytype":{"properties":{"ip":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}}}}}

Note that I only explicitly set ip-index's mapping; string-index's mapping was auto-generated.

The results were a size of 143MB for the ip-index and 217MB for the string-index. So, it looks like there are savings; I just need to upgrade to ES 5.
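
A quick sketch of the arithmetic behind both experiments, using only the sizes reported in this thread (`relative_saving` is a hypothetical helper). Keep in mind that the auto-generated 5.1.2 mapping indexes the field as both text and keyword, so the second figure compares the ip type against that default.

```python
def relative_saving(ip_mb, string_mb):
    """Fraction by which the ip-typed index is smaller than the string one."""
    return 1.0 - ip_mb / string_mb


# ES 2.4.2: 151M (ip) vs 152M (string)
print(round(relative_saving(151, 152) * 100, 1))
# ES 5.1.2: 143MB (ip) vs 217MB (text + keyword)
print(round(relative_saving(143, 217) * 100, 1))
```

That works out to under 1% saved on 2.4.2 and roughly 34% on 5.1.2.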

Thanks for pointing me in the right direction.

