Hello,
I hypothesized that the 4 octets of an IP address would take less space to store than the up to 15 UTF-8 (?) characters of its string form. I seem to be wrong. I tested this by populating 2 indexes with the same randomly generated IP address documents. Index 1 has a mapping with the appropriate "ip" type. Index 2 has no explicit mapping.
❯ python -c "import json,random; print '\n'.join([json.dumps({'index': {}}) + '\n' + json.dumps({'ip': '.'.join('%s' % random.randint(0,255) for i in range(4))}) for x in range(10000000/4)])" > ips-for-bulk-insert.json
❯ ES_HEAP_SIZE=5g /opt/elasticsearch-2.4.2/bin/elasticsearch --node.master=true --node.data=true --path.data=./delete-me &
❯ curl -s -XPOST localhost:9200/myindex1/ --data '{ "mappings": { "mytype": { "properties": { "ip": { "type": "ip" } } } } }'
❯ curl -s -XPOST localhost:9200/myindex1/mytype/_bulk --data-binary @ips-for-bulk-insert.json > /dev/null 2>&1
❯ curl -s -XPOST localhost:9200/myindex2/mytype/_bulk --data-binary @ips-for-bulk-insert.json > /dev/null 2>&1
❯ curl -s -XPOST "localhost:9200/_forcemerge?max_num_segments=1"
❯ du -h ips-for-bulk-insert.json
91M ips-for-bulk-insert.json
❯ du -sh ./delete-me/elasticsearch/nodes/0/indices/myindex1
151M ./delete-me/elasticsearch/nodes/0/indices/myindex1
❯ du -sh ./delete-me/elasticsearch/nodes/0/indices/myindex2
152M ./delete-me/elasticsearch/nodes/0/indices/myindex2
❯ curl localhost:9200/_mapping
{"myindex2":{"mappings":{"mytype":{"properties":{"ip":{"type":"string"}}}}},"myindex1":{"mappings":{"mytype":{"properties":{"ip":{"type":"ip"}}}}}}
So, I'm a little confused by this result. Is there some kind of overhead I'm not accounting for? Is there a mistake in my experiment? Is string compression just that good?
One thing I noticed was that, before I forced the merge, the IP index was actually bigger than the string index.
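For reference, here is a rough sanity check of the sizes I was expecting, as a minimal Python 3 sketch (the generator one-liner above is Python 2). The helper names `packed_size` and `text_size` are made up for illustration. The index sizes are copied from the `du` output above, and since `du -h` rounds and counts disk blocks, the per-document figures are only approximate.

```python
import socket

def packed_size(ip):
    # inet_aton packs a dotted-quad IPv4 string into 4 raw bytes
    return len(socket.inet_aton(ip))

def text_size(ip):
    # Digits and dots are ASCII, so UTF-8 uses one byte per character
    return len(ip.encode("utf-8"))

# Average text length of an address built from four uniform octets in 0..255
avg_digits = sum(len(str(n)) for n in range(256)) / 256.0
avg_text = 4 * avg_digits + 3  # four octets plus three dots

DOCS = 10000000 // 4   # document count used by the generator one-liner
MIB = 1024 * 1024      # assumption: du -h reports "M" in units of 1024*1024 bytes
index_mib = {"myindex1 (ip)": 151, "myindex2 (string)": 152}  # from du above

print("packed bytes:", packed_size("255.255.255.255"))
print("longest text bytes:", text_size("255.255.255.255"))
print("average text bytes:", round(avg_text, 2))
for name, mib in index_mib.items():
    print(name, "bytes per doc:", round(mib * MIB / float(DOCS), 1))
```

By this estimate the string form averages about 13.3 bytes against 4 packed bytes, so I expected a saving of roughly 9 bytes per document. Both indexes instead come out at roughly 63 bytes per document, which makes me suspect that something other than the ip value itself dominates, though I have not verified where those bytes go.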
Any insights would be appreciated. Thanks.