There have been an incredible number of changes from 1.x to 5.x, both to geo fields and to regular fields. The entire geo infrastructure has moved over to BKD trees, fields have doc values indexed by default now, etc.
And the comparison is between a cluster and a single machine, which means compression will be very different in each scenario (because data will be spread or colocated).
I'm not sure there is any reasonable way to compare these... it's apples vs oranges I'm afraid.
There are also a ton of questions that need to be addressed:
How many documents?
How many shards?
How are you determining cluster size? 7.4mb seems too small to be a real value even for a tiny index.
Have you verified both clusters have the exact same documents?
etc.
I'd suggest putting together a more rigorous test that rules out as many variables as possible. But I'm honestly not sure it's worth the time; a lot has changed since 1.x.
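As a starting point for such a test, the document count, shard layout, and on-disk size can be pulled the same way from both installations. The sketch below is only an illustration: the host, the index name `shapes`, and the helper names `comparison_urls` and `fetch_all` are assumptions, and it presumes an unauthenticated HTTP endpoint. It relies on the `_count`, `_cat/shards`, and `_cat/indices` endpoints.

```python
from urllib.request import urlopen


def comparison_urls(base_url, index):
    """Build the URLs that answer: how many docs, how many shards, how big on disk."""
    base = base_url.rstrip("/")
    return {
        "doc_count": f"{base}/{index}/_count",
        "shards": f"{base}/_cat/shards/{index}?v",
        "index_size": f"{base}/_cat/indices/{index}?v",
    }


def fetch_all(base_url, index, timeout=10):
    """Fetch each endpoint and return the raw response text keyed by question."""
    results = {}
    for name, url in comparison_urls(base_url, index).items():
        with urlopen(url, timeout=timeout) as resp:
            results[name] = resp.read().decode("utf-8")
    return results
```

Running `fetch_all("http://localhost:9200", "shapes")` against each installation and comparing the outputs side by side would at least confirm that both hold the same number of documents before the sizes are compared.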
7.1 MB looks to be the original JSON data size, and the explosion in size can be thought of as similar to the cost of converting a vector graphic into a raster one. While the vector graphic may describe a diagonal line using only 2 coordinates, an equivalent high-resolution, massive-scale raster format would be much bigger. Like the pixels in an image, entries in the index represent the cells in a space whose resolution you choose.
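To make the analogy concrete, here is a rough sketch that counts how many grid cells a single diagonal line touches as the grid gets finer. It is not how Elasticsearch computes its index entries; it assumes a plain uniform grid over the unit square, approximates the line by sampling, and the function name `cells_touched` is made up for the illustration.

```python
def cells_touched(start, end, level, samples=100_000):
    """Count cells of a 2**level x 2**level grid over the unit square
    that the segment start -> end passes through (approximated by sampling)."""
    n = 2 ** level
    (x0, y0), (x1, y1) = start, end
    cells = set()
    for i in range(samples):
        t = i / samples
        x = x0 + t * (x1 - x0)
        y = y0 + t * (y1 - y0)
        # Clamp so that points on the far edge still fall in the last cell.
        cells.add((min(int(x * n), n - 1), min(int(y * n), n - 1)))
    return len(cells)


if __name__ == "__main__":
    # Two coordinates describe the line; the cell count grows with resolution.
    for level in (2, 4, 8, 10):
        print(level, cells_touched((0.0, 0.0), (1.0, 1.0), level))
```

The line itself never changes, yet every extra level of precision roughly doubles the number of cells needed to cover it, which is the kind of growth the raster comparison describes.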
Today I installed version 1.5.2 on the same single machine: indexing is instantaneous and the index size is small, as reported.
Whatever the change has been with version 5.x, it is huge and I might have missed it, but it would have been good to document it more prominently.
Please note that I run my test with exactly the same script, so the mapping and the data to index are the same.
My feeling is that for normal shapes things can be fast, but for complex shapes like countries indexing is very slow and the index is very big. The vector-to-bitmap comparison makes sense; I am just not sure whether the gain on small shapes was worth the huge loss on bigger shapes.