I can do that, I will write something up and report back.
I wrote a small C# .NET app that indexes directly into ES, and I got the same size difference and the same very slow indexing.
For ES 1.7.2 I used NEST client 1.7.2, and for ES 2.3.2 I used NEST client 2.3.2. I think the 1.7.2 client was also able to index into ES 2.3.2, but I tested with the matching clients to stay within the supported ES/client combinations.
Stats from head plugin
ES 1.7.2
size: 15.5Mi (15.5Mi)
docs: 2,000 (2,000)
ES 2.3.2
size: 2.84Gi (2.84Gi)
docs: 2,000 (2,000)
Mapping (Same mapping)
{
  "mappings": {
    "tomato": {
      "_all": { "enabled": false },
      "properties": {
        "id": { "type": "string", "index": "not_analyzed", "doc_values": false },
        "dateTimeCreated": { "type": "date" },
        "dateTimeModified": { "type": "date" },
        "name": { "type": "string" },
        "description": { "type": "string" },
        "isPublic": { "type": "boolean" },
        "tomatoCenter": { "type": "geo_point", "geohash": true, "geohash_prefix": true, "geohash_precision": 3 },
        "tomatoShape": { "type": "geo_shape", "tree": "quadtree", "precision": "1m" },
        "type": { "type": "string" },
        "farmId": { "type": "string", "index": "not_analyzed", "doc_values": false }
      }
    }
  }
}
Same Content (Sample), all shapes are "circle"
{
  "_index": "upgarde-test",
  "_type": "tomato",
  "_id": "AVURtqKTfJzofNCcHO8s",
  "_version": 1,
  "_score": 1,
  "_source": {
    "description": "Random batchNo5 #tomatoIdx3",
    "tomatorShape": {
      "coordinates": [ -87.21241972439854, 41.53887701360597 ],
      "type": "circle",
      "radius": 4444
    },
    "tomatoCenter": [ -87.21241972439854, 41.53887701360597 ],
    "isPublic": true,
    "name": "Batch#5 Count#3",
    "tags": [ "batchNo5", "tomatoIdx3" ],
    "type": "tomato",
    "farmId": "farm_c6840442-7312-473c-8501-ed035dcc65bf",
    "dateTimeCreated": "0001-01-01T00:00:00"
  }
}
C# code to index
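// Assumption: `i` is the batch index from an enclosing loop that is not shown in this snippet.
var bulkDescriptor = new BulkDescriptor();
for (int b = 0; b < 200; b++)
{
    var tomato = GetRandomTomato(i, b);
    bulkDescriptor.Index<Tomato>(op => op.Document(tomato).Index(EsIndexName));
}
// Send all 200 documents of this batch in a single bulk request.
var result = ElasticNestClient.Bulk(bulkDescriptor);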
Similar issue reported here
I replied to the other issue, Elasticsearch upgrade from 1.7.1 to 2.3.2 then create index very slow
Do you think you can do the same?
I just did that, and it made a night-and-day difference.
The index is now even smaller in the new ES version (15Mi old vs 12.8Mi new), and indexing is almost instantaneous, although I only indexed 2,000 docs.
ES 2.3.2
size: 12.8Mi (12.8Mi)
docs: 2,000 (2,000)
Before, I was using a separate, new data folder for the new version. This time I created the index in the old version (index creation only, no data), pointed the new version at the same data folder, and then indexed the data using the new ES version.
Will you be fixing this? Should we wait for a fix?
Do you think you can provide me with a reproduction, including data? I think I need to go and debug this myself, but at least we have a smoking gun now.
Will you be fixing this? Should we wait for a fix?
Of course, this is a major regression! I still don't know what's going on, but please bear with me and help me figure it out!
Sure, I can. Just let me know in detail what you need.
If you could provide me with the code and the data, that would be awesome.
Here is the test app on GitHub, it is a C# Windows console app.
It generates the data and indexes it into ES using the NEST client. Each run generates and indexes 2,000 docs. You can set the ES node URL and the index name in the App.config (ElasticUpgradeTest.exe.config) file.
There is a compiled exe in the Compiled folder, or you can open the solution in Visual Studio 2015 and run it from there. The app targets .NET Framework 4.5.2.
The mapping is in the mapping.txt file, it is just one type with some geo fields.
I was able to reproduce the issue using this app. You probably won't need my index data files, since all the data is generated by the app, but I can provide them too if you need them.
Let me know if you have any issues running it.
One more piece of debugging information: I think the geo fields are what is causing the problem.
In the normal scenario, where I create the index and then index the data in the new ES version, everything is fine if I don't set values for my geo fields (simply comment out line 73 in DataGenerator.cs in my app); the index size stays under 1Mi. If I do set values for my geo fields, things go nuts, and it only works if I use the workaround you mentioned earlier.
I think this is the same as
Try changing your geo-shape mapping to specify:
"distance_error_pct": 0.025
See for the docs
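For anyone landing here later, this is a minimal sketch of how that suggestion might look when applied to the tomatoShape property from the mapping posted above. Only "distance_error_pct" is new; the other three settings are copied unchanged from the original mapping. The wrapping braces are there only to make the fragment valid JSON on its own, so in practice the entry goes under "properties" of the "tomato" type.

```json
{
  "tomatoShape": {
    "type": "geo_shape",
    "tree": "quadtree",
    "precision": "1m",
    "distance_error_pct": 0.025
  }
}
```

The value 0.025 is simply the one suggested above; whether a different value suits a given data set better is something to check against the geo_shape docs for the ES version in use.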
That fixed my problem, thanks!
I would ask, though, why this behavior was changed from the 1.x versions. I don't know the full business or technical case, but I would like to know.
I would say "distance_error_pct" should never be zero unless it is set to zero explicitly.
Thinking more about it, and considering the difference in index size and indexing speed, I would say that parameter should never be zero.