Upgrade from Elasticsearch 1.3.2 to 2.3.1 and indexes take more space

Apparently it's really hard to figure out what is going on without any way to reproduce it. Can somebody try to provide a reproduction that we can run on our end without downloading tons of data?

Also, can somebody provide segment stats for this index: curl -XGET 'http://localhost:9200/_segments?verbose=true'

Hi Simon

Please find the segment details in the gists below:

1.3.2 segments

2.3.1 segments

Hey, thanks for the stats. Can you provide the output with ?verbose=true? I wanted to see the lower-level details. I also wonder why all these shards have more than one segment if you ran a force merge with commit=true&wait_for_completion=true.

There are also differences between the number of segments and how many segments are committed, which suggests you didn't refresh after the force merge. Old segment files are only deleted once they go out of scope, so there might be open readers still holding on to the old segments.

While looking at it I realized that the distribution of documents might be totally different, since we changed the hash function. That rules out a shard-by-shard comparison, so it's possible that shard X is much bigger in 2.x than in 1.x just due to a different distribution. I think what we need is more insight into what takes the space: can you run force_merge again, then flush AND refresh, so we can really tell what the differences are? Please also get us the indices stats: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html
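On the 2.x side, something along these lines should do it (the index name is just a placeholder; on 1.x the equivalent of _forcemerge is the _optimize endpoint):

curl -XPOST 'http://localhost:9200/your_index/_forcemerge?max_num_segments=1'
curl -XPOST 'http://localhost:9200/your_index/_flush'
curl -XPOST 'http://localhost:9200/your_index/_refresh'
curl -XGET 'http://localhost:9200/your_index/_stats'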

So 2 min for 5029 is already nuts. Can you explain how you index your data? I.e., I am interested in things like:

  • bulk vs. one-doc-at-a-time indexing
  • index settings
  • do you call any APIs per indexing request

I can see some geo fields in your mapping; are you indexing complex polygons? Can you provide some data that takes so long, including the mapping etc., so we can try to reproduce?

Simon
We are using the Couchbase transport plugin; it takes docs from Couchbase and indexes them into Elasticsearch.

I am guessing the plugin uses bulk indexing, as things get slightly better when increasing threadpool.bulk.queue_size (still so slow that it is not usable).
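For reference, I bumped that setting in elasticsearch.yml roughly like this (the value shown is just an example, not necessarily what I ended up with):

threadpool.bulk.queue_size: 1000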

The CB plugin does all the indexing; I never index directly into ES. We create/update in CB, then the plugin uses XDCR replication to replicate into ES.

Our shapes are all circles, nothing complex, and there are no specific index settings apart from the mapping. The difference is stark, as I have the same mapping running in the old and the new version (note that there is a different CB transport plugin version for each ES version).

The easiest way I can show you this is to have an online meeting or something and show the real deal on our test machines.

dom,

I need you to take variables out of the picture. Can you try to manually index stuff into ES without the Couchbase plugin? It's crucial for me to figure out what is going on. Can you also paste your index settings, please?

Simon,
I can do that; I will write something up and report back.

Simon
I wrote a small C# .NET app to index directly into ES, and I got the same size differences and really slow indexing.

For ES 1.7.2 I used NEST client 1.7.2, and for ES 2.3.2 I used NEST client 2.3.2. I think the 1.7.2 client was able to index into ES 2.3.2 as well, but I tested with the different clients anyway to adhere to the supported ES/client combinations.

Stats from the head plugin:
ES 1.7.2
upgarde-test
size: 15.5Mi (15.5Mi)
docs: 2,000 (2,000)

ES 2.3.2
upgarde-test
size: 2.84Gi (2.84Gi)
docs: 2,000 (2,000)

Mapping (same mapping for both versions)
{ "mappings": { "tomato": { "_all" :{ "enabled" : false }, "properties": { "id": { "type": "string", "index": "not_analyzed", "doc_values": false }, "dateTimeCreated": { "type": "date" }, "dateTimeModified": { "type": "date" }, "name": { "type": "string" }, "description": { "type": "string" }, "isPublic": { "type": "boolean" }, "tomatoCenter": { "type": "geo_point", "geohash": true, "geohash_prefix": true, "geohash_precision": 3 }, "tomatoShape": { "type": "geo_shape", "tree": "quadtree", "precision": "1m" }, "type": { "type": "string" }, "farmId": { "type": "string", "index": "not_analyzed", "doc_values": false } } } } }

Same content (sample); all shapes are "circle"
{ "_index": "upgarde-test", "_type": "tomato", "_id": "AVURtqKTfJzofNCcHO8s", "_version": 1, "_score": 1, "_source": { "description": "Random batchNo5 #tomatoIdx3", "tomatorShape": { "coordinates": [ -87.21241972439854 , 41.53887701360597 ], "type": "circle", "radius": 4444 }, "tomatoCenter": [ -87.21241972439854 , 41.53887701360597 ], "isPublic": true, "name": "Batch#5 Count#3", "tags": [ "batchNo5" , "tomatoIdx3" ], "type": "tomato", "farmId": "farm_c6840442-7312-473c-8501-ed035dcc65bf", "dateTimeCreated": "0001-01-01T00:00:00" } }

C# code to index

// Queue 200 documents into a single bulk request (i comes from the app's outer loop)
var bulkDescriptor = new BulkDescriptor();
for (int b = 0; b < 200; b++)
{
    var tomato = GetRandomTomato(i, b);
    bulkDescriptor.Index<Tomato>(op => op.Document(tomato).Index(EsIndexName));
}

// Send all 200 documents to Elasticsearch in one bulk call
var result = ElasticNestClient.Bulk(bulkDescriptor);

Similar issue reported here

I replied to the other issue: Elasticsearch upgrade from 1.7.1 to 2.3.2 then create index very slow

Do you think you can do the same?

Whoa!

I just did that, and it made a night-and-day difference.
The index size is even smaller now in the new ES version (15Mi old vs. 12.8Mi new), and indexing speed is also almost instantaneous, considering I just indexed 2,000 docs.
ES 2.3.2
upgarde-test
size: 12.8Mi (12.8Mi)
docs: 2,000 (2,000)

Before, I was using a separate new data folder for the new version. Now I created the index in the old version (just the index creation, no data), used the same data folder for the new version, then indexed the data using the new ES version.
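Roughly, the workaround steps were as follows (upgarde-test and mapping.txt are from my test app; the URL and paths are just examples):

# 1) On ES 1.7.2, create the empty index with the mapping, no documents
curl -XPUT 'http://localhost:9200/upgarde-test' -d @mapping.txt
# 2) Shut down 1.7.2 and start ES 2.3.2 pointed at the same path.data
# 3) Index the 2,000 docs through the 2.3.2 node as usual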

Will you be doing any fix for this? Should we wait for a fix?

Do you think you can provide me with a reproduction, including data? I think I need to go and debug this myself, but we have at least a smoking gun now.

Will you be doing any fix for this? Should we wait for a fix?

Of course, this is a major regression! I still don't know what's going on, but please bear with me and help me figure it out!

Sure, I can; just let me know in detail what you need.
Thanks!

If you can provide me the code and the data, that would be awesome.

Here is the test app on GitHub, it is a C# Windows console app.
It generates the data and indexes it into ES using the NEST client. Each run will generate/index 2,000 docs; you can set the ES node URL and index name in the App.config (ElasticUpgradeTest.exe.config) file.

There is a compiled exe in the Compiled folder, or you can open the solution in Visual Studio 2015 and run it from there; the app targets .NET Framework 4.5.2.

The mapping is in the mapping.txt file; it is just one type with some geo fields.

I was able to reproduce the issue using this app. You probably won't need my index data files, as they are all generated by the app, but I can provide them too if you need them.

Let me know if you have issues running it.

Some more debug info: I think the geo fields are what is causing the problem.

Assuming the normal scenario (I create the index and then index the data in the new ES version): if I don't set values for my geo fields (simply comment out line 73 in my app's DataGenerator.cs), everything is good (index size is under 1Mi), but if I set values for my geo fields, things go nuts; it only works if I use the workaround you mentioned earlier.

I think this is the same as https://github.com/elastic/elasticsearch/issues/17907

Try changing your geo-shape mapping to specify:

"distance_error_pct": 0.025

See https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-shape.html for the docs.
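Applied to the mapping above, that would look roughly like this (only the geo_shape field is shown; the rest of the mapping stays the same):

"tomatoShape": {
  "type": "geo_shape",
  "tree": "quadtree",
  "precision": "1m",
  "distance_error_pct": 0.025
}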

That fixed my problem, thanks!

I would ask, though, why this behavior was changed from the 1.x version. I don't know the full business/technical case but would like to know.

I would say "distance_error_pct" should never be zero unless it is set explicitly.

Thinking more about it, and considering the size difference and indexing speed, I would say that parameter should never be zero.