Upgrade from Elasticsearch 1.3.2 to 2.3.1 and more space for the indexes

Hi All

I am seeing double the space after migrating data for one index from ES 1.3.2 to 2.3.1, even with doc_values set to false for our not_analyzed fields.

The index size was 772 GB in 1.3.2 and it became 1.3 TB in 2.3.1.
I see the norms (.nvd) files are much bigger now.

Any idea if this is expected? If so, we need to factor in more capacity for the new version.
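
For reference, the not_analyzed fields I mean are mapped roughly like this (the field name is just an example):

"some_field": {
  "type": "string",
  "index": "not_analyzed",
  "doc_values": false
}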

Regards
Sai

Did you run a force merge after you had finished indexing?

Thanks for the reply Mark

The size I mentioned is from just after the import via Knapsack was done.
I just finished running a force merge, and the size has not reduced; rather, it has increased to 1.8 TB.

I will dig further. Please let me know if you have seen this kind of issue before.

Regards
Sai

Hey, did you flush and refresh after the force merge? Unless you refresh, the space used by segments that existed prior to the force merge won't be released, which will likely result in higher disk usage. Also, can you tell us how you measure/get the size? Is it an API call on ES (if so, please post the result here)? If not, please show the output of your command.
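
For example, something along these lines (the index name is just a placeholder) would flush, refresh, and then report the store size from the stats API:

curl -XPOST 'http://localhost:9200/your_index/_flush'
curl -XPOST 'http://localhost:9200/your_index/_refresh'
curl -XGET 'http://localhost:9200/your_index/_stats/store?pretty'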

Thanks, Simon, for your time.
I am doing a du -h on the Elasticsearch data directory in env1 (1.3.2) and env2, the new cluster (2.3.1); below is an example for shard 17.

on env1 1.3.2 version

32G /data/elasticsearch/CDQARTH-ES-PROD/nodes/0/indices/products_t0_v9/17/index

on env2 2.3.1 version

97G ./data/elasticsearch/xxx.prod.xxx.com-data2/CDQARTH-ES-PROD/nodes/0/indices/products_t0_v9/17/index

Here is one .nvd file, which is 63 GB:

-rw-r--r--. 1 elasticsearch elasticsearch 63G May 6 22:27 _2nn.nvd

Yes, refresh is auto-enabled every 15 mins.
I also refreshed manually, but did not flush, after the force_merge to 1 segment:

curl -XPOST 'http://search01.prod.elasticsearch.xxx.xxx.com/products_t0_v9/_forcemerge?max_num_segments=1'

Regards
Sai

I have the same issue, coming from 1.7.2 to 2.3.1.
Space usage has gone very high in the new upgraded environment. I am disabling the "_all" field for all the types in the index, even though I wasn't doing that in the old version.
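
(Roughly like this when creating the index; the index and type names are just placeholders:)

curl -XPUT 'http://localhost:9200/test_index' -d '{
  "mappings": {
    "my_type": {
      "_all": { "enabled": false }
    }
  }
}'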

In my case I think the space has gone up more than 10-fold. I have geo_point and geo_shape fields, for what it's worth, which was also the case in the old version.

In the old index/version the space usage was about 14 KB per doc.
In the new index/version the space usage is about 1 MB per doc, even though I disabled indexing for a lot of fields that were indexed in the old version.

I created a test set, old version vs. new version, same mapping and same document set; below are the stats coming from the head plugin.

ES 2.3.1
test-local-01
size: 8.43Gi (8.43Gi)
docs: 5,029 (6,762)

ES 1.7.2
test-local-01
size: 34.8Mi (34.8Mi)
docs: 5,029 (5,172)

I will hold off on the upgrade till this is fixed.

Can you post/link to your mappings and a data sample?

Sure. It is also worth mentioning that indexing is prohibitively slower in the new version.
In the old version it would take no more than 2 minutes to index those 5,029 documents; in the new version it took a few hours. We have about 8 million docs in our production index, so it is impossible for us to upgrade at the moment.

Here is our mapping gist. How do you want the data? One sample doc for each type?

Thanks for your help.

That's not right at all.
I know Simon has commented here, so let us take a look and get back to you.

We seem to be having the same issue.

Importing data from Couchbase 4.0 Community into Elasticsearch 2.3.1, the document size seems to have increased nearly 4-fold, and the size multiple grew the more it imported.

We are currently running Couchbase 3.0 Community importing into Elasticsearch 1.5.2. Our documents are typically very small (between 100 and 150 bytes per doc). On ES 1.5.2 the average doc size is about 150 B. On ES 2.3.1, after only 2.5M docs imported, the average size per doc was 500 B.
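
For reference, those averages are just the store size divided by the doc count, which we read from the stats API along these lines (the index name is a placeholder):

curl -XGET 'http://localhost:9200/your_index/_stats/docs,store?pretty'
# store.size_in_bytes / docs.count gives the average bytes per doc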

Also, import speed has slowed down significantly, but I'm not sure where the import-speed issue resides.

Out of interest, for anyone running into the same issues, we also have an issue logged on the elasticsearch-transport-couchbase GitHub page:

Hi All
My apologies for the late reply.
Just to follow up with my observations:

  1. ES 1.3.2 index size was 877 GB.
  2. ES 2.3.1 index size just after the import became 1.13 TB (which may be because doc_values now defaults to true: I have 2 fields with a specific mapping for which I did not set doc_values to false, so this is explainable; all other not_analyzed fields are set to false, and _all is disabled).

But after this import I ran force_merge (_forcemerge?max_num_segments=1) and the size increased to 1.78 TB, which clearly means the older segments (all the files related to those segments) are not getting deleted. So now this index size remains at 1.78 TB.

As I said earlier, it's the .nvd file that takes the most space in a given shard.
I see only one segments_2g file in each shard.

Regards
Sai

Apparently it's really hard to figure out what is going on without any way to reproduce it. Can somebody try to provide a reproduction that we can try on our end without downloading tons of data?

Also, can somebody provide segment stats for this index: curl -XGET 'http://localhost:9200/_segments?verbose=true'

Hi Simon

Please find the segment details in the gists below:

1.3.2 segments

2.3.1 segments

Hey, thanks for the stats. I wonder if you can provide the output with ?verbose=true, since I wanted to see the lower-level details? I also wonder why all these shards have more than one segment if you ran a force merge with commit=true&wait_for_completion=true.

There are also differences between the number of segments and how many segments are committed. This indicates you didn't refresh after the force merge? Old segment files will be deleted once they go out of scope, so there might be open readers holding on to the segments.

While looking at it, I realized that the distribution of documents might be totally different, since we changed the hash function; that doesn't allow a shard-by-shard comparison, so it's possible that shard X is much bigger in 2.x than in 1.x just due to a different distribution. I think what we need is more insight into what takes the space. Can you run force_merge again, then flush AND refresh, so we can really tell what the differences are? Also, please get us the indices stats too: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html
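
Something along these lines should do it (the index name is a placeholder):

curl -XPOST 'http://localhost:9200/your_index/_forcemerge?max_num_segments=1'
curl -XPOST 'http://localhost:9200/your_index/_flush'
curl -XPOST 'http://localhost:9200/your_index/_refresh'
curl -XGET 'http://localhost:9200/your_index/_stats?pretty'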

So 2 minutes for 5,029 docs is already nuts - can you explain how you index your data? I.e., I am interested in things like:

  • bulk vs. doc-at-a-time indexing (see the sketch after this list)
  • index settings
  • do you call any APIs per indexing request
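
By bulk indexing I mean sending batches through the _bulk API rather than one document per request, roughly like this (index/type names and the document are just placeholders):

curl -XPOST 'http://localhost:9200/your_index/_bulk' -d '{ "index": { "_type": "your_type", "_id": "1" } }
{ "field": "value" }
'

versus doc-at-a-time:

curl -XPOST 'http://localhost:9200/your_index/your_type/1' -d '{ "field": "value" }'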

I can see some geo fields in your mapping; do you have complex polygons that you are indexing? Can you provide some data that takes this long, including the mapping etc., so we can try to reproduce?

Simon
We are using the Couchbase transport plugin; it takes docs from Couchbase and indexes them into Elasticsearch.

I am guessing the plugin uses bulk indexing, as things get slightly better when increasing threadpool.bulk.queue_size (still so slow that it is not usable).
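
(For reference, we bumped it in elasticsearch.yml along these lines; the value shown is just what we happened to try:)

threadpool.bulk.queue_size: 500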

The CB plugin does all the indexing; I never index directly into ES. We create/update in CB, then the plugin uses XDCR replication to replicate into ES.

Our shapes are all circles, nothing complex, and there are no specific index settings apart from the mapping. The difference is stark, as I have the same mapping running in the old and new versions (note that there is a different CB transport plugin version for each ES version).

The easiest way I can show you this is to have an online meeting or something to show the real deal on our test machines.

dom,

I need you to take variables out of the picture. Can you try to manually index stuff into ES without the Couchbase plugin? It's crucial for me to figure out what is going on. Can you also paste your index settings, please?
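
Even something as minimal as this would help isolate it, and the settings can be grabbed the same way (index/type names and the document are just placeholders):

curl -XPOST 'http://localhost:9200/test_index/test_type' -d '{ "name": "test doc" }'
curl -XGET 'http://localhost:9200/your_index/_settings?pretty'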