Different Index sizes for same data


(Lavesh Gupta) #1

Hi Everyone,

I am using the following configuration:

2 nodes, 4 shards, 0 replicas

I am indexing 50,000 (50K) files, about 6 GB in total, using pyelasticsearch.

I increased the number of indexing threads from 1 to 8, and each run
produced an index of a different size.

Threads   Indexing time (s)   Index size, node 1 (GB)   Index size, node 2 (GB)
1         4069.559            3.50                      3.22
2         2236.544            4.61                      4.54
4         1990.098            5.45                      5.31
8         1965.987            2.94                      2.96
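To compare the runs, it may help to total the per-node sizes and relate them to the roughly 6 GB of input. This is a small sketch that uses only the measurements listed above; the names `runs`, `INPUT_GB` and `summarize` are illustrative, and treating 1 GB as 1024 MB is an assumption.

```python
# Measurements copied from the post: threads -> (seconds, node 1 GB, node 2 GB)
runs = {
    1: (4069.559, 3.50, 3.22),
    2: (2236.544, 4.61, 4.54),
    4: (1990.098, 5.45, 5.31),
    8: (1965.987, 2.94, 2.96),
}
INPUT_GB = 6.0  # approximate size of the 50,000 source files


def summarize(runs, input_gb=INPUT_GB):
    """Return {threads: (total index GB, index/input ratio, input MB/s)}."""
    summary = {}
    for threads, (seconds, node1_gb, node2_gb) in sorted(runs.items()):
        total_gb = node1_gb + node2_gb        # both nodes hold primaries only
        ratio = total_gb / input_gb           # index size relative to raw input
        mb_per_s = input_gb * 1024 / seconds  # assumes 1 GB = 1024 MB
        summary[threads] = (total_gb, ratio, mb_per_s)
    return summary


if __name__ == "__main__":
    for threads, (total_gb, ratio, mb_per_s) in summarize(runs).items():
        print(f"{threads} thread(s): {total_gb:.2f} GB total, "
              f"{ratio:.2f}x input, {mb_per_s:.2f} MB/s")
```

Summed over both nodes, the index ranges from 5.90 GB (8 threads) to 10.76 GB (4 threads), so the spread is far larger than the difference between the two nodes within any single run.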

The mapping I am using is:
dtype: {
    "_source": {"enabled": False},
    "_all": {"enabled": False},
    "properties": {
        "filecontent": {"type": "string", "store": False},
        "filename": {"type": "string", "index": "not_analyzed", "store": True},
        "filepath": {"type": "string", "index": "not_analyzed", "store": True},
        "filetype": {"type": "string", "index": "not_analyzed", "store": True},
        "tokens": {"type": "string", "store": True},
        "rules": {"type": "string", "store": True}
    }
}

The "filecontent" field holds the text extracted from each file with Tika.
The "tokens" field stores values that I pull out of that text with my own
regexes, and the "rules" field is populated based on those values.

My question: why does the index size differ when the only thing I change
is the number of threads sending indexing requests?
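For reference, the kind of threaded setup described here can be sketched roughly as follows. This is not the actual indexing script: `index_in_parallel`, `chunked`, the batch size and the `send_batch` callable are hypothetical, and `send_batch` merely stands in for whatever bulk call the client library provides.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def index_in_parallel(docs, num_threads, send_batch, batch_size=500):
    """Send `docs` in batches from `num_threads` worker threads.

    `send_batch` is a hypothetical callable standing in for the real bulk
    indexing call; it takes a list of documents and returns how many it sent.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        counts = pool.map(send_batch, chunked(docs, batch_size))
        return sum(counts)  # consumed inside the `with`, so all batches finish


if __name__ == "__main__":
    docs = [{"filename": f"file{i}.txt", "filecontent": "..."}
            for i in range(2000)]
    # `len` is a stand-in for the real client call; it only counts documents.
    print(index_in_parallel(docs, num_threads=4, send_batch=len))
```

With more threads, batches reach the shards in a different order and at a different rate, so the segments written to disk differ between runs even though the documents are the same.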

Please note: after indexing completes, I let ES cool down so that segment
merging can finish.

Please let me know what causes the discrepancy in index size.

Thanks,
Lavesh

--

This message contains confidential information and is intended only for the
individual to whom it is addressed. If you are not the intended recipient,
you should not disseminate, distribute or copy this e-mail. Please notify
the sender immediately by e-mail if you have received this e-mail by
mistake and permanently delete this e-mail from your system. E-mail
transmission cannot be guaranteed to be secure or error-free as information
could be intercepted, corrupted, lost, destroyed, late or incomplete, or
could contain viruses. The sender therefore does not accept liability for
any errors or omissions in the contents of this message, which arise as a
result of e-mail transmission. If verification is required, please request
a hard-copy version from the sender. Druva, www.druva.com

--
Please update your bookmarks! We have moved to https://discuss.elastic.co/

You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/122a8818-5217-41f2-ab65-316191f1aa7e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Nik Everett) #2

It's hard to tell exactly what you did, so there could be lots of reasons for
the differences. One of them is that indexing isn't deterministic: merges are
triggered based on the on-disk shape of the segments and on the number of
other merges already running, and they typically run asynchronously from your
index commands.

If you run optimize after the inserts, you should get very close numbers. In
production, optimize is only a good idea once you are done indexing or
updating; otherwise it causes trouble later on.
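A minimal sketch of that check, assuming an Elasticsearch 1.x node at `http://localhost:9200` and a hypothetical index named `files`: it calls the `_optimize` endpoint (renamed `_forcemerge` in later versions) and then reads the primary store size from the index stats API. The helper names are illustrative, and the response path used in `optimize_and_measure` should be verified against your version.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed node address
INDEX = "files"                   # hypothetical index name


def optimize_url(base, index, max_num_segments=1):
    """URL for the Elasticsearch 1.x optimize API."""
    return f"{base}/{index}/_optimize?max_num_segments={max_num_segments}"


def store_stats_url(base, index):
    """URL for the index stats API, limited to the store (on-disk size) metric."""
    return f"{base}/{index}/_stats/store"


def call(url, method="GET"):
    """Send a request (empty body for POST) and decode the JSON response."""
    data = b"" if method == "POST" else None
    request = urllib.request.Request(url, data=data, method=method)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def optimize_and_measure(base=ES_URL, index=INDEX):
    """Merge each shard down to one segment, then return primary size in bytes."""
    call(optimize_url(base, index), method="POST")
    stats = call(store_stats_url(base, index))
    return stats["indices"][index]["primaries"]["store"]["size_in_bytes"]
```

Calling `optimize_and_measure()` after each indexing run, once all inserts are finished, should give sizes that are directly comparable across thread counts.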

Nik
On May 19, 2015 7:04 AM, "Lavesh Gupta" lavesh.gupta@druva.com wrote:



