Hi everyone,
I am using the following configuration:
2 nodes, 4 shards, 0 replicas
I am currently indexing 50,000 (50K) files, amounting to 6 GB in total,
using pyelasticsearch.
For indexing, I increase the number of threads from 1 to 8, and on each
run I end up with an index of a different size:
Num Threads | Indexing time (s) | Index size on node 1 (GB) | Index size on node 2 (GB)
1           | 4069.559          | 3.50                      | 3.22
2           | 2236.544          | 4.61                      | 4.54
4           | 1990.098          | 5.45                      | 5.31
8           | 1965.987          | 2.94                      | 2.96
The mapping I am using is:
dtype: {
    "_source": {"enabled": False},
    "_all": {"enabled": False},
    "properties": {
        "filecontent": {"type": "string", "store": False},
        "filename": {"type": "string", "index": "not_analyzed",
                     "store": True},
        "filepath": {"type": "string", "index": "not_analyzed",
                     "store": True},
        "filetype": {"type": "string", "index": "not_analyzed",
                     "store": True},
        "tokens": {"type": "string", "store": True},
        "rules": {"type": "string", "store": True}
    }
}
In the field "filecontent" I pass the text of the file as extracted by
Tika. In the field "tokens" I store values that I pull out of that text
with my regexes, and based on those values I populate the field "rules".
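A minimal sketch of this kind of threaded driver, not the exact code used here. It assumes documents are sent in batches by a pool of worker threads and that the number of documents sent is counted, so that each run can be compared against the document count the index reports. `index_batch` is a hypothetical placeholder for whatever call actually sends a batch to Elasticsearch; `fake_index_batch`, `chunked`, `index_with_threads` and the batch size are likewise assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def index_with_threads(docs, index_batch, num_threads, batch_size=500):
    """Send `docs` in batches from `num_threads` workers.

    `index_batch` must accept a list of documents and return how many
    of them it sent. Returns the total number of documents sent.
    """
    batches = list(chunked(docs, batch_size))
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        sent_counts = list(pool.map(index_batch, batches))
    return sum(sent_counts)


def fake_index_batch(batch):
    # Stand-in for the real indexing call: it only counts documents.
    return len(batch)


docs = [{"filename": "file%d.txt" % i} for i in range(50)]
total_sent = index_with_threads(docs, fake_index_batch, num_threads=4, batch_size=8)
print(total_sent)
```

If the total sent is the same for every thread count, the number of indexed documents is unlikely to explain the size differences, which leaves on-disk state such as segment layout as the more probable cause.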
My question is: why does the size of the resulting index differ when the
only thing I change is the number of threads sending indexing requests?
Please note: after indexing has completed, I let ES sit idle for a while
so that segment merging can finish before I measure the index size.
Thanks,
Lavesh