I'm really enjoying using Elasticsearch, as it seems built to deal with
every common search problem I routinely have at work, and the documentation
is great.
One thing I feel like I'm not fully understanding is _id. I'm making it a
hash of the most important parts of my documents that should not be
changing over time. Should keeping the value of _id pretty compact be a
goal? How will resource usage per million documents differ if the _id is 16 characters vs. 64 characters? Each document is around one or two kilobytes and is composed of a mix of text fields and enum-style fields, with a couple of doubles thrown in.
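To make that concrete, here's roughly the kind of thing I'm doing to derive the _id (just a sketch with made-up field names, not our actual code):

```python
import hashlib
import json

# A made-up document, just to illustrate the shape of our data.
doc = {
    "title": "Example product",        # text field
    "category": "widgets",             # enum-style field
    "price": 19.99,                    # double
    "description": "Longer free-text description goes here...",
}

# Hash only the parts that should never change, so the same document
# always maps to the same _id when we re-index it.
stable_key = "|".join([doc["title"], doc["category"]])
doc_id = hashlib.sha256(stable_key.encode("utf-8")).hexdigest()[:16]

# The document is then indexed with an explicit _id, e.g.
#   PUT /myindex/mytype/<doc_id>   with json.dumps(doc) as the body.
print(doc_id, json.dumps(doc))
```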
Hello Jim,
I'm not sure if I got your question right, but since _id is unique anyway, the only downside I see to using bigger IDs is that your documents get bigger, which implies more stress on disk and memory. But then, if your docs are around 1-2 KB each, I don't think 16 vs. 64 characters will make any significant difference.
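As a rough back-of-the-envelope calculation (assuming about one byte per _id character, and ignoring compression and whatever extra copies the index structures keep), the ID overhead per million documents looks like this:

```python
# Extra bytes spent on IDs per million documents, compared to ~1-2 KB docs.
docs = 1000000
doc_size = 1500                      # bytes, somewhere between 1 and 2 KB
for id_len in (16, 64):
    id_bytes = docs * id_len
    total_bytes = docs * doc_size
    print("%2d-char IDs: %5.1f MB of IDs vs %.1f GB of documents"
          % (id_len, id_bytes / 1e6, total_bytes / 1e9))
```

So the difference between 16 and 64 characters is on the order of tens of megabytes per million documents, next to gigabytes of document data.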
I'll try to clarify. Originally (back in June), I used 32-character strings
for my _id values. I saw a few comments scattered through the ES docs about
certain operations loading all of the _id values into memory. The _id
values propagate out of Elasticsearch into our website and into URLs, and
it's advantageous for us to keep HTML (filled with links referencing ES
documents) more terse. Also, I noticed with Elasticsearch Head that my
indices saved a bit of space (one or two GB out of 50 GB) by making the _id
values shorter. Now, I'm using 15-character strings for my _id values.
Without understanding how things are implemented, I don't know if the _id
value for a document is copied into a few index structures, or if there is
a very compact internal identifier used when that's needed. Now a colleague
is asking me how _id works and it's one of the areas I feel vague about.
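For what it's worth, the shortening itself is nothing exotic; it's roughly along these lines (a sketch, not our exact code):

```python
import hashlib
import string

ALPHABET = string.digits + string.ascii_letters   # 62 URL-safe characters

def short_id(stable_key, length=15):
    # Turn the hash into a big integer, then re-encode it in base 62 and
    # keep only the first `length` characters.
    n = int(hashlib.sha1(stable_key.encode("utf-8")).hexdigest(), 16)
    chars = []
    while n and len(chars) < length:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(chars)

print(short_id("Example product|widgets"))   # 15 URL-safe characters
```

Truncating the hash obviously raises the collision risk a little, but 15 base-62 characters is still about 89 bits, which should be plenty for our document counts.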
The whole reason I'm curious and thinking about it is that it's pretty well documented that keeping your InnoDB primary keys short is good for memory usage (the primary key gets copied into every secondary index). Coming from a MySQL background, I don't want to make the same mistakes all over again.
Thanks for your help so far, and I hope that clears up my question.
Jim
I think I see what you're after. Thanks for clearing it up!
Unfortunately, I don't know all the details of how _id is used for all ES
operations. Maybe someone else can jump in and explain, but I still think
it's unrealistic to try and cover all possibilities.
Loading IDs into memory is typically done for parent/child queries, like the Has Child or Top Children queries. So if you use those, keeping your _id values small would help your memory usage.
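For example, a Has Child query looks roughly like this (the index, type, and field names are made up, and I'm assuming a local node on the default port); to answer it, Elasticsearch has to keep the parent IDs in memory, which is where compact _id values would pay off:

```python
import json
import requests   # assumes an Elasticsearch node on localhost:9200

# Made-up parent/child setup: "comment" documents are children of "post".
query = {
    "query": {
        "has_child": {
            "type": "comment",                       # child type
            "query": {"term": {"author": "jim"}}     # matches child docs
        }
    }
}

# Returns the parent "post" documents that have at least one matching child.
resp = requests.post("http://localhost:9200/myindex/post/_search",
                     data=json.dumps(query))
print(resp.json())
```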
Besides that, I would normally not worry about it. But if you can provide some more details, like what your data and queries look like, I could share my opinion.
Opinions aside, the safest way to go is to test the difference yourself. You could run performance tests on your data with your queries: one run with 16-char IDs and one with 64-char IDs. Monitor your cluster while doing that and you can see what the differences are in your particular use case.
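A very simplistic sketch of such a test (in practice you would use the bulk API and your real documents and queries; the node address and index names here are just placeholders):

```python
import json
import time
import requests

ES = "http://localhost:9200"

def index_docs(index, id_len, n=10000):
    """Index n tiny test docs with IDs padded to id_len characters, and time it."""
    start = time.time()
    for i in range(n):
        doc_id = str(i).rjust(id_len, "0")            # e.g. '0000000000000042'
        doc = {"field": "value %d" % i}
        requests.put("%s/%s/doc/%s" % (ES, index, doc_id),
                     data=json.dumps(doc))
    return time.time() - start

for id_len in (16, 64):
    took = index_docs("idtest-%d" % id_len, id_len)
    print("%d-char IDs: indexed in %.1f s" % (id_len, took))
```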
If you need a good monitoring tool for ES, I would recommend our SPM: