My Elasticsearch documents currently have a field containing a string of the
form "hash_1 hash_2 hash_3 ... hash_n". Each document has roughly 5000 to
6000 hashes in this field, and my local Elasticsearch currently holds about
20000 of these documents. I query Elasticsearch with a string
"queryhash_1 queryhash_2 ... queryhash_m", which can be anywhere from 200 to
1000 hashes. I was wondering if there are any configurations that could help
me solve my problem more efficiently. I am currently using query_string
queries with whitespace tokenizers and analyzers. I am also open to ideas on
restructuring/reindexing the documents. Thank you!
First, though effectively not a big difference, is to use multiple values
for the field and have it not_analyzed; it's nicer (i.e. "hash" :
["hash_1", "hash_2"]). When searching, you can then use a terms query with
the list of terms (and thus no need to parse the query_string).
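To make that suggestion concrete, here is a rough sketch (the index, type,
and field names are illustrative, and the syntax follows the 0.x-era REST
API, so check it against your version):

```shell
# Map the hash field as a non-analyzed string. Arrays need no special
# mapping -- any field can hold multiple values.
curl -XPUT 'localhost:9200/docs' -d '{
  "mappings": {
    "doc": {
      "properties": {
        "hash": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'

# Index a document with the hashes as an array rather than one long string.
curl -XPUT 'localhost:9200/docs/doc/1' -d '{
  "hash": ["hash_1", "hash_2", "hash_3"]
}'

# Search with a terms query -- no query_string parsing involved.
curl -XPOST 'localhost:9200/docs/doc/_search' -d '{
  "query": {
    "terms": { "hash": ["queryhash_1", "queryhash_2"] }
  }
}'
```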
Thank you for responding! By "effectively not a big difference", do you mean
that it will not decrease query time all that much? My plan is to one day
have millions of these documents, and I was wondering whether query times
would grow with the number of documents. I am really hoping that query times
scale with the query length, not with the number of documents. I've read
articles online about people achieving indexing rates of around 1000
documents per second, but I am only getting around 20-40 per second; is this
to be expected because of the long hash strings? I've tried setting the
refresh interval to -1 during indexing, but it did not make a significant
difference (perhaps it will when I have millions of documents). I hope to
hear from you again, thanks!
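Beyond refresh_interval, the usual lever for indexing throughput is the bulk
API, which batches many documents into one request. A hedged sketch,
assuming a local node and the same illustrative index names as above:

```shell
# Disable refresh while loading.
curl -XPUT 'localhost:9200/docs/_settings' -d '{
  "index": { "refresh_interval": "-1" }
}'

# The bulk API takes newline-delimited action/document pairs; sending a few
# hundred to a few thousand documents per request is typically much faster
# than one indexing request per document.
curl -XPOST 'localhost:9200/docs/doc/_bulk' -d '
{"index": {"_id": "1"}}
{"hash": ["hash_1", "hash_2"]}
{"index": {"_id": "2"}}
{"hash": ["hash_2", "hash_3"]}
'

# Restore refresh when the load is done.
curl -XPUT 'localhost:9200/docs/_settings' -d '{
  "index": { "refresh_interval": "1s" }
}'
```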
Documents are put into ES from JSON dump files, and I expect to receive and
ingest these JSON files consistently. New documents do not invalidate older
ones; each should be treated as a unique document. I am essentially trying
to solve a clustering/categorization/prediction problem: given some input, I
want to find the documents most similar to it (think k nearest neighbors).
Each hash in a document represents an attribute of that document, so the
more hashes a query has in common with a document, the more attributes it
shares with that document. What I want from a query (which ES is doing right
now, though a bit slowly) is to return all documents above a certain match
percentage (for example, 10% of the number of hashes in the query), or, for
simplicity, just above some constant threshold. The issues I am running into
right now are that indexing is taking a very long time and querying is also
taking quite a while (anywhere from two to ten seconds). Any suggestions
would be appreciated, thank you.
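The "above some threshold" requirement maps fairly directly onto the terms
query's minimum-match parameter (called minimum_match in older releases,
minimum_should_match later). A sketch with illustrative names; for a
percentage threshold, older versions only accept an integer, so the client
would compute it from the query length:

```shell
# e.g. for a 300-hash query and a 10% threshold, require at least 30
# matching terms. The term list below is abbreviated for illustration.
curl -XPOST 'localhost:9200/docs/doc/_search' -d '{
  "query": {
    "terms": {
      "hash": ["queryhash_1", "queryhash_2", "queryhash_3"],
      "minimum_match": 30
    }
  }
}'
```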
Your query needs are quite different from normal full text search, so
you will need to tune your mapping accordingly.
You might get some gain from indexing your 'hashes' as integers rather
than as text. Also, you probably want to tweak things like disabling
norms, disabling term frequencies, etc., and you may want to use span
queries instead of normal term or text queries.
I'd suggest buying a book like Lucene in Action to understand more about
what is happening under the covers. I think that's the only way you're
going to get really good results with the type of queries you want to
run.
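In mapping terms, those tweaks look roughly like this (field and index names
are illustrative; omit_norms dates from the 0.x mappings, and the exact
option for dropping term frequencies varies by version --
omit_term_freq_and_positions in older releases, index_options: "docs" in
later ones):

```shell
# Sketch of a mapping with scoring-related features disabled for the
# hash field, since relevance ranking here is just overlap counting.
curl -XPUT 'localhost:9200/docs' -d '{
  "mappings": {
    "doc": {
      "properties": {
        "hash": {
          "type": "string",
          "index": "not_analyzed",
          "omit_norms": true,
          "omit_term_freq_and_positions": true
        }
      }
    }
  }
}'
```

If the hashes are numeric, mapping the field as type integer or long would
also follow the first suggestion above.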
clint
On Mon, 2012-06-11 at 06:51 -0700, Alex wrote: