Hash matching queries

Hello elasticsearch members,

I currently have documents in my elasticsearch index that have a field containing a
string of the form "hash_1 hash_2 hash_3 ... hash_n". Each document has
roughly 5000 to 6000 hashes in this field, and my local elasticsearch
currently holds about 20000 of these documents. I query elasticsearch
with a string "queryhash_1 queryhash_2 ... queryhash_m", which can contain
anywhere between 200 and 1000 hashes. I was wondering if there are any
configurations that could help me solve this problem more efficiently. I am
currently using query string queries, with whitespace tokenizers and analyzers.
I am also open to ideas on restructuring/reindexing the documents. Thank you!

First, though it effectively makes no big difference, is to use multiple values
for the field, and have it not_analyzed. It's nicer (i.e. "hash" :
["hash_1", "hash_2"]). When searching, you can use a terms query with the list
of terms (and thus there is no need to parse the query_string).
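
Something along these lines should work (the index name "hashes", type name
"doc" and field name "hash" are just examples):

    # not_analyzed mapping for the hash field
    curl -XPUT 'localhost:9200/hashes' -d '{
      "mappings": {
        "doc": {
          "properties": {
            "hash": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }'

    # index a document with the hashes as an array of values
    curl -XPUT 'localhost:9200/hashes/doc/1' -d '{
      "hash": ["hash_1", "hash_2", "hash_3"]
    }'

    # search with a terms query instead of a query_string
    curl -XPOST 'localhost:9200/hashes/doc/_search' -d '{
      "query": {
        "terms": { "hash": ["queryhash_1", "queryhash_2"] }
      }
    }'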


Thank you for responding! By "effectively not a big difference" do you
mean that it will not decrease query time all that much? My plan is to one
day have millions of these documents, and I was wondering if query times
would increase as the number of documents grows. I am really hoping
that query times scale with the query length, and not with the number of
documents. I've read articles online of people achieving indexing rates of
around 1000 documents a second. I am only getting around 20-40 a second; is
this to be expected because of the long hash strings? I've tried setting
the refresh interval to -1 during indexing, but it did not make a
significant difference (perhaps it will when I have millions of documents).
I hope to hear from you again, thanks!
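
(For reference, this is roughly how I am setting the refresh interval;
"hashes" is just my index name:)

    # disable automatic refresh while indexing
    curl -XPUT 'localhost:9200/hashes/_settings' -d '{
      "index": { "refresh_interval": "-1" }
    }'

    # re-enable it afterwards (1s is the default)
    curl -XPUT 'localhost:9200/hashes/_settings' -d '{
      "index": { "refresh_interval": "1s" }
    }'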

For clarification,

Documents are put into ES from JSON dump files, and I expect to receive and
ingest these JSON files into ES on an ongoing basis. New documents do not
invalidate older ones, and each should be treated as a unique document. I
am essentially trying to solve a clustering/categorization/prediction
problem where, given some input, I want to find the documents most similar
to it (think k nearest neighbors). Each hash for a document represents an
attribute of that document, so the more hashes a query has in common with a
document, the more attributes it has in common with that document. What I
want from a query (which ES is doing right now, though a bit slowly) is to
return all the documents above a certain match % (for example, 10% of the
number of hashes in the query) - or, for simplicity, just above some
constant threshold number. The issues I am running into right now are that
indexing is taking a very long time, and querying is also taking quite a
while (anywhere from two to ten seconds). Any suggestions would be
appreciated, thank you.
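
To make the desired behaviour concrete, I imagine something along these
lines, with one should clause per query hash and a minimum-match threshold
(I don't know if this is the right approach performance-wise, and I believe
"minimum_number_should_match" is spelled "minimum_should_match" in newer
versions):

    # return documents matching at least 2 of the query hashes
    curl -XPOST 'localhost:9200/hashes/doc/_search' -d '{
      "query": {
        "bool": {
          "should": [
            { "term": { "hash": "queryhash_1" } },
            { "term": { "hash": "queryhash_2" } },
            { "term": { "hash": "queryhash_3" } }
          ],
          "minimum_number_should_match": 2
        }
      }
    }'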

Hi Alex

We discussed this at length on IRC, but then you timed out.

Try this query: gist:2a127562eb0808408097

Your query needs are quite different from normal full text search, so
you will need to tune your mapping accordingly.

You might get some gain from indexing your 'hashes' as integers rather
than as text. Also, you probably want to tweak things like disabling
norms, disabling term frequencies, etc., and you may want to use span
queries instead of normal term or text queries.
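
As a rough sketch of the mapping side (index, type and field names are
placeholders, and the exact option names depend on your version; newer
releases spell these "norms" and "index_options"), something like:

    # drop norms and term frequencies/positions for the hash field
    curl -XPUT 'localhost:9200/hashes/doc/_mapping' -d '{
      "doc": {
        "properties": {
          "hash": {
            "type": "string",
            "index": "not_analyzed",
            "omit_norms": true,
            "omit_term_freq_and_positions": true
          }
        }
      }
    }'

If your hashes fit into an integer type, you would use "type": "integer"
there instead.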

I'd suggest buying a book like Lucene in Action to understand more about
what is happening under the covers. I think that's the only way you're
going to get really good results with the type of queries you want to
run.

clint
