Advice on indexing MD5 or similar kinds of data

First of all, I am new to the Elasticsearch community and would like to congratulate everybody involved in what seems like a very promising project, and especially kimchy of course :slight_smile:

Now straight to the point:

Let's say I want to index an MD5 hash in an "md5" field of my document and make it searchable, with a mapping like this:
"md5" : {"index" : "not_analyzed" , "type" : "string" , "store" : "yes"}.

My md5 field can take 2^128 different values.

Is it a problem for Elasticsearch/Lucene to handle dictionaries with such diversity, up to 2^128 distinct terms in the worst case?

Would it be a better approach to use a specific MD5 tokenizer that splits my MD5 input into, for example, 8 blocks of 4 characters, each with a distinct prefix, and to use the same tokenizer at index time and query time?

f71dbe52628a3f83a77ab494817525c6 would become: af71d bbe52 c628a etc...

With this approach the number of distinct terms for MD5s in the dictionary would be limited to 524,288 (8 positions × 16^4 = 8 × 65,536 possible prefixed chunks), and I guess the search performance wouldn't be hurt too much by a conjunctive query over the 8 blocks.
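A minimal sketch of that splitting scheme (my own illustration, not code from this thread; the a-h prefixes are just one possible choice):

def split_md5(md5_hex, chunk_length=4, prefixes="abcdefgh"):
    # Split the 32-character hex digest into 8 chunks of 4 characters,
    # prefixing each chunk with a letter that encodes its position.
    return [prefixes[i] + md5_hex[i * chunk_length:(i + 1) * chunk_length]
            for i in range(len(md5_hex) // chunk_length)]

print(split_md5("f71dbe52628a3f83a77ab494817525c6"))
# ['af71d', 'bbe52', 'c628a', 'd3f83', 'ea77a', 'fb494', 'g8175', 'h25c6']

# Worst-case dictionary size: 8 positions x 16^4 hex chunks per position.
print(8 * 16 ** 4)  # 524288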

Is it a good idea? Am I missing something here?

Hi Benjamin,

I've done some tests, and apparently there is no reason to split values.
Here is my test setup:

  • Same computer (4 cores @ 2.2 GHz, 6 GB of RAM), two instances of Elasticsearch running.
  • 2 indices, 1 shard, 0 replicas, _source and _all disabled, one string field, not_analyzed, not stored.
    For the first index (raw), the field contains the MD5 hash directly.
    For the second index (splitted), the field contains a collection of eight 5-character values: the eight 4-character chunks, prefixed by A, B, C, D, E, F, G or H to identify their position.
    Each index was persisted using NIOFS and was managed by one instance of Elasticsearch (as I was testing and re-testing, the indices weren't always on the same node).
  • I indexed 50,000,000 documents whose id is a number from 1 to 50,000,000 inclusive, and whose md5 field is the hash of "${id}\n".
  • I ran 10,000 search queries with random MD5-hashed ids within the range 1-10,000 and 10,000 queries within 49,990,000-50,000,000 (in order to test whether there is any difference), all directed to the master node.
    Queries to the raw index were: q=md5:68b329da9893e34099c7d8ad5cb9c940 (for id 1).
    Queries to the splitted index were: q=md5:A68b3 +md5:B29da +md5:C9893 +md5:De340 +md5:E99c7 +md5:Fd8ad +md5:G5cb9 +md5:Hc940 (URL-encoded, of course; see the sketch just after this list for how these are built).
    I always check that the number of hits equals 1.
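A small sketch of how such query strings can be generated (my own illustration; the author's actual script output is attached as md5hash.sh.out, and the helper name here is hypothetical):

def queries_for(md5_hex, chunk_length=4, prefixes="ABCDEFGH"):
    # Raw index: a single term against the not_analyzed field.
    raw = "q=md5:" + md5_hex
    # Splitted index: a conjunctive query, one clause per prefixed chunk
    # (URL encoding omitted for readability).
    chunks = [prefixes[i] + md5_hex[i * chunk_length:(i + 1) * chunk_length]
              for i in range(len(md5_hex) // chunk_length)]
    splitted = "q=md5:" + chunks[0] + " " + " ".join("+md5:" + c for c in chunks[1:])
    return raw, splitted

raw, splitted = queries_for("68b329da9893e34099c7d8ad5cb9c940")
print(raw)       # q=md5:68b329da9893e34099c7d8ad5cb9c940
print(splitted)  # q=md5:A68b3 +md5:B29da +md5:C9893 +md5:De340 +md5:E99c7 +md5:Fd8ad +md5:G5cb9 +md5:Hc940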

Here are the results. You can check out the full log attached.
md5hash.sh.out

  • Indexing is much faster (110%) for the raw index, but with previous tests (fewer docs) the difference didn't seem that significant.
  • Index sizes are almost the same, but with previous tests (1,000,000 docs) the splitted index was a bit bigger (say 30%).
    I'll test with 100,000,000 docs to see how the index size evolves.
  • Querying is faster (2%) against the raw index.
  • Querying is a bit faster (2%) against the last inserted docs.

Opening the index with Luke gives me the following information:

  • There is no MD5 collision in the raw index (fortunately).
  • There are 524,288 (= 8 × 2^16) distinct indexed values in the splitted index (all possible chunks have been indexed).
    They seem to follow a normal distribution, with 76.2939 collisions on average and a variance of 8.7341² = 76.2848.
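(As a sanity check, not from the original post: with N documents and 8 chunks per document spread over 524,288 possible chunk values, the expected average is 8N / 524,288 = N / 65,536. For the corrected count of 5,000,000 documents given below, that is about 76.29, matching the observed 76.2939, and a variance roughly equal to the mean is what an approximately Poisson-distributed count would give.)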

Best regards,
Olivier

The original question is still not approved.
You can read it at: http://elasticsearch-users.115913.n3.nabble.com/Advices-indexing-MD5-or-same-kind-of-data-td2867646.html

In my previous post, you should have read 5,000,000 instead of 50,000,000. Here are the results for 10,000,000, 20,000,000 and 50,000,000:

  • md5hash.sh.out.10000000
  • md5hash.sh.out.20000000
  • md5hash.sh.out.50000000

For 50,000,000 indexed documents, I get a mean of 762.9395 collisions and a variance of 27.6190² = 762.8097 for the splitted index.
All my conclusions were biased because I didn't read the times correctly (real should be used; I had added real+user+sys):

  • Requests on the splitted index are 13% slower,
  • Requests on the splitted index and the raw index otherwise have quite the same performance; raw queries can be faster.

The size gap between the two indices seems to be about 3.5-4.0 MB per 10^6 documents indexed, in favor of the splitted index.
As a reminder, the two indices were almost equal in size at 5,000,000 indexed documents.

Can you try to register with the Google group and post the question through it, not through Nabble? There are no mails stuck in the group's control panel (and I do no filtering, only Google's spam filtering).

And, to answer your question, splitting md5 values will reduce the number of distinct terms indexed. Where this plays a major part memory-wise is in Lucene loading intervals of those terms into memory (by default, every 128th term). But I suggest first testing whether you actually run into memory problems (sadly, how much memory that takes is not exposed, as there isn't a nice API to get it (yet :slight_smile: ).
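(A rough illustration of the difference, my own numbers rather than anything measured in this thread: with every 128th term loaded, roughly 50,000,000 distinct MD5 terms would keep about 390,000 term-index entries in memory, whereas the splitted index's 524,288 distinct chunk terms would keep only about 4,096.)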

One more thing: you will need to make sure that the query takes into account the ordering of those split terms, which makes it more difficult and more expensive to query. And, mapping and memory wise, make sure to set omit_norms to true, and omit_term.. to true. This will make indexing and its memory usage more lightweight.

I've written a HashSplitter Tokenizer (and TokenFilter), which splits any hash (or value) into tokens of the same size, each with a prefix that indicates its position.
For instance, with a chunk length of 3, and prefixes set to "0123456789", the value "foobarbaz" becomes ["0foo","1bar","2baz"].

This works well. I took inspiration from NGramTokenizer and NGramTokenFilter for the implementation, and from the ICU plugin to create a packaged plugin.
The configuration of the chunk_length and prefixes values is done within config/elasticsearch.yml:

index:
  analysis:
    analyzer:
      md5_hashsplitter:
        type: custom
        tokenizer: md5_hashsplitter_tokenizer
    tokenizer:
      md5_hashsplitter_tokenizer:
        type: hash_splitter
        chunk_length: 4
        prefixes: ABCDEFGH
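With that analyzer defined, the md5 field just needs to reference it in the mapping; a minimal sketch (my own, following the mapping fragment from the original question, not something tested in this thread) would be along the lines of:
"md5" : {"type" : "string", "analyzer" : "md5_hashsplitter", "store" : "yes"}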

Indexing and querying now work transparently, which is great, but I have a problem:
As I have many collisions for every chunk, I get many results for a query like q=md5:d41d8cd98f00b204e9800998ecf8427e ...

If I add &default_operator=AND then I get only one result, as expected.
My question is:
Is there a way to make the default_operator parameter automatic for a specific field? (I think not, because this has to do with querying, not indexing, so no configuration seems possible.)
Or at least, is it possible to make AND the default_operator for a specific field inside the query in a clean fashion?
Or even better, is it possible to modify something in the code somewhere (in my TokenFilter or Tokenizer) to make sure it does an AND?
Maybe a TokenFilter/Tokenizer for the query side (not indexing), that adds a "+" before each output token...


Thank you for your help!
Enjoy your week end.

Olivier @Yakaz

There isn't an option to control the operator based on a field name, but you can create a specific mapping type (check the attachment plugin for an example) that is an md5 type. There, you can actually control the query that will be used when querying against that field in the query-parser case, and for "term" queries in general. So you can construct your boolean query there and break the value up yourself.

If you have a specific type, you won't need a special tokenizer, since you can construct the string to be indexed yourself and then use something like the whitespace analyzer on it (a bit simpler, or at least an option).

Still, I would check whether you really need this, and that you are not optimizing before you really need to.
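To illustrate the "construct your boolean query yourself" idea, here is a hedged sketch (my own, not code from this thread; the helper name is hypothetical) of a Query DSL body with one required term clause per prefixed chunk, which forces AND semantics without relying on default_operator:

def md5_bool_query(md5_hex, chunk_length=4, prefixes="ABCDEFGH"):
    # One required term clause per prefixed chunk: a document matches
    # only if it contains all 8 chunks, i.e. the exact MD5 value.
    chunks = [prefixes[i] + md5_hex[i * chunk_length:(i + 1) * chunk_length]
              for i in range(len(md5_hex) // chunk_length)]
    return {"query": {"bool": {"must": [{"term": {"md5": c}} for c in chunks]}}}

print(md5_bool_query("d41d8cd98f00b204e9800998ecf8427e"))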
