Searching on UUIDs

UUIDs are basically 16-byte fields. They can be displayed as 128 bits of 0s and 1s, two loooooong signed decimal (long) integers, a
32 character hexadecimal number, or a 22 character base 58 number.
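
A rough sketch of those representations, using only the Python standard library. The sample UUID, the helper names, and the choice of the Bitcoin-style base 58 alphabet are assumptions made for illustration; the post does not say which alphabet it has in mind.

```python
import uuid

# Bitcoin-style base 58 alphabet (assumption: no alphabet is named in the post).
B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def to_signed64(x):
    """Interpret an unsigned 64-bit value as a signed two's-complement long."""
    return x - (1 << 64) if x >= (1 << 63) else x


def to_base58(n):
    """Encode a non-negative integer in base 58, without leading padding."""
    if n == 0:
        return B58[0]
    digits = []
    while n:
        n, r = divmod(n, 58)
        digits.append(B58[r])
    return "".join(reversed(digits))


def representations(u):
    """Return the same 128-bit value in the forms listed above."""
    hi, lo = u.int >> 64, u.int & ((1 << 64) - 1)
    return {
        "bytes": u.bytes,                             # 16 raw bytes
        "bits": format(u.int, "0128b"),               # 128 characters of 0/1
        "longs": (to_signed64(hi), to_signed64(lo)),  # two signed 64-bit values
        "hex": u.hex,                                 # 32 hex characters
        "base58": to_base58(u.int),                   # at most 22 characters
    }


u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")  # arbitrary sample value
for name, value in representations(u).items():
    print(name, value)
```

Since 58^22 is larger than 2^128 while 58^21 is not, the base 58 form of a UUID never needs more than 22 characters.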

I'd like a field of type uuid that I could filter on. Should I use a string type? I will mostly be searching for a single value as an AND filter or a small set of values as an OR filter.

--

I guess you care about unique document IDs?

The Solr wiki has some information about unique key construction:
http://wiki.apache.org/solr/UniqueKey

If Elasticsearch was a database, it would have to care about field types,
like a UUID field type.

Lucene, the search engine library behind Elasticsearch, converts everything to
strings in the index. For how numerical field searches are handled,
see http://wiki.apache.org/lucene-java/SearchNumericalFields

There are special data structures for numerical range searches, such as the
TrieRangeQuery of Schindler/Diepenbroek
(http://epic.awi.de/17813/1/Sch2007br.pdf), so even there, no numbers are
stored as values in a Lucene index.

Just a few words about other obstacles for a new field type.

You noticed that UUIDs are too large to be converted to (Java) integers or
longs.

Additionally, in ES, the JSON syntax does not care about limits for
numerical values. See http://www.ietf.org/rfc/rfc4627 "An implementation
may set limits on the range of numbers." Fortunately for ES, Jackson, the
JSON implementation, provides an API for java.lang.BigInteger, and could
also transport UUIDs encoded as a 16 byte sized octet array for later
conversion to Lucene format.
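
To make the point about numeric limits concrete, here is a small sketch. Python's json module is only a stand-in here and says nothing about how Jackson or ES behave; the sample UUID and the variable names are made up. It shows that a UUID read as one integer does not fit in a Java long, that it survives a JSON round trip only if the parser keeps arbitrary-precision integers, and that the hex string survives either way.

```python
import json
import uuid

u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")  # arbitrary sample value
n = u.int  # the UUID as a single unsigned 128-bit integer

# Largest value a signed 64-bit Java long can hold.
JAVA_LONG_MAX = 2**63 - 1
fits_in_long = n <= JAVA_LONG_MAX

# JSON itself puts no limit on the digits of a number...
doc = json.dumps({"id": n})

# ...but what comes back depends on the parser. Python keeps big integers exact.
exact = json.loads(doc)["id"]

# A parser that maps every number to a 64-bit double (simulated here with
# parse_int=float) silently loses the low-order bits.
lossy = int(json.loads(doc, parse_int=float)["id"])

# Sending the UUID as a hex string sidesteps the question entirely.
as_hex = json.loads(json.dumps({"id": u.hex}))["id"]

print(fits_in_long, exact == n, lossy == n, as_hex == u.hex)
```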

With all the conversions, some overhead would be added. It's not clear
whether this overhead outweighs the extra 16 characters of the UUID hex
representation in all cases. But with today's large storage capacities,
you have more than enough space, which saves us from CPU-expensive
UUID compaction.

In short, a UUID for ES should be treated just as a usual 32 character hex
string when it comes to indexing.
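
As a minimal sketch of that approach: normalize every UUID to the same 32 character lowercase hex form before indexing and before querying, then filter on the exact string. The field name "uuid" and both helper functions are hypothetical. The request bodies use the term and terms clauses inside a bool filter, and this assumes the field is indexed as a single un-analyzed token (a keyword field in recent Elasticsearch versions); older releases spell filters differently.

```python
import json
import uuid


def uuid_key(value):
    """Normalize any accepted UUID spelling to 32 lowercase hex characters."""
    return uuid.UUID(str(value)).hex


def uuid_filter(field, values):
    """Build a search body: 'term' for one value, 'terms' for several."""
    keys = [uuid_key(v) for v in values]
    if len(keys) == 1:
        clause = {"term": {field: keys[0]}}
    else:
        clause = {"terms": {field: keys}}
    return {"query": {"bool": {"filter": [clause]}}}


# Single value, to be combined with other clauses as an AND filter.
print(json.dumps(uuid_filter("uuid", ["123E4567-E89B-12D3-A456-426614174000"])))

# Small set of values, matching any of them (OR).
print(json.dumps(uuid_filter("uuid", [
    "123e4567-e89b-12d3-a456-426614174000",
    "{00112233-4455-6677-8899-aabbccddeeff}",
])))
```

Normalizing on both the indexing and the query side matters because an exact-match filter treats upper case, hyphens, or braces as a different string.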

Jörg

--