Confused about when and how asciifolding happens


(Chase) #1

Hello again,

In short, I'm trying to set up ElasticSearch so that it always sees
accented characters as a fantastic opportunity for asciifolding.

I've added asciifolding to the default analyzer.

If somebody searches for "Diaz", it should match:

  1. "Diaz", and
  2. "Díaz"

If somebody searches for "Díaz", it should match:
  3. "Diaz", and
  4. "Díaz"

In reality, I get #1 and #2, but not #3 or, quite surprisingly, #4.
Why not?
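
To make the expectation above concrete, here is a minimal Python sketch of the matching behaviour I'm after. It is not ElasticSearch's asciifolding filter, only a rough stand-in built on Unicode decomposition; `fold` and `matches` are hypothetical helper names.

```python
import unicodedata


def fold(text: str) -> str:
    """Approximate lowercase + asciifolding (an assumption, not the real filter)."""
    # Decompose accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def matches(query: str, indexed: str) -> bool:
    """A query matches when both sides fold to the same token."""
    return fold(query) == fold(indexed)


# The four cases from the lists above; "\u00ed" is the precomposed i-acute.
for query in ("Diaz", "D\u00edaz"):
    for indexed in ("Diaz", "D\u00edaz"):
        print(query, "vs", indexed, "->", matches(query, indexed))
```

If folding is applied to both the indexed text and the query, all four combinations should come out as matches, which is why missing #3 and #4 surprises me.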

See https://gist.github.com/972878

In fact, even with default settings, "Díaz" does not seem to match
"Díaz". Maybe this is a problem with the Mac OS X Terminal I'm using?
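
One way to test the Terminal theory is to look at the exact bytes being sent. The sketch below is only a diagnostic aid under my own assumptions (that the difference is in encoding or Unicode normalization); `is_valid_utf8` is a hypothetical helper, and nothing here is taken from ElasticSearch itself.

```python
import unicodedata

name = "D\u00edaz"  # "Díaz" with a precomposed i-acute

# The same visible text can be composed (NFC) or decomposed (NFD).
composed = unicodedata.normalize("NFC", name)
decomposed = unicodedata.normalize("NFD", name)

utf8_composed = composed.encode("utf-8")      # i-acute as one code point
utf8_decomposed = decomposed.encode("utf-8")  # "i" followed by a combining accent
latin1 = composed.encode("latin-1")           # what a Latin-1 terminal would send


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_composed, is_valid_utf8(utf8_composed))
print(utf8_decomposed, is_valid_utf8(utf8_decomposed))
print(latin1, is_valid_utf8(latin1))
```

Comparing these byte sequences with what the shell actually puts in the request body should show whether the query arrives as the same characters that were indexed.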

A related question: if I create a custom analyzer with
asciifolding and specify it in a mapping, but don't have
asciifolding in the default analyzer, then I only get the benefit of
asciifolding when searching on specific fields. In the above
example, if asciifolding were enabled for the name field in its
mapping, then the query "name:diaz" would match "Cameron Díaz", but a
query of "diaz" would not.

I sort of understand this as a design choice, but at the same time, it
would be nice (in a principle-of-least-surprise way) if the filters on a
field were always active, no matter how you're searching it. Intuitively,
I assumed that when asciifolding was turned on, it would tokenize "Díaz"
as "diaz" -- but evidently not? Should I be doing something differently?
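
For reference, this is roughly the shape of index settings I mean by "asciifolding in the default analyzer", written as a Python dict that serializes to the JSON request body. It is a sketch under assumptions: the real settings are in the gist and are not reproduced here, and the choice of the `standard` tokenizer and `lowercase` filter is mine.

```python
import json

# Hypothetical index settings: a custom analyzer registered under the
# name "default" so that it applies to fields without their own analyzer.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Lowercase first, then fold accented characters to ASCII.
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}

# JSON body to send when creating the index.
body = json.dumps(index_settings, indent=2)
print(body)
```

The per-field alternative would register the same analyzer under another name and reference it from the name field's mapping, which is the case where only "name:diaz" benefits from folding.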

Thanks in advance,
-Chase


(system) #2