Best way to index human names?

Hello! I am building an index that includes personal names, and after a lot of searching around I haven't been able to find a good way to index these names that allows for the large amount of variation in how names are written.

For example, take the name "J. R. R. Tolkien". For the index field, I am using a text field with a custom analyzer that strips periods, lowercases, and tokenizes on whitespace. If the source text is "J. R. R. Tolkien", I end up with the tokens ["j", "r", "r", "tolkien"] in my index. However, the user's query text could realistically be (ignoring case) "j.r.r. tolkien", "j. r. r. tolkien", "jrr tolkien", "j r r tolkien" or just "tolkien", that is, with or without spaces and punctuation. With the tokens above, I get poor results from a match query when the query omits the spaces (e.g. "jrr tolkien"), because each initial is a separate token in the index while the query produces the single token "jrr". It seems like I would want both versions on my name field somehow?
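To make the mismatch concrete, here is a small Python sketch that only mimics the analyzer described above. It is not Elasticsearch code, and `analyze` and `with_merged_initials` are hypothetical helper names I made up for illustration. The second function shows one possible reading of "both versions": keep the separate initials and also add a merged token for each run of initials. That is an assumption about what might help, not a known best practice.

```python
def analyze(text):
    """Mimic the analyzer from the post: drop periods, lowercase, split on whitespace."""
    return text.replace(".", "").lower().split()


def with_merged_initials(tokens):
    """Return the tokens plus one merged token per run of two or more single-letter tokens.

    Hypothetical helper: this is one way to index "both versions" of the initials.
    """
    out = list(tokens)
    run = []
    for tok in tokens + [""]:  # the empty sentinel flushes a run at the end of the list
        if len(tok) == 1:
            run.append(tok)
        else:
            if len(run) > 1:
                out.append("".join(run))
            run = []
    return out


indexed = analyze("J. R. R. Tolkien")   # separate initials plus the surname
query = analyze("jrr tolkien")          # initials arrive as a single token

# Only "tolkien" overlaps between the indexed tokens and the query tokens.
print(sorted(set(indexed) & set(query)))

# With the merged token added at index time, every query token is present.
expanded = with_merged_initials(indexed)
print(all(tok in expanded for tok in query))
```

In this sketch the spaced queries ("j r r tolkien") still match the individual initials, and the unspaced ones ("jrr tolkien", "j.r.r. tolkien") match the merged token. Whether the same effect is best achieved inside an analyzer or by preprocessing the names before indexing is exactly what I am unsure about.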

What is the best way to handle cases like this? Do I need to create a separate field for each variation? I do have some millions of documents, so it's also not realistic to create each variation by hand.
