Recently a customer of ours confused himself and us when he discovered
his corpus of documents contains (at least) two different characters
used for the apostrophe.
I didn't recognize the issue at first.
For those who aren't familiar with this problem: depending on the
history of the characters in a document, an apostrophe used as a
possessive in English, e.g.
"customer's report" (one report from one customer), may be any one of
several characters, and apparently the 3.x StandardAnalyzer doesn't take
this into consideration.
Did I miss a mention of a fix for this? I was conducting tests against
Lucene 3.4, but didn't see mention in ES either (I'm not running 4.x yet).
While I found some discussion of this issue over the years, I was
surprised to not find any general solution either in Standard Analyzer
or in some extra Filter that I might leverage in a filter chain. Am I
missing something?
I also have to say that even fairly small test document sets, gathered
as ordinary domain examples from the web, have now been shown to include
different apostrophe characters.
My suggested solution matches the one line from the Snowball page (see
below): "Clearly other codes for apostrophe can be mapped to this
[apostrophe] code prior to stemming." I'm sure I DO NOT want to mess
with things too much. I already don't use the standard analyzer (so as
not to bother dropping stopwords), so I was thinking that a simple
filter, chainable before the standard filter, that looks for odd
apostrophes followed by s and replaces the odd character with U+0027
(http://www.fileformat.info/info/unicode/char/0027/index.htm) would do
the trick.
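A minimal sketch of the mapping described above, written in plain Python rather than as a Lucene filter. The names ODD_APOSTROPHES and normalize_possessive are hypothetical, and the set of "odd" characters is an assumption taken from the code points listed later in this message, not an exhaustive list.

```python
import re

# Assumed set of "odd" apostrophes: the code points discussed in this message.
ODD_APOSTROPHES = "\u0091\u0092\u2018\u2019\u201B"
APOSTROPHE = "\u0027"

# Match an odd apostrophe only when it follows a word character and is
# followed by a word-final "s" (the English possessive case).
_POSSESSIVE = re.compile(r"(?<=\w)[" + ODD_APOSTROPHES + r"](?=[sS]\b)")


def normalize_possessive(text):
    """Replace odd apostrophes before a final 's' with U+0027."""
    return _POSSESSIVE.sub(APOSTROPHE, text)


if __name__ == "__main__":
    print(normalize_possessive("customer\u2019s report"))
```

Quotation marks that open or close a quoted phrase are left alone, because they are not followed by a word-final s. Other contractions such as "don\u2019t" are also left untouched, since the proposal covers only the "followed by s" case.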
Any thoughts or help?
-Paul
All the background information I have on the topic.
Smart editors (MS Word and, I believe, Adobe Acrobat) convert the
character from the ordinary single quote/apostrophe key (on the modern
English MS keyboard, the un-shifted character on the double-quote and
single-quote (?) key) to various other characters. Meanwhile, neither
browser web page entry boxes (by default) nor simpler text editors mess
with the characters typed, which usually results in an APOSTROPHE.
Paul's example of an apostrophe and a 'full quote' generated by the
Outlook editor.
Paul's example of an apostrophe and a 'full quote' typed into a browser
field.
Assuming my e-mail editor, my e-mailer, the list mailer, your e-mail
program and your viewer all preserved the characters along the way, the
1st line uses 3 different characters and the 2nd uses 1. I only typed
one character in all cases.
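One way to check which characters actually survived the trip is to list the distinct quote-like code points in a line. This is only a sketch using Python's standard unicodedata module. The function name describe_quotes and the rule for what counts as "quote-like" are assumptions made for illustration.

```python
import unicodedata


def is_quote_like(ch):
    # U+0027 itself, Unicode initial/final quote punctuation (Pi/Pf),
    # or a C1 control code point such as U+0091/U+0092.
    return (ch == "\u0027"
            or unicodedata.category(ch) in ("Pi", "Pf")
            or "\u0080" <= ch <= "\u009F")


def describe_quotes(line):
    """Return (code point, Unicode name) for each distinct quote-like character."""
    seen = []
    for ch in line:
        if is_quote_like(ch):
            entry = ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
            if entry not in seen:
                seen.append(entry)
    return seen


if __name__ == "__main__":
    for entry in describe_quotes("Paul\u2019s \u2018full quote\u2019"):
        print(entry)
```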
Various characters that might show up include the following:
U+0027 http://www.fileformat.info/info/unicode/char/0027/index.htm
APOSTROPHE Original ASCII character, probably what your keyboard sends,
but I can't promise anything.
U+0091 http://www.fileformat.info/info/unicode/char/0091/index.htm
Left single quotation mark at byte 0x91 in Windows-1252, often labeled
as ISO 8859-1 / ISO Latin 1 (Note 1), but listed as a control character
(PRIVATE USE ONE) in official Unicode.
U+0092 http://www.fileformat.info/info/unicode/char/0092/index.htm
Right single quotation mark at byte 0x92 in Windows-1252, often labeled
as ISO 8859-1 / ISO Latin 1 (Note 1), but listed as a control character
(PRIVATE USE TWO) in official Unicode.
U+2018 http://www.fileformat.info/info/unicode/char/2018/index.htm
LEFT SINGLE QUOTATION MARK The official Unicode character. This is what
I get from the above example generated in 2013.
U+2019 http://www.fileformat.info/info/unicode/char/2019/index.htm
RIGHT SINGLE QUOTATION MARK The official Unicode character. This is what
I get from the above example generated in 2013.
U+201B http://www.fileformat.info/info/unicode/char/201B/index.htm
SINGLE HIGH-REVERSED-9 QUOTATION MARK Mentioned as a special-case use
at Tartarus.org in other contexts, e.g. O'Reilly (see link below).
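One possible explanation for U+0091 and U+0092 turning up in a corpus, offered as an assumption rather than something verified against the customer's documents: in Windows-1252 the bytes 0x91 and 0x92 are the curly single quotes, and reading such a file as ISO 8859-1 leaves them as the C1 control code points. A small Python sketch of the difference:

```python
import unicodedata

# Bytes 0x91 and 0x92 as they might appear in a Windows-1252 encoded file.
raw = b"\x91quoted\x92"

as_cp1252 = raw.decode("cp1252")   # decoded as Windows-1252
as_latin1 = raw.decode("latin-1")  # decoded as ISO 8859-1

print([hex(ord(c)) for c in as_cp1252 if ord(c) > 0x7F])  # ['0x2018', '0x2019']
print([hex(ord(c)) for c in as_latin1 if ord(c) > 0x7F])  # ['0x91', '0x92']
print(unicodedata.category("\u0092"))                     # Cc (control character)
```

If that is how the characters got there, decoding the source as Windows-1252 in the first place would leave only U+2018 and U+2019 to map.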
Standards are just crazy things in the real world since they are never
followed fully.
Typing a single quote from the keyboard into the website
http://www.babelstone.co.uk/unicode/whatisit.html
using either Firefox or IE reports back that it got U+0027, the
old-fashioned apostrophe, but the page for U+2019
http://www.fileformat.info/info/unicode/char/2019/index.htm says
[U+2019] "is the preferred character to use for apostrophe".
The Snowball stemmer folks spotted the problem and summarized it at:
http://snowball.tartarus.org/texts/apostrophe.html
I didn't see any Filters there either. Maybe I didn't search well
enough, or maybe I used the wrong apostrophe when searching.
-Paul
(1) ISO 8859-1 ISO Latin 1 http://www.ascii-code.com/