Filtering for apostrophes and single quotes confusion?

Recently a customer of ours confused himself and us when he discovered
his corpus of documents contains (at least) two different characters
used for the apostrophe.
I didn't recognize the issue at first.

For those who aren't familiar with this problem: depending on the
history of the characters in a document, an apostrophe used as a
possessive in English, e.g. "customer's report" (one report from one
customer), may be any one of several characters, and apparently the 3.x
StandardAnalyzer doesn't take this into consideration.
Did I miss a mention of a fix for this? I was conducting tests against
Lucene 3.4, but didn't see a mention in ES either (I'm not running 4.x yet).

While I found some discussion of this issue over the years, I was
surprised to not find any general solution either in Standard Analyzer
or in some extra Filter that I might leverage in a filter chain. Am I
missing something?

I should also say that even fairly small test document sets, gathered as
ordinary domain examples from the web, have now been shown to include
different apostrophe characters.

My suggested solution matches the one line from the Snowball page (see
below): "Clearly other codes for apostrophe can be mapped to this
[apostrophe] code prior to stemming." I'm sure I DO NOT want to mess
with things too much. I already don't use the standard analyzer (so as
not to bother dropping stopwords), so I was thinking that a simple
filter, chainable before the standard filter, that looks for odd
apostrophes followed by s and replaces the odd char with U+0027
http://www.fileformat.info/info/unicode/char/0027/index.htm would do
the trick.
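
To make the proposed replacement concrete outside of Lucene, here is a
minimal Python sketch of the character logic only. It is not a Lucene
filter; the set of "odd" characters and the name normalize_possessive
are assumptions made up for illustration.

```python
import re

# Assumed set of look-alike characters; adjust it for your own corpus.
ODD_APOSTROPHES = "\u0091\u0092\u2018\u2019\u201B"

# An odd apostrophe that follows a word character and is itself followed
# by a word-final "s" (the possessive "customer's" case).
_POSSESSIVE = re.compile(r"(?<=\w)[" + ODD_APOSTROPHES + r"](?=[sS]\b)")


def normalize_possessive(text: str) -> str:
    """Replace an odd apostrophe before a word-final "s" with U+0027."""
    return _POSSESSIVE.sub("\u0027", text)


if __name__ == "__main__":
    # ascii() makes the code points visible regardless of terminal encoding.
    print(ascii(normalize_possessive("customer\u2019s report")))
    print(ascii(normalize_possessive("a \u2018full quote\u2019")))
```

Quotation marks around a phrase are deliberately left alone here, which
is the narrow behaviour described above; whether that narrowness is
actually needed is a separate question.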

Any thoughts or help?

-Paul


All the background information I have on the topic.

Smart editors (MS Word and, I believe, Adobe Acrobat) convert the ordinary
single quote/apostrophe key (on the modern English MS keyboard, that is
the un-shifted character on the double-quote/single-quote key) to various
other characters. Meanwhile, neither browser web page entry boxes (by
default) nor simpler text editors mess with the characters typed, usually
resulting in an APOSTROPHE.

Paul’s example of an apostrophe and a ‘full quote’ generated by the
Outlook editor.

Paul's example of an apostrophe and a 'full quote' typed into a browser
field.

Assuming my e-mail editor, my e-mailer, the list mailer, your e-mail
program and your viewer all preserved the characters along the way, the
1st line uses 3 different characters; the 2nd uses 1. I only typed one
character in all cases.

Various characters that might show up include the following:
U+0027 http://www.fileformat.info/info/unicode/char/0027/index.htm
APOSTROPHE. The original ASCII character, probably what your keyboard
sends, but I can't promise anything.
U+0091 http://www.fileformat.info/info/unicode/char/0091/index.htm
Listed as a left single quotation mark in the extended ASCII table at
(Note 1). That byte value (0x91) is the left single quotation mark in
Windows-1252; ISO 8859-1 proper and official Unicode treat U+0091 as a
control character (PRIVATE USE ONE).
U+0092 http://www.fileformat.info/info/unicode/char/0092/index.htm
Listed as a right single quotation mark in the extended ASCII table at
(Note 1). That byte value (0x92) is the right single quotation mark in
Windows-1252; ISO 8859-1 proper and official Unicode treat U+0092 as a
control character (PRIVATE USE TWO).
U+2018 http://www.fileformat.info/info/unicode/char/2018/index.htm
LEFT SINGLE QUOTATION MARK. The official Unicode character. This is what
I get from the above example generated in 2013.
U+2019 http://www.fileformat.info/info/unicode/char/2019/index.htm
RIGHT SINGLE QUOTATION MARK. The official Unicode character. This is
what I get from the above example generated in 2013.
U+201B http://www.fileformat.info/info/unicode/char/201B/index.htm
SINGLE HIGH-REVERSED-9 QUOTATION MARK. Mentioned as a special-case use
at Tartarus.org in other contexts, e.g. O'Reilly (see link below).
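
A quick way to check which of these characters you are actually looking
at is to ask Python's unicodedata module. The sketch below is only an
illustration; the CANDIDATES list and the describe helper are made-up
names. The second half shows one likely route by which U+0091/U+0092 end
up in a corpus: Windows-1252 bytes decoded as ISO 8859-1.

```python
import unicodedata

# The code points from the list above.
CANDIDATES = ["\u0027", "\u0091", "\u0092", "\u2018", "\u2019", "\u201B"]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME'; control characters have no name, so show the category."""
    fallback = "<no name; category " + unicodedata.category(ch) + ">"
    return f"U+{ord(ch):04X} {unicodedata.name(ch, fallback)}"


if __name__ == "__main__":
    for ch in CANDIDATES:
        print(describe(ch))

    # The same byte 0x92 read with two different encodings.
    raw = b"Paul\x92s"
    print(ascii(raw.decode("cp1252")))   # Windows-1252: a curly quote
    print(ascii(raw.decode("latin-1")))  # ISO 8859-1: a C1 control character
```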

Standards are just crazy things in the real world since they are never
followed fully.
Typing a single quote from the keyboard into the website
http://www.babelstone.co.uk/unicode/whatisit.html
using either Firefox or IE reports back that it got U+0027, the
old-fashioned apostrophe, but the Unicode notes on the page for U+2019
http://www.fileformat.info/info/unicode/char/2019/index.htm say
[U+2019] "is the preferred character to use for apostrophe".

The Snowball parser folks spotted the problem and summarized it at:
http://snowball.tartarus.org/texts/apostrophe.html
I didn't see any filters there either, but maybe I didn't search well
enough, or maybe I used the wrong apostrophe when searching.

-Paul

(1) ISO 8859-1 ISO Latin 1 http://www.ascii-code.com/

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Paul

Interesting...

From this:

For these reasons, the English stemmer treats apostrophe as a letter,
removing it from the beginning of a word, where it might have stood for an
opening quote, from the end of the word, where it might have stood for a
closing quote, or been an apostrophe following s. The form ’s is also
treated as an ending.

... it sounds like things will work correctly as long as you normalize all
single quotes/apostrophes to the same character, which you can do with a
char filter:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "quotes" : {
               "filter" : [
                  "standard",
                  "lowercase"
               ],
               "char_filter" : [
                  "quotes"
               ],
               "tokenizer" : "standard"
            }
         },
         "char_filter" : {
            "quotes" : {
               "mappings" : [
                  "\u0091=>\u0027",
                  "\u0092=>\u0027",
                  "\u2018=>\u0027",
                  "\u2019=>\u0027"
               ],
               "type" : "mapping"
            }
         }
      }
   }
}
'

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty&analyzer=quotes' -d '
Paul’s example of an apostrophe and a ‘full quote’ generated by the Outlook
editor.
'

{
   "tokens" : [
      {
         "end_offset" : 6,
         "position" : 1,
         "start_offset" : 0,
         "type" : "",
         "token" : "paul's"
      },
      {
         "end_offset" : 14,
         "position" : 2,
         "start_offset" : 7,
         "type" : "",
         "token" : "example"
      },
      {
         "end_offset" : 17,
         "position" : 3,
         "start_offset" : 15,
         "type" : "",
         "token" : "of"
      },
      {
         "end_offset" : 20,
         "position" : 4,
         "start_offset" : 18,
         "type" : "",
         "token" : "an"
      },
      {
         "end_offset" : 31,
         "position" : 5,
         "start_offset" : 21,
         "type" : "",
         "token" : "apostrophe"
      },
      {
         "end_offset" : 35,
         "position" : 6,
         "start_offset" : 32,
         "type" : "",
         "token" : "and"
      },
      {
         "end_offset" : 37,
         "position" : 7,
         "start_offset" : 36,
         "type" : "",
         "token" : "a"
      },
      {
         "end_offset" : 43,
         "position" : 8,
         "start_offset" : 39,
         "type" : "",
         "token" : "full"
      },
      {
         "end_offset" : 49,
         "position" : 9,
         "start_offset" : 44,
         "type" : "",
         "token" : "quote"
      },
      {
         "end_offset" : 60,
         "position" : 10,
         "start_offset" : 51,
         "type" : "",
         "token" : "generated"
      },
      {
         "end_offset" : 63,
         "position" : 11,
         "start_offset" : 61,
         "type" : "",
         "token" : "by"
      },
      {
         "end_offset" : 67,
         "position" : 12,
         "start_offset" : 64,
         "type" : "",
         "token" : "the"
      },
      {
         "end_offset" : 75,
         "position" : 13,
         "start_offset" : 68,
         "type" : "",
         "token" : "outlook"
      },
      {
         "end_offset" : 82,
         "position" : 14,
         "start_offset" : 76,
         "type" : "",
         "token" : "editor"
      }
   ]
}


On 5/24/2013 3:22 AM, Clinton Gormley wrote:

Hi Paul

Interesting...

From this:

For these reasons, the English stemmer treats apostrophe as a letter,
removing it from the beginning of a word, where it might have stood
for an opening quote, from the end of the word, where it might have
stood for a closing quote, or been an apostrophe following /s/. The
form /’s/ is also treated as an ending.

... it sounds like things will work correctly as long as you normalize
all single quotes/apostrophes to the same character, which you can do
with a char filter:

Thanks for the response. You are right, the Snowball parser will be
happy with just simple character replacement, so there's no need to try
to identify only "xxx's" occurrences using a custom token filter. I'm
a little nervous that I'd throw off some other filter, but my particular
configuration is all under my control, so all is good.

Just to complete the record: while looking at the Lucene code, I did
spot that the simple EnglishPossessiveFilter also thinks it's worth
looking for one more character, U+FF07. In Unicode this is called
FULLWIDTH APOSTROPHE, which seems more likely to turn up than the last
one in my original list (the entry whose link points to 201B), U+201B
SINGLE HIGH-REVERSED-9 QUOTATION MARK (gosh, what a name!), even if that
one is used in some actual non-technical published documents, as
mentioned on the Snowball page, but not as a possessive or a quote.

I'd suggest your example char filter should get one more entry for this
fat or full-width apostrophe.

"char_filter" : {
   "quotes" : {
      "mappings" : [
         "\u0091=>\u0027",
         "\u0092=>\u0027",
         "\u2018=>\u0027",
         "\u2019=>\u0027",
         "\uFF07=>\u0027"
      ],
      "type" : "mapping"
   }
}
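
For checking the mappings without a running cluster, the same five
replacements can be sketched in plain Python. This only mirrors the
character substitution; it does not reproduce the tokenizer or the
lowercase filter, and QUOTE_MAP and normalize_quotes are illustrative
names, not anything from Elasticsearch or Lucene.

```python
# The same five mappings as the char filter above, as a translation table.
QUOTE_MAP = str.maketrans({
    "\u0091": "\u0027",
    "\u0092": "\u0027",
    "\u2018": "\u0027",
    "\u2019": "\u0027",
    "\uFF07": "\u0027",
})


def normalize_quotes(text: str) -> str:
    """Map every listed quote/apostrophe variant to U+0027."""
    return text.translate(QUOTE_MAP)


if __name__ == "__main__":
    sample = "Paul\u2019s example of an apostrophe and a \u2018full quote\u2019"
    print(normalize_quotes(sample))
```

Every mapping is one character to one character, so the normalized
string has the same length as the input.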

I'd leave out all the myriad other characters that look like
apostrophes. They all seem to be special linguistic marks, which I hope
remain part of the term for any linguistic processing, or maybe get
filtered away further along when no one cares.

-Paul

p.s. The fully correct way to spell Hawaii uses one of those really
special characters -- Hawaiʻi. see


"used in Hawaiian orthography as `okina (glottal stop)" (but that last
sentence used what many folks use for glottal stop - a grave accent).
Now you too can form sentences of the form "Hawaiʻi's language
orthography has its own special characters."
