Highlight output changed in 0.16


(ruflin-2) #1

With 0.16 the highlight output changed a little bit:

  1. Elastica_Query_HighlightTest::testHightlightSearch
    Failed asserting that two arrays are equal.
    --- Expected
    +++ Actual
    @@ @@
     Array
     (
         [email] => Array
             (
    -            [0] => <em class="highlight">test@test.com</em>
    +            [0] => <em class="highlight">test</em>@<em class="highlight">test.com</em>
             )
     )

The output with a single <em> element is from version 0.15.2; the output
with multiple <em> elements is from 0.16. Is this change expected?


(Clinton Gormley) #2

Hi Ruflin

I think it isn't the highlighting that has changed, but the default
analyzer which now breaks up email addresses into several terms. Before,
email addresses produced a single term.

clint
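
To illustrate the point above, here is a toy sketch (not Lucene's highlighter) of why the same highlight logic yields different markup once the analyzer splits the address. The two regular expressions are hypothetical stand-ins for the old and new tokenization, and the sets of matched terms are assumed for the example.

```python
import re

def highlight(text, token_pattern, matched_terms):
    """Wrap every token that is one of the matched terms in the highlight tag."""
    def wrap(match):
        token = match.group(0)
        if token.lower() in matched_terms:
            return '<em class="highlight">' + token + '</em>'
        return token
    return re.sub(token_pattern, wrap, text)

# Hypothetical stand-ins for the two tokenization behaviours.
WHOLE_EMAIL = r"[^\s@]+@[^\s@]+"  # old: the address is a single term
SPLIT_AT_SIGN = r"[^\s@]+"        # new: '@' separates terms

old = highlight("test@test.com", WHOLE_EMAIL, {"test@test.com"})
new = highlight("test@test.com", SPLIT_AT_SIGN, {"test", "test.com"})

print(old)
print(new)
```

The highlighter itself is unchanged between the two calls; only the token boundaries differ.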


(Shay Banon) #3

Yes, that's the new behavior in Lucene 3.1. You can now specify a Lucene version on tokenizer/analyzer/... to revert to the old behavior.
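
A minimal sketch of what such a setting might look like when building index settings. The setting name `version`, the value `"3.0"`, and the analyzer name `pre_31_standard` are assumptions made for illustration and should be checked against the documentation for your Elasticsearch release.

```python
import json

# Assumption: the component-level setting is named "version" and accepts a
# Lucene release string such as "3.0". Verify this before relying on it.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Hypothetical analyzer name chosen for this example.
                "pre_31_standard": {
                    "type": "standard",
                    "version": "3.0",
                }
            }
        }
    }
}

# Serialize to the JSON body that would be sent when creating the index.
body = json.dumps(index_settings, indent=2)
print(body)
```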


(Lukáš Vlček) #4

Hi,

I have not had a chance to become fully familiar with all the new Lucene 3.1
analyzers, but as far as I understand it is possible to create token filters
specifically for emails, URLs and paths based on the uax_url_email tokenizer.
Is this directly exposed in ES 0.16?

Looking at ruflin's original code, that would be the best solution IMHO
(as he has email addresses in the text). See EmailFilter in
TestUAX29URLEmailTokenizer.java (
http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_3_1/lucene/src/test/org/apache/lucene/analysis/TestUAX29URLEmailTokenizer.java?view=markup
)

Regards,
Lukas



(Clinton Gormley) #5

Hi Lukas

Yes it is:

http://www.elasticsearch.org/guide/reference/index-modules/analysis/uaxurlemail-tokenizer.html
http://www.elasticsearch.org/guide/reference/index-modules/analysis/pathheirarchy-tokenizer.html

clint
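
As a rough sketch of how the `uax_url_email` tokenizer could be wired into a custom analyzer, the settings below build a JSON body for index creation. The analyzer name `email_aware` and the choice of the `lowercase` filter are assumptions for illustration, not something stated in this thread.

```python
import json

index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Hypothetical analyzer name chosen for this example.
                "email_aware": {
                    "type": "custom",
                    # Tokenizer that keeps URLs and email addresses whole.
                    "tokenizer": "uax_url_email",
                    # Assumed filter chain; adjust to your needs.
                    "filter": ["lowercase"],
                }
            }
        }
    }
}

# Serialize to the JSON body that would be sent when creating the index.
body = json.dumps(index_settings, indent=2)
print(body)
```

A field mapped with such an analyzer should index an address like test@test.com as one term again, which is what the 0.15.2 test output relied on.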


(Clinton Gormley) #6

Correct URL for Path Hierarchy:

http://www.elasticsearch.org/guide/reference/index-modules/analysis/pathhierarchy-tokenizer.html



(Lukáš Vlček) #7

I was probably not clear. What I was asking is whether there is any option
to configure an email filter: given a text that contains several email
addresses, the output would be only those email addresses. That is what that
Lucene test does.
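
The idea can be sketched as a filter that keeps only tokens of a given type. This is a toy model: the regular expressions are simplified stand-ins for the real UAX #29 rules, and `tokenize` and `email_filter` are hypothetical helper names, not Lucene or Elasticsearch APIs.

```python
import re

# Deliberately simplified patterns; the real tokenizer follows UAX #29 rules.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
TOKEN = re.compile(EMAIL.pattern + r"|\w+")

def tokenize(text):
    """Yield (term, type) pairs, tagging email addresses as '<EMAIL>'."""
    for match in TOKEN.finditer(text):
        term = match.group(0)
        token_type = "<EMAIL>" if EMAIL.fullmatch(term) else "<ALPHANUM>"
        yield term, token_type

def email_filter(tokens):
    """Keep only the terms whose token type is '<EMAIL>'."""
    return [term for term, token_type in tokens if token_type == "<EMAIL>"]

text = "Contact alice@example.com or bob@example.org for details."
emails = email_filter(tokenize(text))
print(emails)
```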



(ruflin-2) #8

@Clinton & Shay: Good to know. Then I will update the tests.



(Shay Banon) #9

There isn't a built-in EmailFilter. The one in the test is there just to make sure the relevant token type is used.


(Lukáš Vlček) #10

Shay,

I think those filters could be useful, for example if I am interested in
emails or URLs and not in the text content in between. Just an idea for a
nice-to-have feature.

Lukas



(Shay Banon) #11

Sounds good.

