Highlight output changed in 0.16

ruflin_2 · April 28, 2011, 6:54am

With 0.16 the highlight output changed a little bit:

Elastica_Query_HighlightTest::testHightlightSearch
Failed asserting that two arrays are equal.
--- Expected
+++ Actual
@@ @@
 Array
 (
     [email] => Array
         (
-            [0] => <em class="highlight">test@test.com</em>
+            [0] => <em class="highlight">test</em>@<em class="highlight">test.com</em>
         )

 )

The output with a single <em> tag is from version 0.15.2; the output
with multiple <em> tags is from 0.16. Is this change expected?

Clinton_Gormley · April 28, 2011, 8:14am

Hi Ruflin

I think it isn't the highlighting that has changed, but the default
analyzer which now breaks up email addresses into several terms. Before,
email addresses produced a single term.
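
A minimal sketch of this effect, in Python rather than Lucene. The two
regular expressions are toy stand-ins for the old and new tokenization
(assumptions for illustration, not the real analyzers), and highlight() is
a hypothetical helper that wraps every token matching a query term:

```python
import re

# Toy stand-in for the pre-0.16 behavior: an email address is one term.
EMAIL_AWARE = re.compile(r"[\w.]+@[\w.]+|\w+(?:\.\w+)*")
# Toy stand-in for the 0.16 behavior: "@" separates terms.
WORD_BREAK = re.compile(r"\w+(?:\.\w+)*")

def highlight(text, query, pattern):
    """Wrap each token of `text` that also appears as a term of `query`."""
    query_terms = {m.group(0).lower() for m in pattern.finditer(query)}
    out = []
    pos = 0
    for m in pattern.finditer(text):
        out.append(text[pos:m.start()])  # keep separators untouched
        term = m.group(0)
        if term.lower() in query_terms:
            out.append('<em class="highlight">' + term + "</em>")
        else:
            out.append(term)
        pos = m.end()
    out.append(text[pos:])
    return "".join(out)

print(highlight("test@test.com", "test@test.com", EMAIL_AWARE))
# <em class="highlight">test@test.com</em>
print(highlight("test@test.com", "test@test.com", WORD_BREAK))
# <em class="highlight">test</em>@<em class="highlight">test.com</em>
```

The highlighting logic is identical in both calls; only the tokenization
differs, which is enough to reproduce both outputs from the failing test.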

clint

kimchy · April 28, 2011, 11:37am

Yes, that's the new behavior in Lucene 3.1. You can now specify a Lucene version on tokenizer/analyzer/... to revert to the old behavior.
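
A rough sketch of what such index settings might look like, built as a Python dict and printed as JSON. The setting key "version", the value format "3.0", and the analyzer name "legacy_standard" are assumptions made for illustration; check the analysis documentation of your release for the exact names before relying on them:

```python
import json

def analyzer_settings(lucene_version):
    """Build index settings that pin one analyzer to a Lucene version."""
    return {
        "index": {
            "analysis": {
                "analyzer": {
                    # "legacy_standard" is a hypothetical analyzer name.
                    "legacy_standard": {
                        "type": "standard",
                        # Assumed setting name and value format.
                        "version": lucene_version,
                    }
                }
            }
        }
    }

settings = analyzer_settings("3.0")
print(json.dumps(settings, indent=2))
```

A field mapped to this analyzer would then be tokenized with the older rules, while other fields keep the new default.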

Lukas_Vlcek1 · April 28, 2011, 12:25pm

Hi,

I have not had a chance to get fully familiar with all the new Lucene 3.1
analyzers, but as far as I understand it is possible to create token filters
specifically for emails, URLs and paths based on the uax_url_email tokenizer.
Is this directly exposed in ES 0.16?

Looking at ruflin's original code, that would be the best solution IMHO
(as he has email addresses in the text). See EmailFilter in
TestUAX29URLEmailTokenizer.java:
http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_3_1/lucene/src/test/org/apache/lucene/analysis/TestUAX29URLEmailTokenizer.java?view=markup

Regards,
Lukas

Clinton_Gormley · April 28, 2011, 12:38pm

Hi Lukas

Yes it is:

clint

Clinton_Gormley · April 28, 2011, 12:41pm

Correct URL for Path Hierarchy:

http://www.elasticsearch.org/guide/reference/index-modules/analysis/pathhierarchy-tokenizer.html

clint

Lukas_Vlcek1 · April 28, 2011, 12:48pm

I was probably not clear. What I was asking is whether there is any option
to configure an email filter: given a text that contains several email
addresses, the output would be only those email addresses.
That is what that Lucene test does.
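
To make the idea concrete, here is a small Python sketch of a type-based
filter. It is not the Lucene implementation: tokenize() is a hypothetical
regex tokenizer that labels each token with a type, and the type labels
"<EMAIL>" and "<ALPHANUM>" are placeholders chosen for this example.

```python
import re
from typing import List, NamedTuple

class Token(NamedTuple):
    text: str
    type: str

# Simplified patterns (assumptions): an email-like token, else a plain word.
TOKEN_RE = re.compile(
    r"(?P<EMAIL>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<ALPHANUM>\w+)"
)

def tokenize(text: str) -> List[Token]:
    """Split text into tokens, tagging each with the pattern that matched."""
    return [Token(m.group(0), "<" + m.lastgroup + ">")
            for m in TOKEN_RE.finditer(text)]

def email_filter(tokens: List[Token]) -> List[Token]:
    """Keep only the tokens typed as email addresses."""
    return [t for t in tokens if t.type == "<EMAIL>"]

text = "Contact alice@example.com or bob@example.org for details."
print([t.text for t in email_filter(tokenize(text))])
# ['alice@example.com', 'bob@example.org']
```

The filter itself only inspects the token type, so the same shape would
work for URLs by matching a different type label.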

ruflin_2 · April 28, 2011, 1:07pm

@Clinton & Shay: Good to know. Then I will update the tests.

kimchy · April 28, 2011, 2:06pm

There isn't a built-in EmailFilter; the one in the test is there just to make sure the relevant token type is used.

Lukas_Vlcek1 · April 28, 2011, 2:52pm

Shay,

I think those filters could be useful, for example if I am interested in
emails or URLs but not in the text content in between. Just an idea for a
nice-to-have feature.

Lukas

kimchy · April 28, 2011, 8:37pm

Sounds good.
