How does highlighting actually work?

Hi all,

I have some specific requirements for highlighting. I need to search the
full content of an item for a phrase, and then show on which page the
phrase occurs. So I've created one field named text_content and per-page fields
named text_content_{page_number} (text_content_1, text_content_2, etc.).
An example query is:
{
  "highlight": {
    "fields": {
      "text_content_*": {}
    }
  },
  "query": {
    "match": {
      "text_content": "lorem"
    }
  },
  "size": 40
}

I've noticed that this query is fast, but only if I have a small number of
documents in the index. Querying for documents is always fast (<40ms), but
the time spent in the highlight phase grows as the number of documents in
the index grows.
I've started thinking that highlighting may be run on all matched documents,
before "size": 40 is applied. Is that correct? How can I speed
up such a case?

Regards,
Karol

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b8354eb3-3a75-4999-a180-6493240eb0cc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Highlighting is complex and more hacky than you'd imagine at first glance.
Each highlighter is different and we can't tell which one you are using
without seeing your mapping. For the plain highlighter the cost is roughly
proportional to the length of the highlighted field. So in your case it's
the cost to reanalyze every one of those pages.

You could return which page is matched pretty cheaply if you were willing
to write a plugin. Especially if you just wanted to know the first page or
something.

You could try using explain if you searched for text_content_*. That'd
tell you which field matched.
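
A rough sketch of the explain idea, using Python only to build the request
body and to read the response. It assumes that multi_match accepts the
text_content_* wildcard in its fields list and that matching leaves of the
_explanation tree carry a description of the form "weight(field:term in N)".
The helper names and the sample explanation are made up for illustration, so
check them against what your cluster actually returns.

```python
import json
import re


def build_explain_query(phrase, size=40):
    """Search the per-page fields directly and ask for score explanations."""
    return {
        "explain": True,
        "query": {
            "multi_match": {
                "query": phrase,
                "fields": ["text_content_*"],
            }
        },
        "size": size,
    }


# Assumed shape of a leaf description: "weight(text_content_3:lorem in 0) ..."
FIELD_RE = re.compile(r"weight\((text_content_\d+):")


def matched_pages(explanation):
    """Walk an _explanation tree and collect the page numbers that matched."""
    pages = set()
    stack = [explanation]
    while stack:
        node = stack.pop()
        found = FIELD_RE.search(node.get("description", ""))
        if found:
            # "text_content_3" -> 3
            pages.add(int(found.group(1).rsplit("_", 1)[1]))
        stack.extend(node.get("details") or [])
    return sorted(pages)


# Hypothetical, heavily trimmed explanation for a single hit.
sample = {
    "value": 0.5,
    "description": "max of:",
    "details": [
        {
            "value": 0.5,
            "description": "weight(text_content_3:lorem in 0) [PerFieldSimilarity], result of:",
            "details": [],
        },
        {
            "value": 0.2,
            "description": "weight(text_content_1:lorem in 0) [PerFieldSimilarity], result of:",
            "details": [],
        },
    ],
}

print(json.dumps(build_explain_query("lorem"), indent=2))
print(matched_pages(sample))
```

This only tells you which page fields matched; it returns no snippets, so it
is a replacement for highlighting only if the page number is all you need.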

Nik
On Jan 18, 2015 6:21 PM, "Karol Sikora" sicarrots@gmail.com wrote:


Thanks for your answer. I probably posted too few details; here is a better
description:

I'm using the postings highlighter, but I also checked plain and fvh; both
were remarkably slower in my case.
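
For anyone who wants to repeat that comparison, here is a minimal sketch. It
assumes the per-field "type" option of the highlight section, with the values
postings, plain and fvh as named in this thread. The function name is only
illustrative. Note that fvh additionally needs term_vector set to
with_positions_offsets in the mapping, which the dynamic template quoted in
this message does not set.

```python
import json


def build_highlight_query(phrase, highlighter_type, size=40):
    """Same query as in the first message, with an explicit highlighter type."""
    return {
        "highlight": {
            "fields": {
                "text_content_*": {"type": highlighter_type},
            }
        },
        "query": {"match": {"text_content": phrase}},
        "size": size,
    }


# One request body per highlighter, ready to be sent and timed separately.
for highlighter_type in ("postings", "plain", "fvh"):
    print(json.dumps(build_highlight_query("lorem", highlighter_type)))
```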
The text_content* fields are mapped through a dynamic template:
"dynamic_templates": [
  {
    "text_content": {
      "match": "text_content*",
      "match_mapping_type": "string",
      "mapping": {
        "type": "string",
        "analyzer": "polish",
        "index_options": "offsets"
      }
    }
  }
]

The polish analyzer is defined as follows (using the
monterail/elasticsearch-analysis-morfologik plugin from GitHub, the Morfologik
(Polish) Analysis Plugin for ElasticSearch, which provides the
morfologik_stem token filter):
"analyzer": {
  "polish": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [
      "lowercase",
      "morfologik_stem"
    ]
  }
}
Pure querying on text_content is always very fast; it takes <40ms.
The total time to execute the request increases as the number of
matched documents grows (as more items are added to the index).
So I've started thinking that the highlighter runs on all matched
documents, not only on the items requested by the current request (the from
and size parameters). Is that correct? Is there some way to speed up such a
case (forcing highlighting only within the requested window of documents)?

Karol

2015-01-19 1:31 GMT+01:00 Nikolas Everett nik9000@gmail.com:
