Howto: Access Character Offset of term in string field

Dear All,

I just started to play with Elasticsearch and must say the usability is
quite nice. However, I really need a feature that does not seem trivial to
access. Given my mapping:

curl -XGET 'http://localhost:9200/articles/'
curl -XPUT 'http://localhost:9200/articles/article/_mapping' -d '{
  "articles" : {
    "properties" : {
      "source_type" : {"type": "string", "store": "yes"},
      "source_id" : {"type": "string", "store": "yes"},
      "body" : {"type": "string", "store": "yes", "term_vector": "with_positions_offsets"},
      "numbers": {"type": "boolean"}
    }
  }
}'

The details of this mapping are not important; what matters is that I set
"term_vector": "with_positions_offsets" on the body field. I would like to
access the explicit offsets of search terms in a string field to calculate
a score. Please also see the related question
http://stackoverflow.com/questions/15072806/elasticsearch-get-offsets-of-highlighted-snippets .
I think marking the raw text with special tokens as pre_tag/post_tag and
then determining the positions on the client is rather backwards if all I
need is the explicit offset. Ideally, one could access it in a script_field
evaluation.
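
For comparison, the client-side workaround described above could look
roughly like the following sketch. It assumes the highlighter returns the
whole field as a single fragment and that the marker strings never occur in
the raw text; the class name, the method name and the "@@S@@"/"@@E@@"
markers are made up for illustration and are not part of any Elasticsearch
API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: recovers character offsets from highlighted text. */
class HighlightOffsets {

    /** Start (inclusive) and end (exclusive) offsets in the untagged text. */
    static final class Span {
        public final int start;
        public final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /**
     * Scans the highlighted text for preTag/postTag pairs and maps each
     * tagged region back to offsets in the text without the tags.
     */
    static List<Span> offsets(String highlighted, String preTag, String postTag) {
        List<Span> spans = new ArrayList<>();
        int removed = 0; // tag characters seen so far
        int cursor = 0;  // where to continue searching
        while (true) {
            int open = highlighted.indexOf(preTag, cursor);
            if (open < 0) {
                break;
            }
            int close = highlighted.indexOf(postTag, open + preTag.length());
            if (close < 0) {
                break; // unbalanced tag: stop rather than guess
            }
            int start = open - removed;
            removed += preTag.length();
            int end = close - removed;
            removed += postTag.length();
            spans.add(new Span(start, end));
            cursor = close + postTag.length();
        }
        return spans;
    }

    public static void main(String[] args) {
        // Untagged text: "The quick fox is quick."
        String highlighted = "The @@S@@quick@@E@@ fox is @@S@@quick@@E@@.";
        for (Span s : offsets(highlighted, "@@S@@", "@@E@@")) {
            System.out.println(s); // prints [4,9) and then [17,22)
        }
    }
}
```

This works, but it ships the whole field to the client and re-parses it,
which is exactly the detour I would like to avoid.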

Please tell me if I am missing something obvious. If this is something
Elasticsearch users routinely hack in themselves, could someone point me in
the right direction?

Best,
Max.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Dear All,

I made some progress here by writing a plugin. I found [1] and [2] very
useful for this. I would like to share it once I have cleaned it up a bit.
My only remaining problem is that the offsets from newly indexed documents
do not show up until I restart Elasticsearch (i.e. reopen all readers).
Since I have only little experience with Elasticsearch: could someone
please describe whether it is possible to get real-time or near-real-time
termvector/offset information in the plugin scope in a similarly automatic
way as it works for search/get?

Best,
Max

[1] http://jfarrell.github.io/
[2] http://jprante.github.io/lessons/2012/03/27/Writing-a-simple-plugin-for-Elasticsearch.html

On Friday, 10 May 2013 at 16:12:27 UTC+2, Max Hoffmann wrote:

> I would like to access the explicit offsets of search terms in a string
> field to calculate a score. [...] Ideally one could access it in a
> script_field evaluation.

My blog post is very old :-( so it is no longer up to date. I assume you are
on ES 0.90 (Lucene 4), so have you tried to access the reader with
getLiveDocs()?

http://lucene.apache.org/core/4_0_0-BETA/MIGRATE.html

--> DocsEnum docsEnum = reader.termDocsEnum(reader.getLiveDocs(), field, text, needsFreqs);

Likewise for DocsAndPositionsEnum.

Jörg

On 29 May 2013 at 00:09, Max Hoffmann wrote:

> Could someone please describe whether it is possible to get real-time
> or near-real-time termvector/offset information in the plugin scope in
> a similarly automatic way as it works for search/get?

On Wednesday, 29 May 2013 at 00:28:45 UTC+2, Jörg Prante wrote:

> My blog post is very old :-( so no longer up to date.

Well, it worked for me so far :-). Thanks.

> I assume you are on ES 0.90 (Lucene 4), so have you tried to access the
> reader with getLiveDocs()?
>
> http://lucene.apache.org/core/4_0_0-BETA/MIGRATE.html
>
> --> DocsEnum docsEnum = reader.termDocsEnum(reader.getLiveDocs(), field, text, needsFreqs);

As far as I understand the documentation, reader.getLiveDocs() only helps
to filter deleted documents out of a reader; it does not add newly indexed
documents to an existing reader. Also, it seems that Elasticsearch 0.90
uses Lucene 4.2, where the interface for termDocsEnum has changed [1].

Regardless, what I am mostly wondering right now is whether it is possible
to reopen a reader (if the index changed) from the environment in which the
Transport*Action.java runs.
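
To make the distinction concrete, here is a toy model in plain Java. It is
not Lucene or Elasticsearch code, and ToyIndex, ToyReader and their methods
are invented names. It only illustrates the assumption behind my question:
a reader is a point-in-time snapshot, its live-docs bitset can hide deleted
documents inside that snapshot, and documents indexed later only become
visible through a newly opened reader.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model (not Lucene code) of point-in-time readers. */
class SnapshotDemo {

    /** Hypothetical stand-in for an index that keeps accepting writes. */
    static final class ToyIndex {
        private final List<String> docs = new ArrayList<>();
        private final BitSet deleted = new BitSet();
        private long version = 0;

        void add(String doc) {
            docs.add(doc);
            version++;
        }

        void delete(int docId) {
            deleted.set(docId);
            version++;
        }

        /** Opens a reader that sees the index exactly as it is right now. */
        ToyReader open() {
            BitSet live = new BitSet(docs.size());
            live.set(0, docs.size()); // every doc in the snapshot ...
            live.andNot(deleted);     // ... minus the deleted ones
            return new ToyReader(new ArrayList<>(docs), live, version);
        }

        /**
         * Returns a fresh reader if the index changed since old was opened.
         * Lucene's DirectoryReader.openIfChanged returns null when nothing
         * changed; this toy returns the old reader instead, for brevity.
         */
        ToyReader openIfChanged(ToyReader old) {
            return old.version == version ? old : open();
        }
    }

    /** Immutable snapshot: neither its docs nor its liveDocs ever change. */
    static final class ToyReader {
        final List<String> docs;
        final BitSet liveDocs;
        final long version;

        ToyReader(List<String> docs, BitSet liveDocs, long version) {
            this.docs = docs;
            this.liveDocs = liveDocs;
            this.version = version;
        }

        int maxDoc() {
            return docs.size();
        }

        int numLiveDocs() {
            return liveDocs.cardinality();
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("first");
        index.add("second");
        index.delete(0);

        ToyReader stale = index.open();
        index.add("third"); // indexed after the reader was opened

        // liveDocs only masks deletions among the two docs in the snapshot.
        System.out.println(stale.maxDoc() + " " + stale.numLiveDocs()); // prints 2 1

        ToyReader fresh = index.openIfChanged(stale);
        System.out.println(fresh.maxDoc() + " " + fresh.numLiveDocs()); // prints 3 2
    }
}
```

What I am missing is the plugin-side equivalent of openIfChanged, or a way
to borrow the reader that search/get already keep fresh.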

Best,
Max.

[1] http://lucene.apache.org/core/4_2_0/core/index.html
