Term_vector access

Hi All,

As previously posted I am trying to augment elasticsearch to fit with some
research being conducted at a University. That research is based around
dynamic schema search. The way it operates is as follows:

  1. Submit regular TFIDF search to es.
  2. Get results from es. Results need to include the TermVector for each
    document, the total number of documents in the index and the term
    frequencies across the index.
  3. Build the schema for the search and submit that schema as a flat list
    of words back to es.
  4. Get the results from ES, score them and rank them.
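
As a rough illustration of the scoring in step 4, here is a minimal sketch of how the term vector, the index-wide term frequencies and the total document count might be combined. It assumes the classic tf * ln(N / df) weighting, which is not necessarily what Lucene's similarity or the research itself uses, and the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

final class TfIdfSketch {

    /**
     * Weighs each term of one document as tf * ln(totalDocs / df).
     * termFreqs: term -> frequency inside the document (from its term vector).
     * docFreqs:  term -> number of documents in the index containing the term.
     */
    static Map<String, Double> weigh(Map<String, Integer> termFreqs,
                                     Map<String, Integer> docFreqs,
                                     int totalDocs) {
        Map<String, Double> weights = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> entry : termFreqs.entrySet()) {
            Integer df = docFreqs.get(entry.getKey());
            if (df == null || df == 0) {
                continue; // no index-wide statistics for this term, so skip it
            }
            double idf = Math.log((double) totalDocs / df);
            weights.put(entry.getKey(), entry.getValue() * idf);
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Integer> termFreqs = new HashMap<String, Integer>();
        termFreqs.put("tank", 2);
        Map<String, Integer> docFreqs = new HashMap<String, Integer>();
        docFreqs.put("tank", 10);
        System.out.println(weigh(termFreqs, docFreqs, 100));
    }
}
```

Terms with no index-wide statistics are skipped rather than given an arbitrary weight; that is a design choice of this sketch, not something the workflow above dictates.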

We decided that the easiest way to proceed was to work at the highest level
possible, which means using the REST interface to conduct our research. For
this to work I need to augment the REST interface for search to return the
term vector for each hit. I also need to augment the REST interface for
indices to return the term frequencies. Right now I am having trouble with
the term vectors.

For the time being, I am assuming that all queries will be conducted against
the _all field. To make sure that every _all field stores its term
vectors, I override the default mapping as follows:

{
  "_default_" : {
    "_all": {"enabled": true, "term_vector": "with_positions_offsets"}
  }
}

I made it so that the term list for a hit is only returned if the search
contains "terms": true. To do this, I copied the design in
org.elasticsearch.search.fetch.explain: I created my own terms package and
copied the classes across from explain. The hitExecute method of my
TermsFetchSubPhase class does the following:

hitContext.hit().terms(hitContext.reader().getTermFreqVector(hitContext.hit().docId(),
"_all"));

The problem is that this line throws an exception. The exception is:

Caused by: java.io.EOFException: read past EOF: NIOFSIndexInput(path="/Users/****/Workspace/java/**/elasticsearch/data/elasticsearch/nodes/0/indices/articles/0/index/_a.tvx")
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:264)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:40)
    at org.apache.lucene.store.DataInput.readInt(DataInput.java:86)
    at org.apache.lucene.store.BufferedIndexInput.readInt(BufferedIndexInput.java:179)
    at org.apache.lucene.store.DataInput.readLong(DataInput.java:130)
    at org.apache.lucene.store.BufferedIndexInput.readLong(BufferedIndexInput.java:192)
    at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:227)
    at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:281)
    at org.apache.lucene.index.SegmentReader.getTermFreqVector(SegmentReader.java:747)
    at org.elasticsearch.search.fetch.terms.TermsFetchSubPhase.hitExecute(TermsFetchSubPhase.java:61)
    ... 8 more
(I have masked parts of the path with asterisks.)

The weird thing is at one point this was working briefly (after a re-index)
but then it stopped working. I have deleted the index I was working with
and re-indexed it multiple times. The mapping for the index is:

war: {
  properties: {
    text: {
      "type": "string"
    }
    file: {
      "type": "string"
    }
  }
  "_all": {
    "term_vector": "with_positions_offsets"
  }
}

That is a copy and paste from the head plugin. Any ideas what is going on
here? Am I doing the right thing to make sure the term vectors are stored?
Am I trying to access them the wrong way? The one time I got it to work,
when I looked at hitContext.doc() in a debugger, each document had one
field which was "_all". When I look now, each document has two fields which
are "text" and "file" but no "_all".

Cheers

--

Just to correct myself slightly: when it doesn't work, each document does have
two fields, but they aren't "text" and "file"; they are "_source" and "_uid",
as below.

[stored,binary,omitNorms<_source:[B@6e843edc>, stored,indexed,tokenized,omitNorms<_uid:war#NDhOAdRwQX2T8i5R0OMtkg>]

Cheers

On Tuesday, 6 November 2012 14:53:02 UTC+10, Ryan Stuart wrote:

--

Looking deeper into it, it seems the call to getTermFreqVector works
sometimes but not others. For example, it works with this document:

{
  _index: articles
  _type: war
  _id: C5rciA4LQkmY-AQd_qH5RQ
  _score: 2.6162434
  _source: {
    text: How many gears does a French tank have?
    file: All_Ordered_Reports.txt
  }
}

But fails with this one:

{
  _index: articles
  _type: war
  _id: VwxcmVABTaqxtsr8RHJ9pw
  _score: 2.287412
  _source: {
    text: A Republican Guard tank brigade also arrived from Fallujah, west of the city.
    file: All_Ordered_Reports.txt
  }
}

Quickly running out of ideas here. My query is:

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_field": "_all",
            "query": "tank"
          }
        }
      ],
      "must_not": [],
      "should": []
    }
  },
  "from": 0,
  "size": 50,
  "sort": [],
  "facets": {}
}

And my mapping file now is:

{
  "war": {
    "_all": {"enabled": true, "term_vector": "with_positions", "store": true},
    "properties": {
      "text": {
        "type": "string",
        "term_vector": "with_positions",
        "store": true
      },
      "file": {
        "type": "string",
        "index": "no",
        "include_in_all": false
      }
    }
  }
}

Cheers

On Tuesday, 6 November 2012 15:37:40 UTC+10, Ryan Stuart wrote:

--

Hi Ryan,

I've scratched my head a little at this error. I suspect the problem may be
in or around the code you're using to access the term vectors. Are you
able to provide a little more code?

On Tuesday, November 6, 2012 5:00:09 PM UTC+11, Ryan Stuart wrote:

--

Thanks for getting back to me Chris. Thought I was going to be on my own. I
have added two files under the package org.elasticsearch.search.fetch.terms
called TermsFetchSubPhase.java and TermsParseElement.java. They are literally
just copies of ExplainFetchSubPhase.java
(https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/search/fetch/explain/ExplainFetchSubPhase.java)
and ExplainParseElement.java
(https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/search/fetch/explain/ExplainParseElement.java)
respectively, from org.elasticsearch.search.fetch.explain.

In TermsParseElement.java, all I have changed (besides the name) is line
35. I added a terms method to SearchContext and I call that instead of
explain. This is obviously working because adding a "terms":true to a query
triggers the TermsFetchSubPhase class. In that class, all I really changed
besides replacing string occurrences of "explain" with "terms" on lines 40 &
63 was the line that does the actual work, line 61. I changed it to:

hitContext.hit().terms(hitContext.reader().getTermFreqVector(hitContext.hit().docId(),
"_all"));

That is, it sets the terms (I added the setter) on the InternalSearchHit
instance by fetching the term frequency vector for the field "_all" from
the IndexReader. I would have thought it was pretty straightforward, but
obviously not. The exception happens on this line, but only for some
documents (obviously one exception is enough to cause the query to fail).
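
One way to narrow this down might be to probe each hit separately and record which doc ids fail, rather than letting the first exception abort the whole query. A minimal sketch, assuming a hypothetical TermVectorSource interface that stands in for the reader.getTermFreqVector(docId, "_all") call; none of these names are Elasticsearch or Lucene API:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class PerHitGuardSketch {

    /** Hypothetical stand-in for reader.getTermFreqVector(docId, "_all"). */
    interface TermVectorSource {
        String[] termsFor(int docId) throws IOException;
    }

    /** Probes every doc id and returns those whose term vector could not be read. */
    static List<Integer> findUnreadable(TermVectorSource source, int[] docIds) {
        List<Integer> failed = new ArrayList<Integer>();
        for (int docId : docIds) {
            try {
                source.termsFor(docId);
            } catch (IOException e) {
                // EOFException is an IOException, so "read past EOF" lands here
                failed.add(docId);
            }
        }
        return failed;
    }
}
```

Comparing the failing ids against the segments they live in could show whether the problem is tied to particular documents or to a particular segment file.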

As I've said previously I have ensured the term vector is stored for the
_all field by using the following mapping:

war: {
  properties: {
    text: {
      "type": "string"
    }
    file: {
      "type": "string"
    }
  }
  "_all": {
    "term_vector": "with_positions_offsets"
  }
}

And my query is as follows:

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_field": "_all",
            "query": "tank"
          }
        }
      ],
      "must_not": [],
      "should": []
    }
  },
  "from": 0,
  "size": 50,
  "sort": [],
  "facets": {},
  "terms": true
}

I could only think of two possible causes of the error. The first came from
reading this mailing list thread:
http://www.gossamer-threads.com/lists/lucene/java-user/42071?do=post_view_threaded#42071
To guard against that problem (which I didn't think could possibly be the
problem given the way the hitExecute method is called), I changed my config
to have only 1 shard and no replicas before re-indexing my documents, and the
problem was still occurring. My next best guess is that the way I am indexing
the documents is causing the term vectors not to be stored. I index in bulk
using the following code:

// Queue one index request per JSON document, then send them as a single bulk call.
BulkRequestBuilder bulkRequest = client.prepareBulk();
for (XContentBuilder json : jsons) {
    bulkRequest.add(client.prepareIndex(name, type).setSource(json));
}
BulkResponse bulkResponse = bulkRequest.execute().actionGet();

I can't see why that would be an issue. If it is causing an issue then I
guess it would be a bug. Are there any tools I can use to read my index
files and verify that the term vectors are being stored? In the interest of
completeness I have (tried to at least) attached TermsFetchSubPhase.java.
If that isn't enough, let me know and I can push my whole repo to Git, plus
the documents I am using for testing and the indexing code.

Thanks for your help.

Cheers

On Wed, Nov 7, 2012 at 9:39 PM, Chris Male gento0nz@gmail.com wrote:

--

--
Ryan Stuart, B.Eng
Software Engineer

ABN: 81-206-082-133
E: ryan@stuart.id.au
M: +61-431-299-036

--

Is anyone able to point me in the direction of the code that actually does
the storing of the term vectors for a document?

Cheers

On Wed, Nov 7, 2012 at 10:06 PM, Ryan Stuart ryan@stuart.id.au wrote:

Thanks for getting back to me Chris. Thought I was going to be on my own.
I have added two files under the package
org.elasticsearch.search.fetch.terms called TermsFetchSubPhase.java & TermsParseElement.java.
They are literally just copies of ExplainFetchSubPhase.javahttps://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/search/fetch/explain/ExplainFetchSubPhase.java&
ExplainParseElement.javahttps://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/search/fetch/explain/ExplainParseElement.java respectively
from org.elasticsearch.search.fetch.explain.

In TermsParseElement.java, all I have changed (besides the name) is line
35. I added a terms method to SearchContext and I call that instead of
explain. This is obviously working because adding a "terms":true to a query
triggers the TermsFetchSubPhase class. In that class, all I really changed
besides replacing string occurrences of "explain" to "terms" on lines 40 &
63 was the line that does the actual work, line 61. I changed it to:

hitContext.hit().terms(hitContext.reader().getTermFreqVector(hitContext.hit().docId(),
"_all"));

That is, set the terms (I added the setter) on the InternalSearchHit
instance by fetching the term frequency vector for the field "_all" from
the IndexReader. I would of though it was pretty straight forward but
obviously not. The exception happens on this line, but only on some
documents (obviously 1 exception is enough to cause the query to fail).

As I've said previously I have ensured the term vector is stored for the
_all field by using the following mapping:

war: {
properties: {
text: {
"type": "string"
}
file: {
"type": "string"
}
}
"_all": {
"term_vector": "with_positions_offsets"
}
}

And my query is as follows:

{
"query": {
"bool": {
"must": [
{
"query_string": {
"default_field": "_all",
"query": "tank"
}
}
],
"must_not": ,
"should":
}
},
"from": 0,
"size": 50,
"sort": ,
"facets": {},
"terms": true
}

I could only think of two possible causes of error. The first was after
reading thishttp://www.gossamer-threads.com/lists/lucene/java-user/42071?do=post_view_threaded#42071mailing list thread. To guard against that problem (which I didn't think
could possibly be the problem given the way the hitExecute method is called),
I changed my config to have only 1 shard and no replicas before re-indexing
my documents and the problem was still occurring. My next best guess is
that the way I am indexing the documents is causing the term vectors not to
be stored. I index in bulk using the following code:

BulkRequestBuilder bulkRequest = client.prepareBulk();
for (XContentBuilder json : jsons) {
bulkRequest.add(client.prepareIndex(name, type).setSource(json));
}
BulkResponse bulkResponse = bulkRequest.execute().actionGet();

I can't see why that would be an issue. If it is causing an issue then I
guess it would be a bug. Is there any tools I can use to read my index
files and verify that the term vectors are being stored? In the interest of
completeness I have (tried to at least) attached TermsFetchSubPhase.java.
If that isn't enough let me know and I can push my whole repo to Git plus
documents I am using for testing and indexing code.

Thanks for your help.

Cheers

On Wed, Nov 7, 2012 at 9:39 PM, Chris Male gento0nz@gmail.com wrote:

Hi Ryan,

I've scratched my head a little at this error. I suspect the problem
maybe in and around the code you're using to access the term vectors. Are
you able to provide a little more code?

On Tuesday, November 6, 2012 5:00:09 PM UTC+11, Ryan Stuart wrote:

Looking deeper into it, it seems the call to getTermFeqVector works
sometimes but not others. For example, it works with this document:

{

  • _index: articles
  • _type: war
  • _id: C5rciA4LQkmY-AQd_qH5RQ
  • _score: 2.6162434
  • _source: {
    • text: How many gears does a French tank have?
    • file: All_Ordered_Reports.txt
      }

}

But fails with this one:

{

  • _index: articles
  • _type: war
  • _id: VwxcmVABTaqxtsr8RHJ9pw
  • _score: 2.287412
  • _source: {
    • text: A Republican Guard tank brigade also arrived from
      Fallujah, west of the city.
    • file: All_Ordered_Reports.txt
      }

}

Quickly running out of ideas here. My query is:

{
"query": {
"bool": {
"must": [
{
"query_string": {
"default_field": "_all",
"query": "tank"
}
}
],
"must_not": ,
"should":
}
},
"from": 0,
"size": 50,
"sort": ,
"facets": {}
}

And my mapping file now is:

{
"war":{
"_all": {"enabled":true, "term_vector":"with_positions",
"store":true},
"properties":{
"text":{
"type":"string",
"term_vector":"with_positions"
,
"store": true
},
"file":{
"type":"string",
"index":"no",
"include_in_all":false
}
}
}
}

Cheers

On Tuesday, 6 November 2012 15:37:40 UTC+10, Ryan Stuart wrote:

Just to correct myself slightly: when it doesn't work, it does have two
fields, but they aren't "text" and "file". They are "_source" and "_uid", as
below.

[stored,binary,omitNorms<_source:[B@6e843edc>, stored,indexed,tokenized,omitNorms<_uid:war#NDhOAdRwQX2T8i5R0OMtkg>]

Cheers

On Tuesday, 6 November 2012 14:53:02 UTC+10, Ryan Stuart wrote:

Hi All,

As previously posted I am trying to augment elasticsearch to fit with
some research being conducted at a University. That research is based
around dynamic schema search. The way it operates is as follows:

  1. Submit regular TFIDF search to es.
  2. Get results from es. Results need to include the TermVector for
    each document, the total number of documents in the index and the term
    frequencies across the index.
  3. Build the schema for the search and submit that schema as a
    flat list of words back to es.
  4. Get the results from ES, score them and rank them.
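
For readers following along, here is a minimal sketch of how the three pieces of data from step 2 could be combined on the client side. It assumes plain textbook TF-IDF (tf * ln(N / df)); the thread does not say which weighting the research actually uses. TfIdfSketch, weigh and the map-based inputs are hypothetical names for illustration, not elasticsearch or Lucene APIs, and "term frequencies across the index" is read here as per-term document counts, which is also an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client-side helper: not part of elasticsearch or Lucene.
final class TfIdfSketch {

    // Weighs each term of one document's term vector as tf * ln(N / df).
    // termVector: term -> occurrences in this document
    // docFreq:    term -> number of documents in the index containing the term (assumed meaning)
    // numDocs:    total number of documents in the index
    static Map<String, Double> weigh(Map<String, Integer> termVector,
                                     Map<String, Integer> docFreq,
                                     int numDocs) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : termVector.entrySet()) {
            Integer df = docFreq.get(entry.getKey());
            if (df == null || df == 0) {
                continue; // no index-wide statistics for this term, skip it
            }
            double idf = Math.log((double) numDocs / df);
            weights.put(entry.getKey(), entry.getValue() * idf);
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Integer> termVector = new LinkedHashMap<>();
        termVector.put("tank", 2);
        termVector.put("brigade", 1);

        Map<String, Integer> docFreq = new LinkedHashMap<>();
        docFreq.put("tank", 10);
        docFreq.put("brigade", 100);

        // A term found in every document gets weight 0; rarer terms score higher.
        System.out.println(weigh(termVector, docFreq, 100));
    }
}
```

The highest-weighted terms of the top hits would then be candidates for the flat word list in step 3.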

It was decided that the easiest way to proceed was to work at the highest
level possible. That means we are using the REST interface to conduct our
research. For this to work I need to augment the REST interface for search
to return the term vector for each hit. I also need to augment the REST
interface for indices to return the term frequencies. Right now I am having
trouble with the term vectors.

For the time being, I am assuming that all queries will be conducted
against the _all field. To make sure that all the _all fields store their
term vectors, I override the default mapping as follows:

{
  "default": {
    "_all": {"enabled": true, "term_vector": "with_positions_offsets"}
  }
}

I made it so that the term list for a hit is only returned if the search
contains "terms": true. To do this, I copied the design in
org.elasticsearch.search.fetch.explain. I created my own terms package and
copied the classes across from explain. The hitExecute method of my
TermsFetchSubPhase class does the following:

hitContext.hit().terms(hitContext.reader().getTermFreqVector(hitContext.hit().docId(), "_all"));

The problem is that this line throws an exception. The exception is:

Caused by: java.io.EOFException: read past EOF: NIOFSIndexInput(path="/Users/****/Workspace/java/**/elasticsearch/data/elasticsearch/nodes/0/indices/articles/0/index/_a.tvx")
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:264)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:40)
at org.apache.lucene.store.DataInput.readInt(DataInput.java:86)
at org.apache.lucene.store.BufferedIndexInput.readInt(BufferedIndexInput.java:179)
at org.apache.lucene.store.DataInput.readLong(DataInput.java:130)
at org.apache.lucene.store.BufferedIndexInput.readLong(BufferedIndexInput.java:192)
at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:227)
at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:281)
at org.apache.lucene.index.SegmentReader.getTermFreqVector(SegmentReader.java:747)
at org.elasticsearch.search.fetch.terms.TermsFetchSubPhase.hitExecute(TermsFetchSubPhase.java:61)
... 8 more
(I have masked parts of the path with ***s.)

The weird thing is at one point this was working briefly (after a
re-index) but then it stopped working. I have deleted the index I was
working with and re-indexed it multiple times. The mapping for the index is:

war: {
  properties: {
    text: {
      "type": "string"
    },
    file: {
      "type": "string"
    }
  },
  "_all": {
    "term_vector": "with_positions_offsets"
  }
}

That is a copy and paste from the head plugin. Any idea what is going
on here? Am I doing the right thing to make sure the term vectors are
stored? Am I trying to access them the wrong way? The one time I got it to
work, when I looked at hitContext.doc() in a debugger, each document had
one field, which was "_all". When I look now, each document has two fields,
which are "text" and "file", but no "_all".

Cheers

--
Ryan Stuart, B.Eng
Software Engineer
--

Hey Ryan,

sorry for the late reply.
I think you need to replace

hitContext.hit().terms(hitContext.reader().getTermFreqVector(hitContext.hit().docId(),
"_all"));
with:

hitContext.hit().terms(hitContext.reader().getTermFreqVector(hitContext.docId(),
"_all"));

What you are doing is taking the leaf reader (a segment reader that holds
a subset of the documents) and trying to fetch a TermVector from it with a
top-level ID. You need to use the ID from the hit context instead. This
should make your problem go away!

simon
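
To illustrate the point above, here is a toy sketch of why a top-level document ID fails against a segment reader for some documents and not others. It is not Lucene or elasticsearch code: DocIdDemo, ToySegment and lookup are hypothetical names, and arrays stand in for the per-segment term vector files. The only real fact it relies on is that a top-level ID equals the segment's starting offset (its doc base) plus the segment-local ID.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical classes, not Lucene or elasticsearch types.
final class DocIdDemo {

    // Stands in for a segment reader holding a subset of the documents.
    static final class ToySegment {
        final int docBase;          // top-level ID of this segment's first document
        final String[] termVectors; // indexed by segment-local document ID

        ToySegment(int docBase, String... termVectors) {
            this.docBase = docBase;
            this.termVectors = termVectors;
        }

        // Expects a segment-local ID, as a segment reader does.
        String termVector(int segmentDocId) {
            return termVectors[segmentDocId];
        }
    }

    // Translates a top-level ID to the owning segment and its local ID.
    static String lookup(List<ToySegment> segments, int topLevelDocId) {
        for (ToySegment segment : segments) {
            int local = topLevelDocId - segment.docBase;
            if (local >= 0 && local < segment.termVectors.length) {
                return segment.termVector(local);
            }
        }
        throw new IllegalArgumentException("no segment holds doc " + topLevelDocId);
    }

    public static void main(String[] args) {
        List<ToySegment> segments = Arrays.asList(
                new ToySegment(0, "gears french tank", "tank crew"),
                new ToySegment(2, "republican guard tank brigade"));

        // In the first segment the doc base is 0, so top-level and local IDs agree.
        System.out.println(segments.get(0).termVector(1));

        // For later segments the top-level ID must be translated first.
        System.out.println(lookup(segments, 2));

        // Handing the top-level ID straight to the segment reads past its end.
        try {
            segments.get(1).termVector(2);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("read past end of segment: " + e.getMessage());
        }
    }
}
```

In this model the untranslated ID only works for documents in a segment whose doc base is 0, which would be consistent with the call succeeding for some hits and reading past EOF for others.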

On Thursday, November 8, 2012 7:05:33 AM UTC+1, Ryan Stuart wrote:

Is anyone able to point me in the direction of the code that actually does
the storing of the term vectors for a document?

Cheers

--

Ah, that fixes it straight away. Nice to know I was on the right track.
Thanks for your help.

On a side note, any idea if a change like this would be accepted upstream?

Cheers

--
Ryan Stuart, B.Eng
Software Engineer

ABN: 81-206-082-133
E: ryan@stuart.id.au
M: +61-431-299-036

--

Hey,

On Thursday, November 8, 2012 11:33:21 AM UTC+1, Ryan Stuart wrote:

Ah, that fixes it straight away. Nice to know I was on the right track.
Thanks for your help.

cool!

On a side note, any idea if a change like this would be accepted upstream?

I'd need to have a closer look but with a dedicated API I could imagine...

simon

--