Thanks for getting back to me, Chris. Thought I was going to be on my own. I
have added two files under the package org.elasticsearch.search.fetch.terms
called TermsFetchSubPhase.java and TermsParseElement.java. They are literally
just copies of ExplainFetchSubPhase.java
(https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/search/fetch/explain/ExplainFetchSubPhase.java)
and ExplainParseElement.java
(https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/search/fetch/explain/ExplainParseElement.java)
respectively, from org.elasticsearch.search.fetch.explain.
In TermsParseElement.java, all I have changed (besides the name) is line 35:
I added a terms method to SearchContext and I call that instead of explain.
This is obviously working, because adding "terms": true to a query triggers
the TermsFetchSubPhase class. In that class, besides replacing occurrences of
"explain" with "terms" on lines 40 and 63, all I really changed was the line
that does the actual work, line 61. I changed it to:
hitContext.hit().terms(hitContext.reader().getTermFreqVector(hitContext.hit().docId(), "_all"));
That is, it sets the terms (I added the setter) on the InternalSearchHit
instance by fetching the term frequency vector for the "_all" field from the
IndexReader. I would have thought it was pretty straightforward, but
obviously not. The exception happens on this line, but only for some
documents (and obviously one exception is enough to cause the whole query to
fail).
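For reference, the two changed methods boil down to something like the
following sketch (the attached TermsFetchSubPhase.java is the authoritative
version; the exception wrapping is just assumed to mirror the explain sub
phase, and imports/class scaffolding are omitted):

// In TermsParseElement (copy of ExplainParseElement):
@Override
public void parse(XContentParser parser, SearchContext context) throws Exception {
    XContentParser.Token token = parser.currentToken();
    if (token.isValue()) {
        // Line 35: call the terms(boolean) method I added to SearchContext
        // instead of explain(boolean).
        context.terms(parser.booleanValue());
    }
}

// In TermsFetchSubPhase (copy of ExplainFetchSubPhase); line 61 is the
// getTermFreqVector call that fails for some documents.
@Override
public void hitExecute(SearchContext context, HitContext hitContext) throws ElasticSearchException {
    try {
        // Fetch the stored _all term vector and set it on the hit via the new setter.
        hitContext.hit().terms(
                hitContext.reader().getTermFreqVector(hitContext.hit().docId(), "_all"));
    } catch (IOException e) {
        // Assumed to mirror the explain sub phase's error handling.
        throw new FetchPhaseExecutionException(context, "Failed to fetch terms for ["
                + hitContext.hit().type() + "#" + hitContext.hit().id() + "]", e);
    }
}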
As I've said previously I have ensured the term vector is stored for the
_all field by using the following mapping:
war: {
    properties: {
        text: {
            "type": "string"
        },
        file: {
            "type": "string"
        }
    },
    "_all": {
        "term_vector": "with_positions_offsets"
    }
}
And my query is as follows:
{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "default_field": "_all",
                        "query": "tank"
                    }
                }
            ],
            "must_not": [],
            "should": []
        }
    },
    "from": 0,
    "size": 50,
    "sort": [],
    "facets": {},
    "terms": true
}
I could only think of two possible causes of the error. The first came after
reading this mailing list thread:
http://www.gossamer-threads.com/lists/lucene/java-user/42071?do=post_view_threaded#42071.
To guard against that problem (which I didn't think could possibly be the
cause, given the way the hitExecute method is called), I changed my config to
have only 1 shard and no replicas before re-indexing my documents, and the
problem still occurred. My next best guess is that the way I am indexing the
documents is causing the term vectors not to be stored. I index in bulk using
the following code:
BulkRequestBuilder bulkRequest = client.prepareBulk();
for (XContentBuilder json : jsons) {
    bulkRequest.add(client.prepareIndex(name, type).setSource(json));
}
BulkResponse bulkResponse = bulkRequest.execute().actionGet();
I can't see why that would be an issue, and if it is, then I guess it would
be a bug. Are there any tools I can use to read my index files and verify
that the term vectors are being stored? In the interest of completeness I
have (tried to at least) attached TermsFetchSubPhase.java. If that isn't
enough, let me know and I can push my whole repo to Git, plus the documents I
am using for testing and my indexing code.
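If there's no such tool handy, I guess I can also drop down to the raw Lucene
API and check each document directly. A minimal sketch of the kind of check I
have in mind (assuming Lucene 3.x on the classpath; the shard path is just a
placeholder modelled on the path from the stack trace in my earlier email):

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.FSDirectory;

public class CheckTermVectors {
    public static void main(String[] args) throws Exception {
        // Point this at a single shard's Lucene index directory (placeholder path).
        File shardIndex = new File("data/elasticsearch/nodes/0/indices/articles/0/index");
        IndexReader reader = IndexReader.open(FSDirectory.open(shardIndex));
        try {
            for (int docId = 0; docId < reader.maxDoc(); docId++) {
                if (reader.isDeleted(docId)) {
                    continue;
                }
                TermFreqVector vector = reader.getTermFreqVector(docId, "_all");
                // null means no term vector was stored for _all on this document.
                System.out.println("doc " + docId + ": "
                        + (vector == null ? "no _all term vector" : vector.size() + " terms"));
            }
        } finally {
            reader.close();
        }
    }
}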
Thanks for your help.
Cheers
On Wed, Nov 7, 2012 at 9:39 PM, Chris Male gento0nz@gmail.com wrote:
Hi Ryan,
I've scratched my head a little at this error. I suspect the problem may be
in and around the code you're using to access the term vectors. Are you able
to provide a little more code?
On Tuesday, November 6, 2012 5:00:09 PM UTC+11, Ryan Stuart wrote:
Looking deeper into it, it seems the call to getTermFreqVector works
sometimes but not others. For example, it works with this document:
{
- _index: articles
- _type: war
- _id: C5rciA4LQkmY-AQd_qH5RQ
- _score: 2.6162434
- _source: {
- text: How many gears does a French tank have?
- file: All_Ordered_Reports.txt
}
}
But fails with this one:
{
- _index: articles
- _type: war
- _id: VwxcmVABTaqxtsr8RHJ9pw
- _score: 2.287412
- _source: {
- text: A Republican Guard tank brigade also arrived from
Fallujah, west of the city.
- file: All_Ordered_Reports.txt
}
}
Quickly running out of ideas here. My query is:
{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "default_field": "_all",
                        "query": "tank"
                    }
                }
            ],
            "must_not": [],
            "should": []
        }
    },
    "from": 0,
    "size": 50,
    "sort": [],
    "facets": {}
}
And my mapping file now is:
{
"war":{
"_all": {"enabled":true, "term_vector":"with_positions",
"store":true},
"properties":{
"text":{
"type":"string",
"term_vector":"with_positions",
"store": true
},
"file":{
"type":"string",
"index":"no",
"include_in_all":false
}
}
}
}
Cheers
On Tuesday, 6 November 2012 15:37:40 UTC+10, Ryan Stuart wrote:
Just to correct myself slightly, when it doesn't work, it does have two
fields by they aren't "text" and "file", they are "_source" and "_uid" as
below.
[stored,binary,omitNorms<_**source:[B@6e843edc>,
stored,indexed,tokenized,**omitNorms<_uid:war#**NDhOAdRwQX2T8i5R0OMtkg>]
Cheers
On Tuesday, 6 November 2012 14:53:02 UTC+10, Ryan Stuart wrote:
Hi All,
As previously posted, I am trying to augment elasticsearch to fit with some
research being conducted at a university. That research is based around
dynamic schema search. The way it operates is as follows:
- Submit regular TFIDF search to es.
- Get results from es. Results need to include the TermVector for
each document, the total number of documents in the index and the term
frequencies across the index.
- Build the schema for the search and submit that schema as a flat
list of words back to es.
- Get the results from ES, score them and rank them (see the rough sketch
after this list).
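To make that concrete, the two passes look roughly like this in Java-client
terms (we actually drive it over REST, and the index name, query and schema
words here are just placeholders):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class TwoPassSearch {
    public static void run(Client client) {
        // Pass 1: an ordinary TF-IDF query; each hit also needs to carry its term
        // vector, which is the part I am trying to add to the fetch phase.
        SearchResponse firstPass = client.prepareSearch("articles")
                .setQuery(QueryBuilders.queryString("tank").defaultField("_all"))
                .setSize(50)
                .execute().actionGet();

        // ...build the schema here from the returned term vectors plus the
        // index-wide document count and term frequencies...
        String[] schemaWords = {"tank", "brigade", "guard"};  // placeholder schema

        // Pass 2: submit the schema back as a flat list of words.
        BoolQueryBuilder schemaQuery = QueryBuilders.boolQuery();
        for (String word : schemaWords) {
            schemaQuery.should(QueryBuilders.termQuery("_all", word));
        }
        SearchResponse secondPass = client.prepareSearch("articles")
                .setQuery(schemaQuery)
                .setSize(50)
                .execute().actionGet();
        // Score and re-rank the hits in secondPass with the research model.
    }
}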
It was decided that the easiest way to proceed with this was to exist
at the highest level possible. That means we are using the REST interface
to try and conduct our research. For this to work I need to augment the
REST interface for search to return the term vector for each hit. I also
need to augment the REST interface for indices to return the term
frequencies. Right now I am having trouble with the term vectors.
For the time being, I am assuming that all queries will be conducted
against the _all field. To make sure that all the _all fields store their
term vectors, I override the default mapping as follows:
{
    "default" : {
        "all": {"enabled":true, "term_vector":"with_positions_offsets"}
    }
}
I made it so that the term list for a hit is only returned if the search
contains "terms": true. To do this, I copied the design in
org.elasticsearch.search.fetch.explain. I created my own terms package and
copied the classes across from explain. The hitExecute of my
TermsFetchSubPhase class does the following:
hitContext.hit().terms(hitContext.reader().getTermFreqVector(hitContext.hit().docId(), "_all"));
The problem is that this line throws an exception. The exception is:
Caused by: java.io.EOFException: read past EOF: NIOFSIndexInput(path="/Users/**/Workspace/java//elasticsearch/data/elasticsearch/nodes/0/indices/articles/0/index/_a.tvx")
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:264)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:40)
    at org.apache.lucene.store.DataInput.readInt(DataInput.java:86)
    at org.apache.lucene.store.BufferedIndexInput.readInt(BufferedIndexInput.java:179)
    at org.apache.lucene.store.DataInput.readLong(DataInput.java:130)
    at org.apache.lucene.store.BufferedIndexInput.readLong(BufferedIndexInput.java:192)
    at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:227)
    at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:281)
    at org.apache.lucene.index.SegmentReader.getTermFreqVector(SegmentReader.java:747)
    at org.elasticsearch.search.fetch.terms.TermsFetchSubPhase.hitExecute(TermsFetchSubPhase.java:61)
    ... 8 more
(I have augmented the path slightly with ***s)
The weird thing is that at one point this was working briefly (after a
re-index) but then it stopped working. I have deleted the index I was
working with and re-indexed it multiple times. The mapping for the index is:
war: {
    properties: {
        text: {
            "type": "string"
        },
        file: {
            "type": "string"
        }
    },
    "_all": {
        "term_vector": "with_positions_offsets"
    }
}
That is a copy and paste from the head plugin. Any ideas what is going
on here? Am I doing the right thing to make sure the term vectors are
stored? Am I trying to access them the wrong way? The one time I got it to
work, when I looked at hitContext.doc() in a debugger, each document had
one field which was "_all". When I look now, each document has two fields
which are "text" and "file" but no "_all".
Cheers
--
Ryan Stuart, B.Eng
Software Engineer
ABN: 81-206-082-133
E: ryan@stuart.id.au
M: +61-431-299-036