Percolating documents containing PDF attachments

I have an index set up in which we store documents containing PDF
attachments, via the TIKA plugin. The index works great, in terms of
indexing and searching on documents. However, when we run documents of
the same type through the percolator index, queries referencing text in
the attachment don't match, even though the same queries return documents
when run as searches.

For example, we store this query in the percolator index:

{
  "query": {
    "constant_score": {
      "filter": {
        "and": [
          {"terms": {"author": ["george"]}},
          {"query": {"match_phrase": {"ATTACHMENT": "company"}}}
        ]
      }
    }
  }
}

In summary, I want documents containing the word "george" in the author
field and containing the word "company" within the text of the attachment.
Using this query against the document index returns documents. Storing it
in the percolator index, and running documents against it does not. In
fact, none of the queries referencing the attachment match.
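
For anyone trying to reproduce this, here is a minimal Python sketch of the two steps described above: storing the query in the percolator index, then percolating a document that carries a PDF. It assumes the pre-1.0 percolator endpoints (`PUT /_percolator/<index>/<name>` and `/<index>/<type>/_percolate`) and a node on localhost:9200. The index name `docs`, the type `doc`, and the query name are hypothetical placeholders, not taken from the setup above. The functions only build the requests; nothing is sent until you pass one to `urllib.request.urlopen`.

```python
import base64
import json
import urllib.request

ES = "http://localhost:9200"  # assumption: local sandbox node

# The query from the post, without the outer {"query": ...} wrapper.
QUERY = {
    "constant_score": {
        "filter": {
            "and": [
                {"terms": {"author": ["george"]}},
                {"query": {"match_phrase": {"ATTACHMENT": "company"}}},
            ]
        }
    }
}


def register_query(index, name, query):
    """Build the request that stores `query` under /_percolator/<index>/<name>."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{ES}/_percolator/{index}/{name}", data=body, method="PUT"
    )


def percolate(index, doc_type, author, pdf_bytes):
    """Build the request that percolates one document with a PDF attachment."""
    doc = {
        "author": author,
        # The attachment mapper expects the file content base64-encoded.
        "ATTACHMENT": base64.b64encode(pdf_bytes).decode("ascii"),
    }
    body = json.dumps({"doc": doc}).encode("utf-8")
    return urllib.request.Request(
        f"{ES}/{index}/{doc_type}/_percolate", data=body, method="POST"
    )
```

Usage would look something like `urllib.request.urlopen(register_query("docs", "george-company", QUERY))` followed by `urllib.request.urlopen(percolate("docs", "doc", "george", open("file.pdf", "rb").read()))`, with the file name again being a placeholder.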

I'm thinking, perhaps, there's some sort of secondary mapping I need to set
up for the _percolator index, in order to tell it to decode the attachment
and parse it, but can't find any documentation alluding to it. It's also
just a wild guess, as the setup I keep describing is purely a sandbox my
team is using to evaluate elasticsearch. In other words, we don't really
know what we're doing... at least not yet!

Does anyone have any idea as to what might be happening?

-Adam

--

I'm guessing this might have gotten buried, having been posted late Friday
afternoon.

Givin' it a one-time-only, beginning of the week bump and I won't buy you
guys (for a while)...

-Adam

On Friday, January 25, 2013 3:29:18 PM UTC-5, Adam Georgiou wrote:

EDIT:

and I won't bother* you guys for a while...

sorry for the mistype

On Monday, January 28, 2013 9:48:04 AM UTC-5, Adam Georgiou wrote:

I'm guessing this might have gotten buried, having been posted late Friday
afternoon.

Givin' it a one-time-only, beginning of the week bump and I won't buy you
guys (for a while)...

-Adam

Never used percolator with attachments, but if you need to configure the
settings/mappings for the _percolator index, try deleting it, and
creating it with custom settings/mappings just as any other index. (I'm
actually not sure the configuration "sticks", but it would be a good
starting point).

Karel
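
A rough Python sketch of what Karel's suggestion could look like: drop the `_percolator` index and recreate it with explicit settings and mappings, using the same create-index call as for any other index. The mapping shown is purely a guess for illustration (the type name `docs` and the `attachment` field mapping on `ATTACHMENT` are assumptions), and, as Karel notes, it is not confirmed that the percolator honors such configuration. The function only builds the two requests; send each with `urllib.request.urlopen`.

```python
import json
import urllib.request

ES = "http://localhost:9200"  # assumption: local sandbox node


def recreate_percolator_requests(settings, mappings):
    """Return the DELETE and PUT requests that rebuild the _percolator index."""
    delete = urllib.request.Request(f"{ES}/_percolator", method="DELETE")
    body = json.dumps({"settings": settings, "mappings": mappings}).encode("utf-8")
    create = urllib.request.Request(f"{ES}/_percolator", data=body, method="PUT")
    return delete, create


# Hypothetical mapping: the type name "docs" and the field name are placeholders.
EXAMPLE_MAPPINGS = {
    "docs": {"properties": {"ATTACHMENT": {"type": "attachment"}}}
}
```

Note that deleting `_percolator` also removes every stored query, so they would need to be registered again afterwards.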
