Percolating documents containing PDF attachments

Adam_Georgiou · January 25, 2013, 8:29pm

I have an index set up in which we store documents containing PDF
attachments, via the Tika plugin. The index works great for indexing and
searching documents. However, when we run documents of the same type
through the percolator index, queries referencing text in the attachment
don't match, even though they return documents when run as searches.

For example, we store this query in the percolator index:

{"query": {"constant_score": {"filter": {"and": [
    {"terms": {"author": ["george"]}},
    {"query": {"match_phrase": {"ATTACHMENT": "company"}}}
]}}}}

In summary, I want documents containing the word "george" in the author
field and containing the word "company" within the text of the attachment.
Using this query against the document index returns documents. Storing it
in the percolator index, and running documents against it does not. In
fact, none of the queries referencing the attachment match.
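
For readers following along, the two requests described above can be sketched in Python. This is only a sketch under assumptions: the index name `docs`, the type name `doc`, and the query name `george-company` are hypothetical, and the paths follow the `_percolator` index convention of the Elasticsearch releases current when this thread was written. The attachment field is sent base64-encoded, which is what the attachment mapper expects at index time. Nothing is sent over the network; the functions only build the method, path, and body.

```python
import base64
import json

# The query from the post, without the outer {"query": ...} wrapper.
QUERY = {
    "constant_score": {
        "filter": {
            "and": [
                {"terms": {"author": ["george"]}},
                {"query": {"match_phrase": {"ATTACHMENT": "company"}}},
            ]
        }
    }
}


def build_percolator_registration(index, name, query):
    """Return (method, path, body) that stores `query` in the _percolator index.

    `index` and `name` are hypothetical placeholders chosen by the caller.
    """
    path = "/_percolator/{}/{}".format(index, name)
    return ("PUT", path, json.dumps({"query": query}))


def build_percolate_request(index, doc_type, author, pdf_bytes):
    """Return (method, path, body) that percolates one document.

    The PDF is base64-encoded, matching how attachment fields are indexed.
    """
    attachment = base64.b64encode(pdf_bytes).decode("ascii")
    body = {"doc": {"author": author, "ATTACHMENT": attachment}}
    path = "/{}/{}/_percolate".format(index, doc_type)
    return ("GET", path, json.dumps(body))


if __name__ == "__main__":
    # Placeholder bytes stand in for a real PDF file.
    for request in (
        build_percolator_registration("docs", "george-company", QUERY),
        build_percolate_request("docs", "doc", "george", b"%PDF-1.4 placeholder"),
    ):
        print(request[0], request[1])
        print(request[2])
```

Comparing the percolate body against the body used to index the same document is a quick way to rule out a difference in how the attachment is sent.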

I'm thinking, perhaps, there's some sort of secondary mapping I need to set
up for the _percolator index, in order to tell it to decode the attachment
and parse it, but can't find any documentation alluding to it. It's also
just a wild guess, as the setup I keep describing is purely a sandbox my
team is using to evaluate Elasticsearch. In other words, we don't really
know what we're doing... at least not yet!

Does anyone have any idea as to what might be happening?

-Adam

Adam_Georgiou · January 28, 2013, 2:48pm

I'm guessing this might have gotten buried, having been posted late Friday
afternoon.

Givin' it a one-time-only, beginning of the week bump and I won't buy you
guys (for a while)...

-Adam

Adam_Georgiou · January 28, 2013, 3:31pm

EDIT:

and I won't bother* you guys for a while...

sorry for the mistype

karmi · January 29, 2013, 7:46am

> In summary, I want documents containing the word "george" in the author
> field and containing the word "company" within the text of the attachment.
> Using this query against the document index returns documents. Storing it
> in the percolator index, and running documents against it does not. In
> fact, none of the queries referencing the attachment match.
>
> I'm thinking, perhaps, there's some sort of secondary mapping I need to
> set up for the _percolator index, in order to tell it to decode the
> attachment and parse it, but can't find any documentation alluding to it.

I've never used the percolator with attachments, but if you need to
configure the settings/mappings for the _percolator index, try deleting it
and recreating it with custom settings/mappings, just as you would any
other index. (I'm actually not sure the configuration "sticks", but it
would be a good starting point.)
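
A minimal sketch of that delete-and-recreate sequence, in Python, building the requests without sending them. The type name `docs` and the choice to map `ATTACHMENT` as type `attachment` are assumptions for illustration only; whether such a mapping on `_percolator` has any effect on percolation is exactly the open question here. Note that deleting `_percolator` also removes every query already registered in it.

```python
import json


def recreate_percolator_plan(settings=None, mappings=None):
    """Return the (method, path, body) steps to recreate the _percolator index.

    Step 1 deletes the index (and every registered query with it).
    Step 2 creates it again with the caller's settings/mappings, using the
    same create-index body shape as for any other index.
    """
    create_body = {}
    if settings:
        create_body["settings"] = settings
    if mappings:
        create_body["mappings"] = mappings
    return [
        ("DELETE", "/_percolator", None),
        ("PUT", "/_percolator", json.dumps(create_body)),
    ]


if __name__ == "__main__":
    # Hypothetical mapping: type "docs" with an attachment-typed field.
    # This is an untested guess, not a confirmed fix.
    example_mappings = {
        "docs": {"properties": {"ATTACHMENT": {"type": "attachment"}}}
    }
    for method, path, body in recreate_percolator_plan(mappings=example_mappings):
        print(method, path, body or "")
```

Stored queries would need to be registered again after the index is recreated.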

Karel
