Elasticsearch 5.0 percolate: Is there a way to percolate on a PDF attachment?


(Shradha Bhalla) #1

Here is what I did:
DELETE testpdf
PUT /testpdf
{
  "mappings": {
    "doctype": {
      "properties": {
        "message": {
          "type": "attachment",
          "fields": {
            "content": {
              "type": "text",
              "term_vector": "with_positions_offsets",
              "store": true
            }
          }
        }
      }
    },
    "queries": {
      "properties": {
        "query": {
          "type": "percolator"
        }
      }
    }
  }
}

GET /testpdf/_mapping

So far so good. It lets me create the index with message as type attachment and query as type percolator.
But when I try to register a query, it gives an error:

PUT /testpdf/queries/1?refresh
{
  "query": {
    "match": {
      "message": "Medicaid Fraud"
    }
  }
}

After reading more of the Elasticsearch 5.0 percolator documentation, I realize that documents are not stored in 5.0 the way I set this up in my index.

My question is: how can I percolate the text of a PDF? Is that possible in Elasticsearch? If not, what are the alternatives?
I set up a pipeline and ingested a PDF into an index as an attachment, so I have attachment.content available.
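
For readers following along, a pipeline setup of this kind might look roughly like the sketch below. It only builds the request paths and bodies and sends nothing. It assumes the ingest-attachment plugin is installed; the pipeline id `pdf-attachment`, the source field `data`, and the document id `1` are hypothetical names for illustration, not taken from this post.

```python
import base64
import json

PIPELINE_ID = "pdf-attachment"  # hypothetical pipeline id


def pipeline_body(field="data"):
    """Body for PUT /_ingest/pipeline/<PIPELINE_ID>."""
    return {
        "description": "Extract text from base64-encoded files",
        # The attachment processor reads base64 content from `field`.
        "processors": [{"attachment": {"field": field}}],
    }


def index_request(index, doc_type, doc_id, file_bytes, field="data"):
    """Return (path, body) for indexing one file through the pipeline."""
    path = "/{}/{}/{}?pipeline={}".format(index, doc_type, doc_id, PIPELINE_ID)
    body = {field: base64.b64encode(file_bytes).decode("ascii")}
    return path, body


# Placeholder bytes standing in for a real PDF file.
path, body = index_request("myindex", "my-order", "1", b"%PDF-1.4 ...")

print("PUT /_ingest/pipeline/" + PIPELINE_ID)
print(json.dumps(pipeline_body(), indent=2))
print("PUT " + path)
print(json.dumps(body))
```

With the processor's default target field, the extracted text ends up under attachment.content in the indexed document, which is the field referred to above.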

Thanks.


(Shradha Bhalla) #2

I have the percolator working with a document from another index. My test:
PUT /action-index
{
  "mappings": {
    "doctype": {
      "properties": {
        "message": {
          "type": "text"
        }
      }
    },
    "queries": {
      "properties": {
        "query": {
          "type": "percolator"
        }
      }
    }
  }
}

PUT /action-index/queries/A2?refresh
{
  "query": {
    "match": {
      "message": "Its here in Dallas, TX"
    }
  }
}

I created a document:
PUT /action-index/message/2
{
  "message": "Does it have Dallas in the text"
}

And now I percolate, reading the document from this index:
GET /action-index/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document_type": "doctype",
      "index": "action-index",
      "type": "message",
      "id": "2",
      "version": 1
    }
  }
}
And it works like a charm, thanks to the clear documentation at
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-percolate-query.html

Now I want to extend that to pick up the PDF content that I indexed in Elasticsearch:

GET /action-index/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document_type": "doctype",
      "index": "myindex",
      "type": "my-order",
      "id": "AVhGnETdtR0OofGzo8Jo",
      "fields": "attachment.content",
      "version": 1
    }
  }
}

It doesn't give an error, but it never finds a match (hits: 0).
The text that I want to percolate is in the field attachment.content. I wonder whether that needs to be mentioned somewhere in the search above.
Any thoughts?


(David Pilato) #3

Please format your code using the </> icon. It will make your post more readable.

That's a very interesting use case. Thanks for bringing this up.

Here is what I think. First of all, the percolator is now a query. The _search endpoint does not expose an ingest parameter, and neither does the percolate query.

I doubt we would implement that, actually, as it would make little sense to slow down query processing by processing documents first.

But on the other hand, as I said, the use case is interesting indeed. Let's say that I want to classify my binary documents...

So here is a proposal to do this:

  • Register an ingest pipeline using the attachment processor
  • Call the ingest _simulate endpoint
  • Extract the enriched document from the response
  • Use the percolate query with this document

And you should be done.
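
The last three steps above can be sketched as follows. This is a rough illustration and not a tested client: it only builds the request bodies and parses a response, leaving the HTTP calls to whatever client you use. The source field `data`, the sample response content, and the helper names are assumptions; `doctype` and `query` come from the mapping earlier in this thread.

```python
import base64
import json


def simulate_request(file_bytes, field="data"):
    """Body for POST /_ingest/pipeline/<pipeline-id>/_simulate."""
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return {"docs": [{"_source": {field: encoded}}]}


def enriched_source(simulate_response):
    """Return the processed _source of the first simulated document."""
    first = simulate_response["docs"][0]
    if "doc" not in first:
        # A failed document carries an error entry instead of "doc".
        raise ValueError("simulate failed: {}".format(first))
    return first["doc"]["_source"]


def percolate_request(source, document_type="doctype", field="query"):
    """Body for GET /<index-with-queries>/_search using an inline document."""
    # Keep only the extracted text; the base64 payload is not needed here.
    document = {"attachment": {"content": source["attachment"]["content"]}}
    return {
        "query": {
            "percolate": {
                "field": field,
                "document_type": document_type,
                "document": document,
            }
        }
    }


# Hand-written stand-in for a _simulate response (content is made up).
sample_response = {
    "docs": [
        {
            "doc": {
                "_source": {
                    "data": "<base64 omitted>",
                    "attachment": {
                        "content_type": "application/pdf",
                        "content": "Medicaid Fraud report",
                    },
                }
            }
        }
    ]
}

body = percolate_request(enriched_source(sample_response))
print(json.dumps(body, indent=2))
```

Note that the registered queries would have to target attachment.content for this document to match them.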

WDYT?


(Shradha Bhalla) #4

Thanks, David, for your response.
Actually, I solved it and will post the clean solution in a bit.
In a nutshell, I was passing "fields": "attachment.content" in the
GET /action-index/_search request, whereas the registered percolator query matched on "message".
After changing "message" to "attachment.content" in the registered query, I got it working.
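
To illustrate the mismatch, here is a small sketch that checks whether the fields a simple match query targets exist in the document being percolated. It is only a toy check and not how Elasticsearch evaluates percolator queries; the helper names and the sample document content are made up for illustration.

```python
def match_fields(query):
    """Field names targeted by a simple {"match": {...}} query."""
    return set(query.get("match", {}))


def has_path(document, dotted):
    """True if the dotted field path exists in a nested document."""
    node = document
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True


def missing_fields(query, document):
    """Fields the query refers to that the document does not contain."""
    return sorted(f for f in match_fields(query) if not has_path(document, f))


# Shape of a document enriched by the attachment processor.
document = {"attachment": {"content": "Report on Medicaid Fraud"}}

old_query = {"match": {"message": "Medicaid Fraud"}}
new_query = {"match": {"attachment.content": "Medicaid Fraud"}}

print(missing_fields(old_query, document))  # ['message']
print(missing_fields(new_query, document))  # []
```

The old query refers to a field the PDF document never has, which is consistent with the search returning zero hits rather than an error.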

