Unable to Find Document, Searching Contents - Mapper Attachments Plugin

Could someone please try and point out what I am missing / doing wrong?

My environment details:

Java 1.8.0u45
Elasticsearch 2.1.0
Elasticsearch Mapper Attachments 3.1.0

Here is a snippet from when I start up ES - mapper-attachments loaded:

[2016-04-27 12:31:14,849][INFO ][node                     ] [Trevor Fitzroy] version[2.1.0], pid[3978], build[72cd1f1/2015-11-18T22:40:03Z]
[2016-04-27 12:31:14,851][INFO ][node                     ] [Trevor Fitzroy] initializing ...
[2016-04-27 12:31:16,358][INFO ][plugins                  ] [Trevor Fitzroy] loaded [mapper-attachments], sites []
[2016-04-27 12:31:16,453][INFO ][env                      ] [Trevor Fitzroy] using [1] data paths, mounts [[/ (/dev/disk1)]], net usable_space [59.9gb], net total_space [111.8gb], spins? [unknown], types [hfs]
[2016-04-27 12:31:22,901][INFO ][node                     ] [Trevor Fitzroy] initialized
[2016-04-27 12:31:22,902][INFO ][node                     ] [Trevor Fitzroy] starting ...
[2016-04-27 12:31:23,157][INFO ][transport                ] [Trevor Fitzroy] publish_address {127.0.0.1:9300}, bound_addresses {127.0.0.1:9300}, {[fe80::1]:9300}, {[::1]:9300}

Here is the mapping I've created:

{
	"settings": {
		"number_of_shards": 1
	},
	"mappings": {
		"document": {
			"properties": {
				"tags": {
					"analyzer": "snowball",
					"type": "string"
				},
				"rank": {
					"analyzer": "keyword",
					"type": "string"
				},
				"upload_date": {
					"analyzer": "keyword",
					"type": "string"
				},
				"document_contents": {
					"type": "attachment"
				}
			}
		}
	}
}

The document content is the following (testing.txt):

I am testing the mapper plugin lets see ..

Here is my put request:

{
	"id": 9,
	"tags": "order, april, testing",
	"rank": 1,
	"upload_date": "2016-04-27T02:18:23.974Z",
	"document_contents": {
		"_indexed_chars": -1,
		"_content": "SSBhbSB0ZXN0aW5nIHRoZSBtYXBwZXIgcGx1Z2luIGxldHMgc2VlIC4uCg==\n"
	}
}
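
As a quick sanity check on the request above, the `_content` value can be decoded locally to confirm that it really is the Base64 of testing.txt. A minimal Python sketch (the variable names are only for illustration):

```python
import base64

# The `_content` value from the put request, including its trailing newline.
encoded = "SSBhbSB0ZXN0aW5nIHRoZSBtYXBwZXIgcGx1Z2luIGxldHMgc2VlIC4uCg==\n"

# b64decode discards the trailing newline by default (validate=False).
decoded = base64.b64decode(encoded).decode("utf-8")

print(repr(decoded))  # 'I am testing the mapper plugin lets see ..\n'
```

So the payload itself looks fine, which points at the indexing request or the mapping rather than the encoding.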

And finally my query:

{
	"query": {
		"filtered": {
			"query": {
				"match": {
					"document_contents.content": "testing"
				}
			},
			"filter": {
				"range": {
					"rank": {
						"gte": "1"
					}
				}
			}
		}
	},
	"sort": [{
		"upload_date": {
			"order": "desc"
		}
	}]
}
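
Side note: as far as I know, the `filtered` query is deprecated as of ES 2.0 in favour of a `bool` query with a `filter` clause. Below is a sketch of how the same request body could be built that way in Python. The function name `build_search` is hypothetical; the field names are the ones from the mapping above.

```python
import json


def build_search(field, text, min_rank):
    """Build a bool query: full-text match in `must`, rank range in `filter`."""
    return {
        "query": {
            "bool": {
                "must": {"match": {field: text}},
                "filter": {"range": {"rank": {"gte": min_rank}}},
            }
        },
        "sort": [{"upload_date": {"order": "desc"}}],
    }


body = build_search("document_contents.content", "testing", "1")
print(json.dumps(body, indent=2))
```

This only changes the shape of the request; it should match the same documents as the `filtered` version.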

Here is the Mapper Attachments doc using 'query match' to search the document contents:

Any help is much appreciated.
Thank you.

What if you search in the document_contents field?

Hi David,

Thanks for your input, but no cigar :disappointed:

Request:

{
	"query": {
		"filtered": {
			"query": {
				"match": {
					"document_contents": "testing"
				}
			},
			"filter": {
				"range": {
					"rank": {
						"gte": "1"
					}
				}
			}
		}
	},
	"sort": [{
		"upload_date": {
			"order": "desc"
		}
	}]
}

Response - same as for 'document_contents.content':

{
	"took": 3,
	"timed_out": false,
	"_shards": {
		"total": 1,
		"successful": 1,
		"failed": 0
	},
	"hits": {
		"total": 0,
		"max_score": null,
		"hits": []
	}
}

However, if I search on 'tags' - Request:

{
	"query": {
		"filtered": {
			"query": {
				"match": {
					"tags": "april"
				}
			},
			"filter": {
				"range": {
					"rank": {
						"gte": "1"
					}
				}
			}
		}
	},
	"sort": [{
		"upload_date": {
			"order": "desc"
		}
	}]
}

Response - searching on tags:

{
	"took": 14,
	"timed_out": false,
	"_shards": {
		"total": 1,
		"successful": 1,
		"failed": 0
	},
	"hits": {
		"total": 1,
		"max_score": null,
		"hits": [{
			"_index": "documents-1-index",
			"_type": "documents-1-type",
			"_id": "9",
			"_score": null,
			"_source": {
				"id": 9,
				"tags": "order, april, testing",
				"rank": 1,
				"upload_date": "2016-04-27T07:13:22.417Z",
				"document_contents": {
					"_indexed_chars": -1,
					"_content": "SSBhbSB0ZXN0aW5nIHRoZSBtYXBwZXIgcGx1Z2luIGxldHMgc2VlIC4uCg==\n"
				}
			},
			"sort": ["2016-04-27T07:13:22.417Z"]
		}]
	}
}

I can see that the document is there, just can't search on content.
Should I raise a bug on github for 'elasticsearch-mapper-attachments' ?

Thanks for trying to help me with that.. :slight_smile:

Further to that I have clean installed the following:

Elasticsearch-2.1.2
Elasticsearch-Mapper-Attachments-3.1.2

But still no luck, so I raised a GitHub issue.

After a good night's sleep I've figured out the problem! :smile:

I was posting the document to:
http://localhost:9200/documents/1

Instead of posting to:
http://localhost:9200/documents/document/1

So embarrassing; anyways, thanks! :blush:
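
For anyone hitting the same thing, here is a rough sketch of why the URL matters. It assumes the ES 2.x REST convention where a document path is /{index}/{type}/{id}, and that a POST to a two-segment path is read as /{index}/{type} with an auto-generated id. Under that assumption, the first URL would index into a type named "1", which does not carry the attachment mapping defined for the "document" type. The helper names (`index_url`, `interpret_path`) are made up for illustration and are not part of any Elasticsearch client.

```python
from urllib.parse import urlparse


def index_url(base, index, doc_type, doc_id):
    """Build an ES 2.x style document URL: /{index}/{type}/{id}."""
    return "{}/{}/{}/{}".format(base.rstrip("/"), index, doc_type, doc_id)


def interpret_path(url):
    """Show how the path segments would be read (assumed ES 2.x convention)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    keys = ["index", "type", "id"]
    # Segments missing from the URL are reported as None.
    return {key: (parts[i] if i < len(parts) else None) for i, key in enumerate(keys)}


wrong = interpret_path("http://localhost:9200/documents/1")
right = interpret_path("http://localhost:9200/documents/document/1")

print(wrong)  # type is "1", no explicit id
print(right)  # type is "document", id is "1"
```

Building the URL with something like `index_url("http://localhost:9200", "documents", "document", 1)` makes the type explicit, so it is harder to drop by accident.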

:slight_smile:

That's why, IMO, it's always better to give the exact commands you send.

Your scripts were really complete though. I just had not found the time to run them yet :frowning:

It's great that you solved it. Thanks for the follow-up.

I'm pretty stoked myself too! :smiley:

Although, I have ended up downgrading to ES-2.0.2.
That's because ES-ATT-3.0.4 supports way more file types.

Looking forward to ES-5 and the new Ingest plugin though
Great work! Thanks :wink:

Please report file types you need to support.
We decided at some point to reduce the surface of the plugin because of jar hell issues and security concerns (the security manager was doing its job).

We might be able to add support for some other files.

Ingest works the same way.

Here is my wish list:

HTML / XML
Microsoft Office Documents
Open Document Formats
iWork
PDF
RTF
Text Formats
Mail Formats
Compressed Archives ( containing any of the above )

Super awesome to have:

Images ( metadata and OCR [ flag for OCR {default on} ] )
Details on its dependency on 'tesseract' and good doc about it.

Thanks for your help and collaboration.

Cheers, Gui

I know that some of them are working fine, like PDF, OOo, MS Office...

If you have some example files which are not working, could you upload them somewhere (Pastebin, for example), create an issue in the elasticsearch repo, and link to them?

There have been discussions about OCR already, and sadly it currently needs a third-party tool. There is no pure Java lib around, AFAIK.

Hi David,

I have given it a go using ES-2.0.2 and ES-ATT-3.0.4, and all the types on my wish list worked fine :smile:
The only exception is the image OCR text extraction, which I didn't manage to get going.

According to this post ES-1.7.2 with ES-ATT-2.7.1 is capable of doing the Image OCR text extraction:

However, some fiddling around is needed to get tika / tesseract working (adjusting permissions, from what I understood), so it would be great to have a 'how to' on that. Also, from the looks of it, I can't use tika / tesseract with ES-2.x due to security measures / limitations :weary:

I have found this Java lib (OCR) that seems to do some of the job:
https://github.com/axet/lookup
Perhaps you guys could fork it and improve it.

Anyways, thanks! :smiley: