Index specific keywords instead of whole document in Elasticsearch

I have a requirement: given a document (Word document, text file, PDF, etc.), I need to index only specified keywords such as particular names, places, dates, and some other keywords, instead of the whole document, because we have some memory constraints.

Example:

Let's say I have a document about the United States (https://en.wikipedia.org/wiki/United_States). When I index this document in Elasticsearch, it should index only names, places, dates, and some keywords instead of the whole document, so that querying with these keywords in Kibana returns this document.

I don't know whether this is possible in ES or not.

Please give suggestions.

Thank you.

I think you want to do something like entity recognition, which is not available in Elasticsearch by default.

But you can have a look at what @spinscale wrote as a plugin:

Might help you.

Thank you for your quick reply.

You got me wrong.
Here is the situation: I will provide a list of words to ES, and only those words should be indexed from the document being indexed. When I query with those keywords, Kibana should return the document, not the whole content of the document (that is, the location of the document, or at least it should say "this document has that keyword").

Thank you

I see.
So you probably want this: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keep-words-tokenfilter.html
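
To make the effect of the keep words filter concrete, here is a rough simulation in plain Python. It is not Elasticsearch's implementation: the regular expression is only a crude stand-in for the standard tokenizer, and the function name keep_words and the sample sentence are made up for illustration.

```python
import re


def keep_words(text, keep):
    """Roughly mimic: standard tokenizer -> lowercase -> keep filter."""
    # Normalize the keep list the same way the tokens are normalized.
    keep_set = {w.strip().lower() for w in keep if w.strip()}
    # \w+ is a crude approximation of the standard tokenizer (assumption).
    tokens = re.findall(r"\w+", text.lower())
    # Only tokens present in the keep list survive; everything else is dropped.
    return [t for t in tokens if t in keep_set]


sample = "India is in South Asia. The Mughal Empire ruled much of India."
kept = keep_words(sample, ["india", "empire"])
print(kept)
```

Only the surviving tokens would end up in the inverted index for that field; the filter by itself says nothing about what is stored or returned for the document.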

Yes, I am trying to do the same. But when I try to index a document, I get the error below!

Please don't post images of text as they are hardly readable and not searchable.

Instead paste the text and format it with </> icon. Check the preview window.

Sorry for that..
I have tried this:

PUT /keep_words_example2
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "example_2" : {
          "tokenizer" : "standard",
          "filter" : ["standard", "lowercase", "words_in_file"]
        }
      },
      "filter" : {
        "words_in_file" : {
          "type" : "keep",
          "keep_words_path" : "analysis/example_word_list.txt"
        }
      }
    }
  }
}

My keywords file is a text file with one keyword per line.

POST keep_words_example2/doc1/1

{
"data":"India, officially the Republic of India (Bhārat Gaṇarājya) is a country in South Asia. It is the seventh-largest country by area, the second-most populous country (with over 1.2 billion people), and the most populous democracy in the world. It is bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast. It shares land borders with Pakistan to the west;[f] China, Nepal, and Bhutan to the northeast; and Myanmar (Burma) and Bangladesh to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives. Indias diverse culture. Much of the north fell to the Delhi sultanate; the south was united under the Vijayanagara Empire. The economy expanded in the 17th century in the Mughal Empire.In the mid-18th century, the subcontinent came under British East India Company rule, and in the mid-19th under British crown rule. A nationalist movement emerged in the late 19th century, which later, under Mahatma Gandhi, was noted for nonviolent resistance and led to Indias sixth largest by nominal GDP and third largest by purchasing power parity. Following market-based economic reforms in 1991, India became one of the fastest-growing major economies and is considered a newly industrialised country. However, it continues to face the challenges of poverty, corruption, malnutrition, and inadequate public healthcare. A nuclear weapons state and regional power, it has the third largest standing army in the world and ranks fifth in military expenditure among nations. India is a federal republic governed under a parliamentary system and consists of 29 states and 7 union territories. It is a pluralistic, multilingual and multi-ethnic society and is also home to a diversity of wildlife in a variety of protected habitats"
}

But I am getting the error below:

{
  "error": {
    "root_cause": [
      {
        "type": "parse_exception",
        "reason": "request body is required"
      }
    ],
    "type": "parse_exception",
    "reason": "request body is required"
  },
  "status": 400
}

Somewhere I read that we need to encode it as base64; I tried that too but got the same error.

Any solution would really help me.

You have an empty line between the url and the content. Try removing this.
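
Outside of Console, the same index request can be built so that the body is attached explicitly and cannot get separated from the request line. The sketch below uses only the Python standard library and assumes a node at http://localhost:9200; the helper name build_index_request is hypothetical, the sample text is abbreviated, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a POST that indexes one JSON document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    # The body is a JSON object; plain text goes in as a normal string value.
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


req = build_index_request(
    "http://localhost:9200",  # assumed local node
    "keep_words_example2",
    "doc1",
    1,
    {"data": "India, officially the Republic of India ..."},
)
# urllib.request.urlopen(req) would send it; that call is left out here.
print(req.get_method(), req.full_url, len(req.data), "bytes")
```

The document is sent as a JSON object such as {"data": "text"}, not as a bare string. As far as I know, base64 only comes into play for binary content handled by an attachment processor, not for plain text like this.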

You still have an empty line there. I copied this and was able to index the record from Console after having removed the empty line.

Thank you for your response.

I cross-checked many times that I do not have any empty line in the above text.

Were you able to create the index with the filter or without the filter?

If it works, the query below

GET /_search
{
  "query": {
    "match_all": {}
  }
}

should show the file, not the whole file content.

Please share the text that you were able to index.

And is the right format to index "data":"text" or only "text"?

Use these mappings and settings when you create the new index:

    {
        "mappings": {
            "doc1": {
                "properties": {
                    "data": {
                        "type": "text",
                        "analyzer": "example_2"
                    }
                }
            }
        },
        "settings": {
            "analysis": {
                "analyzer": {
                    "example_2": {
                        "tokenizer": "standard",
                        "filter": ["standard", "lowercase", "words_in_file"]
                    }
                },
                "filter": {
                    "words_in_file": {
                        "type": "keep",
                        "keep_words_path": "analysis/example_word_list.txt"
                    }
                }
            }
        }
    }

Let's say example_word_list.txt contains:

india
empire

Now create the doc with your data above at: http://localhost:9200/yourindex/doc1

Check the terms that are indexed by doing this:

http://localhost:9200/yourindex/doc1/<doc_id>/_termvectors?fields=data
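
The term vectors response nests the terms under term_vectors, then the field name, then terms. A small helper like the one below could pull out just the indexed terms and their frequencies. The sample response is hand-written to show the shape only (it is not real output from this index), and the helper name indexed_terms is made up.

```python
def indexed_terms(termvectors_response, field="data"):
    """Return {term: term_freq} for one field of a _termvectors response."""
    field_info = termvectors_response.get("term_vectors", {}).get(field, {})
    terms = field_info.get("terms", {})
    return {term: info.get("term_freq", 0) for term, info in terms.items()}


# Hand-written sample in the shape of a _termvectors response (illustrative).
sample_response = {
    "found": True,
    "term_vectors": {
        "data": {
            "terms": {
                "india": {"term_freq": 2},
                "empire": {"term_freq": 1},
            }
        }
    },
}

print(indexed_terms(sample_response))
```

If the keep filter is working, only words from example_word_list.txt should show up as keys here.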

Thank you for your response.

It gives all occurrences of the tokens. But when I search for a token, it should show that this particular file contains that token.

Basically, I have a directory with different files. Those files are to be indexed using FSCrawler, not by means of any code. So if I search with a keyword, it should show that this file contains this keyword.

I am trying to do it this way: put these mapping settings in FSCrawler's _settings.json, index all files with these settings, and search with these keywords, so that it shows only the location of the file that contains the keyword, not the whole file content.

Please give me any solution that matches my scenario.

The keywords file is just a filter. You won't get the contents of the keywords file back.
The data you have indexed is the country info in the data field. ES has indexed only the words that you specified in the keywords file.
That means if you query with your keywords, the whole document in ES will be returned.
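
If only the file location is wanted in the hits, one option is to limit which parts of the stored document come back by using source filtering in the search request. The sketch below builds such a request body; the field name path is an assumption (nothing in this thread defines where the file location is stored), and the helper name is hypothetical.

```python
import json


def build_keyword_search(keyword, text_field="data", path_field="path"):
    """Build a search body that matches on the text but returns only the path."""
    return {
        # Source filtering: return only this field from each hit.
        "_source": [path_field],
        # The match query runs against the analyzed (keyword-filtered) field.
        "query": {"match": {text_field: keyword}},
    }


search_body = build_keyword_search("india")
print(json.dumps(search_body, indent=2))
```

This changes only what is returned, not what is stored: the full text would still sit in the index's stored source.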

I am not able to follow you fully, given the way you have framed the sentences. Either I am not really understanding what you want to convey, or you are trying to implement something without fully understanding how ES indexing works :)

Thank you for your reply. I need to work more on how ES indexes the data. Anyway,

I have two use cases.

Use case 1: index only specific keywords

I have TBs of data files that contain specific keywords. When I index these documents, only these specific keywords should be indexed, not the whole document. And when I query with a particular keyword, the result should be the file path, not the whole content of the document.

Use case 2: do not save the whole document in ES

These TBs of documents are big in size. So when I index them, only the keywords in them should be indexed and stored in ES, not the whole TBs of files, because that consumes too much memory. That is, the whole content of the document should not be indexed or stored in ES, only the keywords.

Please suggest a solution for my use cases.

Thank you.
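
One possible direction for both use cases, offered as a sketch rather than a confirmed solution: keep the keep-words analyzer from earlier in the thread so that only listed words are indexed, and additionally exclude the large text field from the stored source through the mapping's _source excludes setting, so that only a small path field is stored. The path field and the helper name are assumptions, and the body follows the typed-mapping style (doc1) and filter chain used in this thread; on newer Elasticsearch versions the type level and the standard token filter would, as far as I know, have to be dropped.

```python
import json


def build_index_body(doc_type="doc1",
                     keep_words_path="analysis/example_word_list.txt"):
    """Index body: index only kept words, and do not store the full text."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "example_2": {
                        "tokenizer": "standard",
                        # Same filter chain as posted in the thread.
                        "filter": ["standard", "lowercase", "words_in_file"],
                    }
                },
                "filter": {
                    "words_in_file": {
                        "type": "keep",
                        "keep_words_path": keep_words_path,
                    }
                },
            }
        },
        "mappings": {
            doc_type: {
                # Drop the big text field from the stored _source.
                "_source": {"excludes": ["data"]},
                "properties": {
                    "data": {"type": "text", "analyzer": "example_2"},
                    # Hypothetical field holding the file location.
                    "path": {"type": "keyword"},
                },
            }
        },
    }


index_body = build_index_body()
print(json.dumps(index_body, indent=2))
```

The trade-off is that an excluded field cannot be retrieved, highlighted, or reindexed from Elasticsearch later, so the original files would have to remain available elsewhere.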