_analyze on multiple documents

Sorry for the Korean values, which might confuse you, but they shouldn't be hard to follow as you read.

I am using an index named sliced_data, and it has millions of documents in it.
I am using Kibana and have set Mecab_Ko (a Korean tokenizer/analyzer) as the analyzer.
The analyzer is working fine, so when I run the command below

POST /sliced_data/_analyze
{
  "analyzer": "korean",
  "text": "꽃을든남자"
}

These are the results

{
  "tokens": [
    {
      "token": "꽃을",
      "start_offset": 0,
      "end_offset": 2,
      "type": "EOJEOL",
      "position": 0
    },
    {
      "token": "꽃",
      "start_offset": 0,
      "end_offset": 1,
      "type": "NNG",
      "position": 0
    },
    {
      "token": "든",
      "start_offset": 2,
      "end_offset": 3,
      "type": "INFLECT",
      "position": 1
    },
    {
      "token": "들/VV",
      "start_offset": 2,
      "end_offset": 3,
      "type": "VV",
      "position": 1
    },
    {
      "token": "남자",
      "start_offset": 3,
      "end_offset": 5,
      "type": "NNG",
      "position": 2
    }
  ]
}

I want to collect the tokens that have "NNG" as the value of "type".

So this means that I have to _analyze millions of texts to get the result I want.

It would take a long time to run the query a million times.

Is there any way that Elasticsearch provides to _analyze multiple documents?

I found a way to analyze an array of texts, as shown below, but it would be hard to paste in all the texts since I have millions of documents.

POST /sliced_data/_analyze
{
"analyzer": "korean",
"text": ["꽃을 든 남자", "초보개발자", "blashhs", "blahblah", "blahblahblahblah"]
}

Is there a good solution??

Thank you.

There is no easier way to do it; you will have to run batches of documents through the _analyze API by passing an array to "text", as you did in your last example.
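To sketch that batching approach client-side, the helpers below chunk a list of texts and pull out the "NNG" tokens from an _analyze response body. This is a minimal sketch assuming Python; the index name, analyzer name, and the `requests` usage in the comments are taken from this thread or are illustrative, not a definitive implementation.

```python
def chunk(texts, size):
    """Split a list of texts into batches to pass as the "text" array of _analyze."""
    for i in range(0, len(texts), size):
        yield texts[i:i + size]

def nng_tokens(analyze_response):
    """Collect the tokens whose "type" is "NNG" from an _analyze response body."""
    return [t["token"] for t in analyze_response["tokens"] if t["type"] == "NNG"]

# Hypothetical usage against a local cluster (requires the third-party `requests` package):
# import requests
# for batch in chunk(all_texts, 100):
#     body = {"analyzer": "korean", "text": batch}
#     resp = requests.post("http://localhost:9200/sliced_data/_analyze", json=body)
#     print(nng_tokens(resp.json()))
```

With a batch size of, say, 100, this reduces a million documents to roughly ten thousand _analyze calls rather than a million.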


Is there any way to pass an array to "text" without typing it out?
For now, I have made an array that holds all the strings, and I am analyzing them one by one.
I would have to run _analyze a million times if I have a million documents to analyze.
Is there a way to pass an array to "text"?

Thanks.

Do you only collect "NNG" tokens in your collection?
How about using the keep_types filter with _reindex?
It keeps only the terms that have the types you specify, so you can get the terms from the field with keep_types.
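For reference, a minimal keep_types setup might look like the following. The index, analyzer, and filter names here are illustrative; keep_types is a token filter, so it drops unwanted token types at analysis time (both at index time and at search time, depending on which analyzer the field uses).

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "nng_only": {
          "type": "keep_types",
          "types": ["NNG"]
        }
      },
      "analyzer": {
        "korean_nng": {
          "type": "custom",
          "tokenizer": "mecab_ko_standard_tokenizer",
          "filter": ["nng_only"]
        }
      }
    }
  }
}
```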

I tried using keep_type.

PUT /extras
{
  "settings": {
    "analysis": {
      "analyzer": {
        "korean": {
          "type": "custom",
          "tokenizer": "mecab_ko_standard_tokenizer",
          "filter": ["erase_noise"]
        }
      },
      "filter": {
        "erase_noise": {
          "type": "keep_types",
          "types": ["NNG"]
        }
      }
    }
  },
  "mappings": {
    "product_details": {
      "properties": {
        "message": {
          "type": "text",
          "analyzer": "korean",
          "search_analyzer": "korean"
        }
      }
    }
  }
}
This is what I tried.
I think that if I can get a list of all the tokens in the index, there will be no problem, but I don't know how.
Is there a way to see all the tokens in the index?

Thanks.

Is it similar to this?

POST _reindex
{
  "source": {
    "index": "sliced_data"
  },
  "query": {
    "filter": {
      "erase_noise": {
        "type": "keep_types",
        "types": ["NNG"]
      }
    }
  },
  "dest": {
    "index": "all_nng"
  }
}

Please give me some advice.

Thanks.
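For comparison, _reindex itself only copies documents; keep_types is a token filter, not a query clause, so the filtering happens through the destination index's analyzer (defined as in the earlier PUT request), and any query belongs inside "source". A minimal sketch, assuming the all_nng index was created with the erase_noise filter:

```
POST _reindex
{
  "source": {
    "index": "sliced_data"
  },
  "dest": {
    "index": "all_nng"
  }
}
```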

You can get terms with a terms aggregation.
And you can get all terms using partitions: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
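A sketch of that approach, assuming the message field from the mapping earlier in the thread; the index name, partition count, and size here are illustrative. Note that a terms aggregation on an analyzed text field requires fielddata to be enabled on that field (or a keyword sub-field to aggregate on instead). You would run this once per partition, from 0 up to num_partitions - 1, to walk through all the terms.

```
GET /all_nng/_search
{
  "size": 0,
  "aggs": {
    "all_terms": {
      "terms": {
        "field": "message",
        "include": {
          "partition": 0,
          "num_partitions": 20
        },
        "size": 10000
      }
    }
  }
}
```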
