_analyze on multiple documents

Sorry for the Korean values, which might confuse you, but they shouldn't be hard to follow as you read.

I am using an index named sliced_data, and it has millions of documents in it.
I am using Kibana and have set Mecab_Ko (a Korean tokenizer/analyzer) as the analyzer.
The analyzer is working fine, so when I run the command below

POST /sliced_data/_analyze
{
  "analyzer": "korean",
  "text": "꽃을든남자"
}

These are the results

{
  "tokens": [
    {
      "token": "꽃을",
      "start_offset": 0,
      "end_offset": 2,
      "type": "EOJEOL",
      "position": 0
    },
    {
      "token": "꽃",
      "start_offset": 0,
      "end_offset": 1,
      "type": "NNG",
      "position": 0
    },
    {
      "token": "든",
      "start_offset": 2,
      "end_offset": 3,
      "type": "INFLECT",
      "position": 1
    },
    {
      "token": "들/VV",
      "start_offset": 2,
      "end_offset": 3,
      "type": "VV",
      "position": 1
    },
    {
      "token": "남자",
      "start_offset": 3,
      "end_offset": 5,
      "type": "NNG",
      "position": 2
    }
  ]
}

I want to collect the tokens that have "NNG" as the value of "type".

So this means that I have to _analyze millions of texts to get the result I want.

It would take a long time to run the query a million times.

Is there any way that Elasticsearch provides to _analyze multiple documents?

I found a way to analyze an array of texts, as shown below, but it would be hard to paste in all the texts since I have millions of documents.

POST /sliced_data/_analyze
{
"analyzer": "korean",
"text": ["꽃을 든 남자", "초보개발자", "blashhs", "blahblah", "blahblahblahblah"]
}

Is there a good solution??

Thank you.

There is no easier way to do it; you will have to run batches of documents through the _analyze API by passing an array to "text", as you did in your last example.
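To sketch that batching approach client-side, the helpers below chunk a list of texts and pull out the "NNG" tokens from an _analyze response body. This is a minimal sketch assuming Python; the index name, analyzer name, and the `requests` usage in the comments are taken from this thread or are illustrative, not a definitive implementation.

```python
def chunk(texts, size):
    """Split a list of texts into batches to pass as the "text" array of _analyze."""
    for i in range(0, len(texts), size):
        yield texts[i:i + size]

def nng_tokens(analyze_response):
    """Collect the tokens whose "type" is "NNG" from an _analyze response body."""
    return [t["token"] for t in analyze_response["tokens"] if t["type"] == "NNG"]

# Hypothetical usage against a local cluster (requires the third-party `requests` package):
# import requests
# for batch in chunk(all_texts, 100):
#     body = {"analyzer": "korean", "text": batch}
#     resp = requests.post("http://localhost:9200/sliced_data/_analyze", json=body)
#     print(nng_tokens(resp.json()))
```

With a batch size of, say, 100, this reduces a million documents to roughly ten thousand _analyze calls rather than a million.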


Is there any way to pass an array to "text" without typing it out?
For now, I have made an array that holds all the strings, and I am analyzing them one by one.
I would have to run _analyze a million times if I have a million documents to analyze.
Is there a way to pass an array to "text"?

Thanks.

Do you only collect "NNG" tokens in your collection?
How about using the keep_types filter with _reindex?
It keeps only the terms that have the types you specify, so you can get the terms from the field with keep_types.
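For reference, a minimal keep_types setup might look like the following. The index, analyzer, and filter names here are illustrative; keep_types is a token filter, so it drops unwanted token types at analysis time (both at index time and at search time, depending on which analyzer the field uses).

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "nng_only": {
          "type": "keep_types",
          "types": ["NNG"]
        }
      },
      "analyzer": {
        "korean_nng": {
          "type": "custom",
          "tokenizer": "mecab_ko_standard_tokenizer",
          "filter": ["nng_only"]
        }
      }
    }
  }
}
```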

I tried using keep_type.

PUT /extras
{
  "settings": {
    "analysis": {
      "analyzer": {
        "korean": {
          "type": "custom",
          "tokenizer": "mecab_ko_standard_tokenizer",
          "filter": ["erase_noise"]
        }
      },
      "filter": {
        "erase_noise": {
          "type": "keep_types",
          "types": ["NNG"]
        }
      }
    }
  },
  "mappings": {
    "product_details": {
      "properties": {
        "message": {
          "type": "text",
          "analyzer": "korean",
          "search_analyzer": "korean"
        }
      }
    }
  }
}
This is what I tried.
I think that if I can get a list of all the tokens in the index, there will be no problem, but I don't know how.
Is there a way to see all the tokens in the index?

Thanks.

Is it similar to this?

POST _reindex
{
  "source": {
    "index": "sliced_data"
  },
  "query": {
    "filter": {
      "erase_noise": {
        "type": "keep_types",
        "types": ["NNG"]
      }
    }
  },
  "dest": {
    "index": "all_nng"
  }
}

Please give me some advice.

Thanks.
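For comparison, _reindex itself only copies documents; keep_types is a token filter, not a query clause, so the filtering happens through the destination index's analyzer (defined as in the earlier PUT request), and any query belongs inside "source". A minimal sketch, assuming the all_nng index was created with the erase_noise filter:

```
POST _reindex
{
  "source": {
    "index": "sliced_data"
  },
  "dest": {
    "index": "all_nng"
  }
}
```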

You can get terms with a terms aggregation.
And you can get all terms using partitions: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
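A sketch of that approach, assuming the message field from the mapping earlier in the thread; the index name, partition count, and size here are illustrative. Note that a terms aggregation on an analyzed text field requires fielddata to be enabled on that field (or a keyword sub-field to aggregate on instead). You would run this once per partition, from 0 up to num_partitions - 1, to walk through all the terms.

```
GET /all_nng/_search
{
  "size": 0,
  "aggs": {
    "all_terms": {
      "terms": {
        "field": "message",
        "include": {
          "partition": 0,
          "num_partitions": 20
        },
        "size": 10000
      }
    }
  }
}
```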
