How to use Limit Token Filter?

Hi, I'm trying to use the limit token filter to cap the amount of data stored in elasticsearch from a potentially very large text string in my main database.

I have in my settings this filter and analyzer specified:

analysis": {
  "filter": {
    "max_size_tokens": {
      "type": "limit",
      "max_token_count": "50"
    }
},
"analyzer": {
  "large_text_blobs": {
    "filter": [
      "lowercase",
      "unique",          
      "max_size_tokens",
    ],
    "type": "custom",
    "tokenizer": "standard"
  }
}

but analyzing a long string returns more than the 50 tokens I specified. (The example string is just the numbers 1 through 200, with a comma and a space between each; it returns all 200 tokens.)
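
For reference, the analyzer gets assigned to the field in the mapping along these lines (the field and type names here are just placeholders for my real ones):

"mappings": {
  "customer": {
    "properties": {
      "notes": {
        "type": "text",
        "analyzer": "large_text_blobs"
      }
    }
  }
}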

BTW this is on Elasticsearch 5.2.1, running locally, and the consume_all_tokens option does not appear to have any effect on this.
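
(For reference, I believe that option goes on the filter definition itself; this is how I tried it:)

"max_size_tokens": {
  "type": "limit",
  "max_token_count": 50,
  "consume_all_tokens": true
}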

The limit token filter will only affect the indexed value, not what is returned in results via _source. If you want to limit the field in general, you should look into using an ingest pipeline, which is applied before indexing.
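
As a rough sketch (untested, with placeholder pipeline and field names), a pipeline with a script processor could truncate the field before it's stored:

curl -XPUT "http://localhost:9200/_ingest/pipeline/truncate_large_text" -d '{
  "description": "Truncate an oversized text field before indexing",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "inline": "if (ctx.notes != null && ctx.notes.length() > 1000) { ctx.notes = ctx.notes.substring(0, 1000) }"
      }
    }
  ]
}'

Then index documents with ?pipeline=truncate_large_text so the value is shortened before it reaches the index (and the _source).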

Yes, I'm not looking to change the source value. The extra tokens I'm seeing come from the results of _analyze using the specified analyzer:

curl "http://localhost:9200/customers/_analyze?analyzer=large_text_blobs&index=customers&text=1%2C+2%2C+3%2C+4%2C+5%2C+6%2C+7%2C+8%2C+9%2C+10%2C+11%2C+12%2C+13%2C+14%2C+15%2C+16%2C+17%2C+18%2C+19%2C+20%2C+21%2C+22%2C+23%2C+24%2C+25%2C+26%2C+27%2C+28%2C+29%2C+30%2C+31%2C+32%2C+33%2C+34%2C+35%2C+36%2C+37%2C+38%2C+39%2C+40%2C+41%2C+42%2C+43%2C+44%2C+45%2C+46%2C+47%2C+48%2C+49%2C+50%2C+51%2C+52%2C+53%2C+54%2C+55%2C+56%2C+57%2C+58%2C+59%2C+60%2C+61%2C+62%2C+63%2C+64%2C+65%2C+66%2C+67%2C+68%2C+69%2C+70%2C+71%2C+72%2C+73%2C+74%2C+75%2C+76%2C+77%2C+78%2C+79%2C+80%2C+81%2C+82%2C+83%2C+84%2C+85%2C+86%2C+87%2C+88%2C+89%2C+90%2C+91%2C+92%2C+93%2C+94%2C+95%2C+96%2C+97%2C+98%2C+99%2C+100%2C+101%2C+102%2C+103%2C+104%2C+105%2C+106%2C+107%2C+108%2C+109%2C+110%2C+111%2C+112%2C+113%2C+114%2C+115%2C+116%2C+117%2C+118%2C+119%2C+120%2C+121%2C+122%2C+123%2C+124%2C+125%2C+126%2C+127%2C+128%2C+129%2C+130%2C+131%2C+132%2C+133%2C+134%2C+135%2C+136%2C+137%2C+138%2C+139%2C+140%2C+141%2C+142%2C+143%2C+144%2C+145%2C+146%2C+147%2C+148%2C+149%2C+150%2C+151%2C+152%2C+153%2C+154%2C+155%2C+156%2C+157%2C+158%2C+159%2C+160%2C+161%2C+162%2C+163%2C+164%2C+165%2C+166%2C+167%2C+168%2C+169%2C+170%2C+171%2C+172%2C+173%2C+174%2C+175%2C+176%2C+177%2C+178%2C+179%2C+180%2C+181%2C+182%2C+183%2C+184%2C+185%2C+186%2C+187%2C+188%2C+189%2C+190%2C+191%2C+192%2C+193%2C+194%2C+195%2C+196%2C+197%2C+198%2C+199%2C+200"

This returns the full list of tokens, where I'd expect only the first 50 (or really any 50; it doesn't matter which for this).

{"token":"1","start_offset":0,"end_offset":1,"type":"<NUM>","position":0},
...
{"token":"199","start_offset":882,"end_offset":885,"type":"<NUM>","position":19998},
{"token":"200","start_offset":887,"end_offset":890,"type":"<NUM>","position":20099}]}

I only have a 5.2.0 instance to test with right this second, but using the analyze API it works:

curl -XPOST 'http://localhost:9200/_analyze' -d '{
	"tokenizer": "standard",
	"filter": ["lowercase", "unique", {"type": "limit", "max_token_count": 5}],
	"text": "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
}'

Gives me:

{
  "tokens": [
    {
      "token": "1",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "3",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<NUM>",
      "position": 2
    },
    {
      "token": "4",
      "start_offset": 9,
      "end_offset": 10,
      "type": "<NUM>",
      "position": 3
    },
    {
      "token": "5",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<NUM>",
      "position": 4
    }
  ]
}

Hmm, thanks for pointing me at that. Your example looks like it's from Kibana...

It looks like the curl generated by 'Copy as cURL' in Kibana is formatted differently than what my Ruby gem's .analyze method sends, and for some reason that difference breaks it?

From Kibana (returns the expected number of tokens):

curl -XPOST "http://localhost:9200/help_articles/_analyze" -d'
{
  "text": "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100",
  "analyzer": "large_text_blobs"
}'

If I use the query string parameters, even in Kibana, it returns the full set of tokens instead of limiting them.

curl -XGET "http://localhost:9200/help_articles/_analyze?analyzer=large_text_blobs&index=help_articles&text=1%2C+2%2C+3%2C+4%2C+5%2C+6%2C+7%2C+8%2C+9%2C+10%2C+11%2C+12%2C+13%2C+14%2C+15%2C+16%2C+17%2C+18%2C+19%2C+20%2C+21%2C+22%2C+23%2C+24%2C+25%2C+26%2C+27%2C+28%2C+29%2C+30%2C+31%2C+32%2C+33%2C+34%2C+35%2C+36%2C+37%2C+38%2C+39%2C+40..."

Okay, in case anyone else runs into similar problems: the query-string version of this API is deprecated, so while it's surprising that it doesn't work, it isn't going to be supported in the future anyway.

The Ruby gem can actually produce either form of the request, depending on how you make the call, and the details of that are a bit hidden.

Incorrect (generates the URL-params version):

@client.indices.analyze(text: 'long string', analyzer: :large_text_blobs, index: 'help_articles')

Correct (sends the parameters in the request body):

@client.indices.analyze(body: {text: 'long string', analyzer: :large_text_blobs}, index: 'help_articles')
