How to use Limit Token Filter?

Hi, I'm trying to use the limit token filter to cap the amount of data stored in elasticsearch from a potentially very large text string in my main database.

I have in my settings this filter and analyzer specified:

analysis": {
  "filter": {
    "max_size_tokens": {
      "type": "limit",
      "max_token_count": "50"
    }
},
"analyzer": {
  "large_text_blobs": {
    "filter": [
      "lowercase",
      "unique",          
      "max_size_tokens",
    ],
    "type": "custom",
    "tokenizer": "standard"
  }
}

but analyzing a long string returns more than the 50 tokens I specified. (The example string is just the numbers 1 through 200, with a comma and a space between each; it returns all 200 tokens.)
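
For reference, the analyzer gets assigned to the field in the mapping along these lines (the field and type names here are just placeholders for my real ones):

"mappings": {
  "customer": {
    "properties": {
      "notes": {
        "type": "text",
        "analyzer": "large_text_blobs"
      }
    }
  }
}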

BTW this is on Elasticsearch 5.2.1, running locally, and the consume_all_tokens option does not appear to have any effect on this.
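
(For reference, I believe that option goes on the filter definition itself; this is how I tried it:)

"max_size_tokens": {
  "type": "limit",
  "max_token_count": 50,
  "consume_all_tokens": true
}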

The limit token filter will only affect the indexed value, not what is returned in results via _source. If you want to limit the field in general, you should look into using an ingest pipeline, which is applied before indexing.
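
As a rough sketch (untested, with placeholder pipeline and field names), a pipeline with a script processor could truncate the field before it's stored:

curl -XPUT "http://localhost:9200/_ingest/pipeline/truncate_large_text" -d '{
  "description": "Truncate an oversized text field before indexing",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "inline": "if (ctx.notes != null && ctx.notes.length() > 1000) { ctx.notes = ctx.notes.substring(0, 1000) }"
      }
    }
  ]
}'

Then index documents with ?pipeline=truncate_large_text so the value is shortened before it reaches the index (and the _source).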

Yes, I'm not looking to change the source value. The extra tokens I'm seeing come from the results of _analyze using the specified analyzer:

curl "http://localhost:9200/customers/_analyze?analyzer=large_text_blobs&index=customers&text=1%2C+2%2C+3%2C+4%2C+5%2C+6%2C+7%2C+8%2C+9%2C+10%2C+11%2C+12%2C+13%2C+14%2C+15%2C+16%2C+17%2C+18%2C+19%2C+20%2C+21%2C+22%2C+23%2C+24%2C+25%2C+26%2C+27%2C+28%2C+29%2C+30%2C+31%2C+32%2C+33%2C+34%2C+35%2C+36%2C+37%2C+38%2C+39%2C+40%2C+41%2C+42%2C+43%2C+44%2C+45%2C+46%2C+47%2C+48%2C+49%2C+50%2C+51%2C+52%2C+53%2C+54%2C+55%2C+56%2C+57%2C+58%2C+59%2C+60%2C+61%2C+62%2C+63%2C+64%2C+65%2C+66%2C+67%2C+68%2C+69%2C+70%2C+71%2C+72%2C+73%2C+74%2C+75%2C+76%2C+77%2C+78%2C+79%2C+80%2C+81%2C+82%2C+83%2C+84%2C+85%2C+86%2C+87%2C+88%2C+89%2C+90%2C+91%2C+92%2C+93%2C+94%2C+95%2C+96%2C+97%2C+98%2C+99%2C+100%2C+101%2C+102%2C+103%2C+104%2C+105%2C+106%2C+107%2C+108%2C+109%2C+110%2C+111%2C+112%2C+113%2C+114%2C+115%2C+116%2C+117%2C+118%2C+119%2C+120%2C+121%2C+122%2C+123%2C+124%2C+125%2C+126%2C+127%2C+128%2C+129%2C+130%2C+131%2C+132%2C+133%2C+134%2C+135%2C+136%2C+137%2C+138%2C+139%2C+140%2C+141%2C+142%2C+143%2C+144%2C+145%2C+146%2C+147%2C+148%2C+149%2C+150%2C+151%2C+152%2C+153%2C+154%2C+155%2C+156%2C+157%2C+158%2C+159%2C+160%2C+161%2C+162%2C+163%2C+164%2C+165%2C+166%2C+167%2C+168%2C+169%2C+170%2C+171%2C+172%2C+173%2C+174%2C+175%2C+176%2C+177%2C+178%2C+179%2C+180%2C+181%2C+182%2C+183%2C+184%2C+185%2C+186%2C+187%2C+188%2C+189%2C+190%2C+191%2C+192%2C+193%2C+194%2C+195%2C+196%2C+197%2C+198%2C+199%2C+200"

This returns the full list of tokens, where I'd expect only the first 50 (or really any 50; it doesn't matter which for this).

{"token":"1","start_offset":0,"end_offset":1,"type":"<NUM>","position":0},
...
{"token":"199","start_offset":882,"end_offset":885,"type":"<NUM>","position":19998},
{"token":"200","start_offset":887,"end_offset":890,"type":"<NUM>","position":20099}]}

I only have a 5.2.0 instance to test with right this second, but using the analyze API it works:

curl -XPOST 'http://localhost:9200/_analyze' -d '{
	"tokenizer": "standard",
	"filter": ["lowercase", "unique", {"type": "limit", "max_token_count": 5}],
	"text": "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
}'

Gives me:

{
  "tokens": [
    {
      "token": "1",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "3",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<NUM>",
      "position": 2
    },
    {
      "token": "4",
      "start_offset": 9,
      "end_offset": 10,
      "type": "<NUM>",
      "position": 3
    },
    {
      "token": "5",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<NUM>",
      "position": 4
    }
  ]
}

Hmm, thanks for pointing me at that. Your example looks like it's from Kibana...

It looks like the curl generated by 'Copy as cURL' in Kibana is formatted differently than what my Ruby gem's .analyze method sends, and for some reason that difference breaks it?

From Kibana (returns the expected number of tokens):

curl -XPOST "http://localhost:9200/help_articles/_analyze" -d'
{
  "text": "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100",
  "analyzer": "large_text_blobs"
}'

If I use the query string parameters, even in Kibana, it returns the full set of tokens instead of limiting them.

curl -XGET "http://localhost:9200/help_articles/_analyze?analyzer=large_text_blobs&index=help_articles&text=1%2C+2%2C+3%2C+4%2C+5%2C+6%2C+7%2C+8%2C+9%2C+10%2C+11%2C+12%2C+13%2C+14%2C+15%2C+16%2C+17%2C+18%2C+19%2C+20%2C+21%2C+22%2C+23%2C+24%2C+25%2C+26%2C+27%2C+28%2C+29%2C+30%2C+31%2C+32%2C+33%2C+34%2C+35%2C+36%2C+37%2C+38%2C+39%2C+40..."

Okay, in case anyone else runs into similar problems: the query-string version of this API is deprecated, so while it's surprising that it doesn't work, it isn't going to be supported in the future anyway.

The Ruby gem can actually produce either form of the request, depending on how you make the call, and the details of that are a bit hidden.

Incorrect (generates the URL-params version):

@client.indices.analyze(text: 'long string', analyzer: :large_text_blobs, index: 'help_articles')

Correct (sends the parameters in the request body):

@client.indices.analyze(body: {text: 'long string', analyzer: :large_text_blobs}, index: 'help_articles')
