Is your collection relatively static?
The best you can do is to use static Huffman and/or static dictionary
compression (possibly with escaping for new symbols).
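A minimal sketch of that last idea, assuming a hand-built table of frequent field values (the entries below are invented for illustration; a real table would be built offline from a sample of the mostly static collection). Known values become one byte, anything unseen is escaped as a literal:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class StaticDictCodec {
    private static final byte ESCAPE = (byte) 0xFF;       // marks a literal, non-dictionary value
    private static final Map<String, Byte> DICT = new HashMap<>();
    static {
        // assumed frequent values; codes 0x00..0xFE are available
        DICT.put("status=OK", (byte) 0x00);
        DICT.put("status=ERROR", (byte) 0x01);
        DICT.put("country=US", (byte) 0x02);
    }

    // Encode one value: a single byte if it is in the dictionary,
    // otherwise ESCAPE + length + UTF-8 bytes.
    public static byte[] encode(String value) {
        Byte code = DICT.get(value);
        if (code != null) {
            return new byte[] { code };
        }
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(ESCAPE);
        out.write(raw.length);          // assumes values shorter than 256 bytes
        out.write(raw, 0, raw.length);
        return out.toByteArray();
    }
}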
There are some string values that appear repeatedly, and I've looked at tokenization of strings prior to indexing, where at the app layer we would handle tokenizing user queries and untokenizing search results. The challenge here is that for tokenization to be effective for our data, we have to tokenize phrases, at which point we start losing some phrase-match granularity, since ES will only see the tokens and not the words within them.
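Roughly what that app-layer step looks like, assuming a prebuilt phrase table (the phrases and token names here are made up). Documents and user queries both go through tokenize(), and hits are run through detokenize() before display:

import java.util.LinkedHashMap;
import java.util.Map;

public class PhraseTokenizer {
    // LinkedHashMap so longer phrases can be listed first and win
    private static final Map<String, String> PHRASES = new LinkedHashMap<>();
    static {
        PHRASES.put("connection timed out", "T_CONN_TIMEOUT");
        PHRASES.put("permission denied", "T_PERM_DENIED");
    }

    // Replace known phrases with single opaque tokens before indexing / querying.
    public static String tokenize(String text) {
        for (Map.Entry<String, String> e : PHRASES.entrySet()) {
            text = text.replace(e.getKey(), e.getValue());
        }
        return text;
    }

    // Restore the original phrases in returned results.
    public static String detokenize(String text) {
        for (Map.Entry<String, String> e : PHRASES.entrySet()) {
            text = text.replace(e.getValue(), e.getKey());
        }
        return text;
    }
}

This also shows the granularity loss: once "connection timed out" has become the single token T_CONN_TIMEOUT, a query for just "timed out" no longer matches that document.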
You could try to compress each document yourself into one binary field, but
in order to achieve reasonable performance and compression on such
small documents you have to use some form of static compression.
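One way to get static compression on 200-400 byte documents is DEFLATE with a preset dictionary, which java.util.zip supports directly. A sketch, where the dictionary bytes are an assumption (in practice they would be common field names and values sampled offline from the collection):

import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DocCompressor {
    private static final byte[] DICT =
            "{\"timestamp\":\"\",\"src_ip\":\"\",\"dst_ip\":\"\",\"status\":\"OK\"}".getBytes();

    public static byte[] compress(byte[] doc) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setDictionary(DICT);       // the preset (static) dictionary
        deflater.setInput(doc);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();           // store these bytes in the binary field
    }

    // maxLength is the largest expected uncompressed size; fine for small docs.
    public static byte[] decompress(byte[] compressed, int maxLength) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] buf = new byte[maxLength];
        int n = inflater.inflate(buf);
        if (n == 0 && inflater.needsDictionary()) {   // stream was built with a preset dictionary
            inflater.setDictionary(DICT);
            n = inflater.inflate(buf);
        }
        inflater.end();
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }
}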
I think the underlying challenge is that the resulting raw data and the associated terms/analyzed values/etc. are stored uncompressed in ES, except for _source, for which we can enable compression or disable storage altogether (and retrieve each result individually). So even in the best case we are probably always looking at a 2-3x increase in disk capacity vs. the original raw data. The problem would be somewhat simpler if we were indexing large documents, such as web pages or office documents.
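For reference, a sketch of those two _source options applied via a put-mapping call; index/type names and the URL are placeholders, and the "compress" option exists only on the older ES versions being discussed here:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SourceMapping {
    public static void main(String[] args) throws Exception {
        // Compress the stored _source; alternatively "_source": { "enabled": false }
        // skips storing it entirely, at the cost of fetching originals elsewhere.
        String mapping = "{ \"event\": { \"_source\": { \"compress\": true } } }";
        HttpRequest req = HttpRequest.newBuilder(URI.create("http://localhost:9200/logs/event/_mapping"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(mapping))
                .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());
    }
}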
For example, if you have a field containing only numbers, you could
build a static Huffman code to compress it by at least 50%, since the
10 digit symbols fit in 4 bits each; or if you have a field with low
cardinality (e.g. zip code), you could use simple dictionary compression...
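A quick sketch of both ideas, with fixed 4-bit packing standing in for the static Huffman code and an invented zip-code table standing in for the offline-built dictionary:

import java.util.HashMap;
import java.util.Map;

public class FieldCodecs {
    // "1234567890" (10 bytes) -> 5 bytes, i.e. the ~50% mentioned above.
    public static byte[] packDigits(String digits) {
        byte[] out = new byte[(digits.length() + 1) / 2];
        for (int i = 0; i < digits.length(); i++) {
            int nibble = digits.charAt(i) - '0';               // 0..9 fits in 4 bits
            out[i / 2] |= (byte) (i % 2 == 0 ? nibble << 4 : nibble);
        }
        return out;
    }

    // Static dictionary for a low-cardinality field, built offline.
    private static final Map<String, Integer> ZIP_IDS = new HashMap<>();
    static {
        ZIP_IDS.put("90210", 0);
        ZIP_IDS.put("10001", 1);
        // ... remaining zip codes seen in the collection
    }

    public static int zipToId(String zip) {
        return ZIP_IDS.getOrDefault(zip, -1);   // -1 = not in dictionary (escape)
    }
}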
That's a good idea, and we can do some shortening with things like IP addresses, zip codes, credit card numbers, etc., but in the end it only saves a little disk space and also prevents us from performing range searches on the data.
I think ES supports binary fields, but I do not know if there is any
infrastructure to plug something like that into the server (you can do
it in client code). I did it before on a raw Lucene stored field and
reduced the total size to about 40% of the original.
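Roughly what the client-side Lucene version looks like, assuming Lucene 4+ (StoredField taking a byte[]; older versions use a binary Field instead) and placeholder field names. Plain GZIP is shown for brevity; for documents this small a preset-dictionary compressor like the one sketched above would do better:

import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;

public class CompressedStoredField {
    public static Document build(String id, byte[] rawJson) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(rawJson);                                  // compress the raw document
        }
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));    // searchable key
        doc.add(new StoredField("payload", bytes.toByteArray())); // compressed, stored only
        return doc;
    }
}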
We are looking at client-side grouping of documents to 1) reduce the number of terms that are stored, and 2) take advantage of _source compression. The idea is to take, say, 1000 small documents (200-400 bytes each) and insert them as a single document. At search time, we will have to process the resulting data client-side. There are some drawbacks to this - the main issue being that the Lucene query syntax is applied to the group of documents as a whole, and complex queries require us to do a lot of client-side processing. For example, we won't be able to provide an accurate count of matches, because we have to take the results returned by ES and process each one individually. But for some limited search applications, this might work.
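A rough sketch of that grouping, under the assumption that each record is a small JSON/text string; field names, the bundle format, and the crude escaping are all placeholders:

import java.util.List;
import java.util.stream.Collectors;

public class DocBundler {
    public static String buildBundleJson(String bundleId, List<String> rawRecords) {
        // One big searchable blob; ES/Lucene matching and counting now apply to
        // the bundle, not to the individual records, hence the client-side work.
        String searchable = String.join(" ", rawRecords);

        // Keep the originals so a matching bundle can be re-filtered client-side.
        String stored = rawRecords.stream()
                .map(r -> "\"" + r.replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(","));

        return "{ \"bundle_id\": \"" + bundleId + "\","
             + " \"text\": \"" + searchable.replace("\"", "\\\"") + "\","
             + " \"records\": [" + stored + "] }";
    }

    // After a bundle hits, the client re-applies the match per record to get
    // accurate per-record counts.
    public static List<String> filterRecords(List<String> records, String term) {
        return records.stream()
                .filter(r -> r.contains(term))
                .collect(Collectors.toList());
    }
}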