Hi Oli,
I don't have a great "production-ready" answer, but you might be interested in this talk I've delivered several times. In it, I give this demo, which tokenizes images by position and RGB value to find similar images. It's all rather naive and intended more for educational purposes.
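To give a rough feel for what "tokenizing by position and RGB" could look like, here is a minimal sketch. It is not the code from the demo: the function names, the token format, and the choice of four quantization levels per channel are all assumptions made up for illustration. Each pixel becomes one text token that encodes where it is and roughly what color it is, so two images that share many tokens look alike.

```python
def image_tokens(pixels, levels=4):
    """Turn a small image into text tokens (hypothetical scheme).

    pixels: rows of (r, g, b) tuples with channel values 0-255.
    levels: number of buckets per color channel (an assumed default).
    """
    step = 256 // levels
    tokens = []
    for y, row in enumerate(pixels):
        for x, (r, g, b) in enumerate(row):
            # Position plus quantized color, e.g. "x0_y0_r3_g0_b0"
            tokens.append(f"x{x}_y{y}_r{r // step}_g{g // step}_b{b // step}")
    return tokens


def token_overlap(tokens_a, tokens_b):
    """Jaccard overlap of two token lists, as a naive similarity score."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


red_blue = [[(255, 0, 0), (0, 0, 255)]]
red_green = [[(250, 10, 5), (0, 255, 0)]]
print(image_tokens(red_blue))
print(token_overlap(image_tokens(red_blue), image_tokens(red_green)))
```

In a search engine you would index the tokens as a whitespace-separated text field and let the normal relevance scoring stand in for the overlap function. In practice you would also downsample the image first so the token count stays manageable.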
Sujit Pal, however, came to one of my talks and went on to write a series of blog articles on image similarity/search, including this one and this one.
I personally don't mind hacking term frequency to mean some arbitrary feature strength. Creating dumb fields like "blue blue blue blue" to mean "blue" of strength "4" can work, with all the caveats (disable _source, field storage, analysis, etc.). Others may think I'm crazy.
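As a sketch of the term-frequency trick, the helper below repeats each term once per unit of strength to build the field text. The function name and the dict input format are hypothetical, and it assumes strengths are already non-negative integers (you would have to scale and round real-valued features yourself).

```python
def strength_to_text(features):
    """Build a field value where term frequency encodes feature strength.

    features: dict mapping a term to a non-negative integer strength
    (hypothetical input format, chosen for this example).
    """
    parts = []
    for term, strength in features.items():
        if not isinstance(strength, int) or strength < 0:
            raise ValueError(f"strength for {term!r} must be a non-negative int")
        # Repeat the term so its term frequency equals the strength
        parts.extend([term] * strength)
    return " ".join(parts)


print(strength_to_text({"blue": 4}))            # blue blue blue blue
print(strength_to_text({"blue": 4, "red": 1}))  # blue blue blue blue red
```

The resulting string would go into a field with a simple whitespace-style analysis so the repeated terms survive as-is. Keep in mind that most scoring functions dampen term frequency, so a strength of 4 will not score exactly four times a strength of 1.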