Elasticsearch 5 conversion


(David Earl) #1

Having spent a while since it was released converting my app from es2.4 to es5, I thought I'd just say that once I'd got over setup changes, it all seems to be working very smoothly. Thank you.

The only errors I've had have been mine in doing the necessary changes. My app ( https://cameo-membership.uk ) makes very diverse queries, both paged and direct, and lots of different kinds of update. My data sets are quite small - tens of thousands of records, rather than millions or billions, but on the other hand, it runs on dedicated Raspberry Pis, so rather modest memory and processor speeds.

I've been pleasantly surprised how robust the first release of es5 seems to be. Subjectively it certainly feels faster, especially on paged searches. I only hit one problem where something didn't work in es5 that had in 2.4 ( Getting 500 error in es5.0.0 ), easily fixed once I understood the problem.

The software upgrade required careful reading of the documentation:

  • requiring Java 1.8 is a bit annoying, as it is newer than the most recent version packaged for Debian Jessie, so I had to think about how to get it rather than it just being there, and I wasn't confident that the OpenJDK 1.8.0 JRE would be sufficient (I gather it is).
  • The default file descriptor limit is just ONE less than the limit required! And I found conflicting instructions for what to do about this, neither of which worked, but I found the right config file in the end.
  • A clash between the directory name and the cluster name gave me an error I still don't know how to resolve. But in the end, as I was reloading all the data anyway, I just deleted everything and started over.
  • A seemingly gratuitous change that disallows POST for index creation had me puzzled for ages (though I did find the documented change when I looked). Since all other PUTs allow POSTs, this seems a rather unnecessary limitation, though I'm sure there are reasons.
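For anyone hitting the same thing: the fix was simply switching the verb. In 5.x an index must be created with PUT, where 2.4 accepted POST as well. A minimal sketch (the index name and settings here are hypothetical):

```json
PUT /members
{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
}
```

The equivalent POST /members now returns an error in 5.x.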

I took the opportunity to review all my mappings, remove crud that had hung around from the early days, and consider carefully where I should use the new text vs keyword types. This threw up an unexpected problem that I wasn't aware of in 2.4: it isn't possible to sort on text fields (formerly analyzed strings), as opposed to keyword fields (or at least, doing so would require a change which is clearly not recommended). Was I sorting on text only because I was inappropriately re-using my "name" field, which was text (indeed, text with a case-folding and accent-folding analyzer)? I concluded not: I really do want the possibility of finding at least some of these entities by a general search, but also to be able to present them in order when I fetch them. However, given the small number of records usually involved, and that I'm generally fetching them all, it's not a big deal to sort them on receipt.
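For completeness, the documented alternative to sorting client-side is a multi-field: index the value once as text for searching and once as a keyword sub-field for sorting. I didn't go this route, but a sketch would look like this (the sub-field name "raw" is hypothetical; the analyzer name is from my own mapping):

```json
"name": {
    "type": "text",
    "analyzer": "st_lc_af",
    "fields": {
        "raw": {
            "type": "keyword"
        }
    }
}
```

Searches then go against "name" as before, while sorts target "name.raw".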

The main change for me was and/or=>bool. While bool is more general, I think it is also harder to read and follow in the code, and it meant widespread though localised code changes, each of which was an opportunity to make a mistake. I think it's a shame and/or have gone. What I've ended up with in my code is effectively a bool generator for and/or which takes an array of queries, so it still looks like and/or in the code. (There are occasions where bool is useful for its generality too; I was previously using both.)
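As a concrete example of the shape of the change: where 2.x accepted an and filter wrapping a list of clauses, 5.x wants a bool query with the clauses under must (the field names below are hypothetical):

```json
{
    "query": {
        "bool": {
            "must": [
                { "term": { "status": "active" } },
                { "term": { "type": "member" } }
            ]
        }
    }
}
```

Similarly, or becomes should; when a bool with only should clauses is used in query context, at least one should clause must match by default, but it is safest to say so explicitly with "minimum_should_match": 1.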

So in conclusion, my experience has been largely positive with some minor frustrations, but I'm still wary of jumping too early because of the extent of code changes it has needed, and therefore testing required.


(evert) #2

Nice review!

One point I would like to know... did you have a custom analyzer? How did you manage to fix it?

You mentioned accent-folding analyzer, reason why I am asking.

I used to have in my mapping:

'properties'    => [
    'file'      => [
        'type'      => 'attachment',
        'fields'    => [
            'content'   => [
                'type'          => 'string',
                'term_vector'   => 'with_positions_offsets',
                'store'         => true,
                'analyzer'      => 'brazilian'
            ]
        ]
    ],
]

Please note the analyzer. Now, with ingest, it is a lot faster, as far as I could test: I have the processors working, extracting the base64-encoded PDF content, but the analyzer for Brazilian accents is not bringing back results unless the term is written exactly.

I mean, on ES2.4 a search for either pródigo or prodigo would bring the same result; on ES5.0 it does not.

Did you have this kind of problem?

The docs for this subject have not been updated yet.

Thanks in advance!


(Mark Walkom) #3

Thanks for the comments, it's always good to get feedback :smiley:

Given 1.7 is EOL, I'd say the problem is with Debian, not java or ES :wink:


(David Earl) #4

Well, perhaps, but Debian 8 is only a year old. My experience of using managed servers elsewhere is that they generally run significantly older versions of Linux - CentOS, for example, a whole version behind the current release - and on a managed server you often don't have the option to upgrade. Mind you, managed servers don't come with elasticsearch either :slight_smile:

If anyone else needs this for Debian 8 Jessie, the recipe is (as root or sudo):

  1. add deb http://http.debian.net/debian jessie-backports main to /etc/apt/sources.list
  2. apt-get update
  3. apt-get remove openjdk-7-jre-headless openjdk-7-jdk
  4. apt-get install openjdk-8-jre-headless openjdk-8-jdk

(David Earl) #5

So, I am also doing this (this is one of the principal reasons I chose to use elasticsearch for this project). But I didn't change how I do this for es5, except that the fields on which I'm working are now "text" fields rather than "string". I have checked this transliteration matching works in es5.

So I have, for example:

"my_type": {
    "properties": {
        "name": {
            "type": "text",
            "analyzer": "st_lc_af"
        },
        ...
    }
}

and in my index creation I have:

{
    "settings": {
        ...
        "analysis": {
            "analyzer": {
                "st_lc_af": {
                    "filter": [
                        "standard",
                        "lowercase",
                        "asciifolding"
                    ],
                    "tokenizer": "standard"
                }
            }
        }
    }
}

(David Earl) #6

I should say, it is of course the "asciifolding" filter that does the transliteration, or "de-accented matching".
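I haven't tried this with Portuguese content, but if you want the Brazilian stemming as well as the de-accenting, something along these lines ought to work: a custom filter using the documented stemmer token filter, combined with asciifolding (filter and analyzer names here are hypothetical; this is an untested sketch):

```json
{
    "settings": {
        "analysis": {
            "filter": {
                "br_stem": {
                    "type": "stemmer",
                    "language": "brazilian"
                }
            },
            "analyzer": {
                "br_folding": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding",
                        "br_stem"
                    ]
                }
            }
        }
    }
}
```

One caveat: folding accents before stemming can change what the stemmer produces, so you may want to experiment with the filter order against your own data.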


(evert) #7

thanks!


(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.