Solr has a Tagger Handler feature. Is there an out-of-the-box equivalent for it in ES?
How does the tagger work?
The tagger handler relies on a dedicated collection in which it stores the entities to be extracted. In this collection, one field is used to store the texts used to recognize each entity, and you may create as many other fields as you want to store other useful information about your entities.
Assume we want to recognize city names in our documents. We can have fields storing the timezone, the location (longitude and latitude) as well as the country of the city. The “tag” field could contain the different names that are used to designate the city, such as “New York City” and “NYC” for example.
Once the collection is created and populated, and the handler properly configured, you can pass text to the handler and receive the list of entities found in that text. Matching uses only the text stored in the “tag” field, but you can ask the tagger to return any other fields of the entity using the standard fl parameter.
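To make the description above concrete, here is a minimal in-memory sketch of the same idea. It is not Solr's implementation (Solr matches against an index, not a regex); the ENTITIES list, the field names other than "tag", and the tag() function are all hypothetical and only illustrate how matching on the "tag" field and returning other fields via an fl-style parameter fit together.

```python
import re

# Hypothetical stand-in for the tagger collection: each entity has a "tag"
# field (the names to match on) plus arbitrary extra fields.
ENTITIES = [
    {"id": "nyc", "tag": ["New York City", "NYC"],
     "country": "US", "timezone": "America/New_York"},
    {"id": "paris", "tag": ["Paris"],
     "country": "FR", "timezone": "Europe/Paris"},
]


def build_pattern(entities):
    """Compile one regex that matches any tag, trying longer names first."""
    lookup = {}
    for entity in entities:
        for name in entity["tag"]:
            lookup[name.lower()] = entity
    names = sorted(lookup, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in names)
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return pattern, lookup


def tag(text, entities, fl=("id",)):
    """Return one hit per match: offsets plus the requested entity fields."""
    pattern, lookup = build_pattern(entities)
    hits = []
    for match in pattern.finditer(text):
        entity = lookup[match.group(0).lower()]
        hit = {"startOffset": match.start(), "endOffset": match.end()}
        # Only the fields listed in fl are copied into the response.
        hit.update({field: entity[field] for field in fl if field in entity})
        hits.append(hit)
    return hits
```

Calling tag("I flew from NYC to Paris.", ENTITIES, fl=("id", "country")) returns two hits, each carrying the character offsets of the match and only the id and country fields.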
Thanks for your reply, this is a pretty interesting plugin. But the idea is not to use an NER model. Instead, I have a CSV file holding the entity IDs, names and synonyms. The Solr Tagger Handler can read a CSV file like that, for example.
Also, the intention is not to index the text, but just to get its tags returned. So I was wondering if ES offers anything out of the box for that, but I'm afraid it doesn't.
AFAIK the Solr text tagger is a kind of gazetteer (dictionary-based) and uses a second index to store the named entities and associated metadata. That's as much as I remember from using it about 5 years ago with Solr. The percolate API might get you some way there, at least for a reasonably low number of entities (probably in the low tens of thousands, but likely not in the millions, though I might be mistaken there).
That said, I did some experiments with making the text tagger code work as an ES plugin a while ago, but it always got stalled for some reason (mostly lack of time). It would be nice to get a better feeling for your use case, and for whether you can use Percolator or why you cannot, to see if this would be a good addition to the ES plugin ecosystem.
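As a rough sketch of how the percolator route could look, the snippet below only builds the JSON bodies and does not talk to a cluster. The field names (query, text, entity_id, name) and the choice of one match_phrase query per surface form are assumptions for illustration, not a tested recipe; the percolator field type and the percolate query themselves are standard Elasticsearch features.

```python
import json

# Hypothetical index layout: "query" holds the stored percolator query,
# "text" is the field the incoming document is matched on, and the
# remaining fields carry entity metadata to return with each hit.
MAPPING = {
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            "text": {"type": "text"},
            "entity_id": {"type": "keyword"},
            "name": {"type": "keyword"},
        }
    }
}


def entity_to_query_docs(entity_id, name, synonyms):
    """Build one stored query document per surface form of the entity."""
    docs = []
    for surface in [name, *synonyms]:
        docs.append({
            "query": {"match_phrase": {"text": surface}},
            "entity_id": entity_id,
            "name": name,
        })
    return docs


def percolate_request(text):
    """Search body asking which stored queries match the given text."""
    return {
        "query": {
            "percolate": {"field": "query", "document": {"text": text}}
        }
    }


if __name__ == "__main__":
    for doc in entity_to_query_docs("1", "New York City", ["NYC"]):
        print(json.dumps(doc))
    print(json.dumps(percolate_request("Flights to NYC are delayed")))
```

The text being tagged is passed inline in the percolate query, so it is not indexed; each hit is a stored query document, and its entity_id and name fields identify the matched entity.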
Sorry for the late reply. Percolate does indeed seem to be the closest ES tool for what we are aiming for. The project was put on hold for now, though. We intend to take a closer look at it later (if that happens, I'll leave feedback here). Thanks!
In a nutshell, the intention is to store data from a CSV containing columns like id, name and synonyms (the most important attribute), which would be the tags. Then, via REST, we'd send text to it and get the related tags in the response (it could be the synonyms, the name or any other stored column).
I haven't looked into Percolate in depth yet to see how well it would serve us. If we get there, I'll leave more accurate feedback here.
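For illustration, the CSV-to-tags flow described above might be prototyped like this, independently of Solr or ES. The column layout, the "|" separator inside the synonyms column, and the load_entities/find_tags helpers are assumptions, since the actual file format is not shown in this thread.

```python
import csv
import io
import re

# Assumed file layout: one row per entity, synonyms separated by "|".
SAMPLE_CSV = """id,name,synonyms
1,New York City,NYC|Big Apple
2,Los Angeles,LA
"""


def load_entities(csv_text, synonym_sep="|"):
    """Map every lowercased name or synonym to its entity record."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        synonyms = [s.strip() for s in row["synonyms"].split(synonym_sep)
                    if s.strip()]
        record = {"id": row["id"], "name": row["name"], "synonyms": synonyms}
        for surface in [row["name"], *synonyms]:
            lookup[surface.lower()] = record
    return lookup


def find_tags(text, lookup, column="id"):
    """Scan word n-grams (longest first) and return the chosen column."""
    words = re.findall(r"\w+", text.lower())
    max_len = max(len(key.split()) for key in lookup)
    found = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in lookup:
                found.append(lookup[candidate][column])
                i += n  # skip past the matched words
                break
        else:
            i += 1  # no entity starts at this word
    return found
```

With the sample data, find_tags("Moving from LA to the Big Apple", load_entities(SAMPLE_CSV), column="name") returns ["Los Angeles", "New York City"]. Names containing punctuation would need extra normalization, which this sketch leaves out.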