How to better search across dozens or hundred fields?


(Rodrigo Rosenfeld Rosas) #1

Hi, this is my first time here (it's also my first time using this tool, as I'm usually communicating through a mailing list). I'm a beginner with regards to full-text search engines in general. I've used Solr for some years but only to some not too complex queries or field types.

I'm currently working on a new product and this time I need a much more advanced use case and I couldn't find any examples on how to handle something like that in the documentation or anywhere else in the Internet.

It's basically a collection management system where items can have an arbitrary number of properties (something like a few dozens to a few hundreds, it's hard for me to be sure about it as we're in a very early stage). There are authorization rules preventing some users from having access to all of those fields and those rules can change any time and so can the number of fields attributes an item could hold.

That means I'd have to reindex the entire collection (often thousands of items per client) everytime a new field would be added or removed from the template if I was going to use some ElasticSearch field with all attributes concatenated into it. Not only that but I'd have to create many combination of fields like that, one per possible permissions set an user could have. Also, changes in the permissions would also trigger the application to reindex the entire collection(s).

So, I was considering indexing each field separately. The client wants to search the collection using a single field with words they're interested in and this search should be performed across every field in the template.

I've come across the multi-match query that would allow me to specify up to 1024 fields, which seems enough for our needs. I'd then be able to specify which fields are allowed to be seen by the user performing the query and specify only those fields to the multi-match query. If a new field would be added to the template I wouldn't have to reindex the entire collection because those fields would be initially blank. If I deleted one field from the template that wouldn't be a problem either as that field would no longer be included in the multi-match queries, so no need to reindex the collection either.

Most examples talk about using a few fields such as First Name and Last Name, but I'm worried its performance could get really bad when using dozens or hundreds of fields in the query as the equivalent Lucene queries could become very big. My understanding is that users would like all words to match the concatenation of all fields in the template, so I thought about using the cross_fields strategy with the AND operator.

Also, users would like to see which fields matched any word of the provided text, and I'm not sure what is the best way to achieve that although I suspect I'd had to enable highlighting in the query for this.

I suspect the requirements above are not that uncommon even if I wasn't able to find any examples when looking for them.

Is the multi-match query using the cross_fields strategy the way to go for something like that? Should I use highlightings in order to figure out which fields matched any word in the query? Will I have severe performance issues using that strategy if the template has a few dozen searchable fields? What about a hundred? I'm talking about a collection of thousands of items to a hundred thousands. I expect most of the fields to be short sentences while a few of them could have some longer description such as a few dozen lines of text maybe.

Are there any other strategies I should consider? I suspect, from my experience with Solr, that reindexing the full database could take hours, so it's not an option for a regular usage. I'd love to hear more suggestions on how to design such ElasticSearch database.

Also, some fields would be of an options type, rather than text. Option fields have values that could be one of a pre-defined list of possible labels. Those labels could be renamed any time, in the RDBMS items hold a reference to the label sequential id rather than to the label text itself, so it's quite fast to rename a label. However, if I index the label when indexing the item, renaming a label would trigger an update to all items in the collection pointing to that label, which could take a really long time depending on the size of the collection and how many of the items would refer to that label. Is it possible to somehow create JOINs in the ElasticsSearch queries so that we could join on the label numeric id and then have those numbers mapped to the corresponding current labels just like we'd do in an RDBMS? I mean, I'd like those mapped labels to enter the cross_fields list of fields if possible.

I'd appreciate any suggestions to create such user experience that could be fast enough to enable the search to complete in just a few seconds or so in the worst cases. Is that somehow possible with ElasticSearch (or Solr or any other search engine I could use)?

Sorry for the huge description but I couldn't think of any way to make it shorter while detailing exactly what I'm looking for.

Thanks in advance for any suggestions :slight_smile:

Cheers,
Rodrigo.


(system) #2

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.