In a separate but related problem, I have a field that contains URLs that can end with a variety of variables, some of which I may not want to factor into the result set of a unique terms aggregation. An example would be something like “user/profile?=1234” or “user/profile/1234”, where I would want to count both of these as a hit for the term “user/profile” and only as a hit for that term.
There are times when I would want "profile?=1234" or "/1234" and times when I would not. Is there also a way to include or exclude parts like these from a search? And if I cannot include or exclude them at search time, what would be the best way to handle this when adding my data to Elasticsearch?
Once a field has been analysed, the only way to un-analyse it is to reindex. You should use a multi-field mapping to create the analysed field plus an additional un-analysed (i.e. .raw) field.
Your second one also sounds like it could be solved by using multifields.
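To make the multi-field suggestion concrete, here is a minimal sketch of what such a mapping and a terms aggregation on the raw sub-field could look like, written as Python dictionaries that serialise to the JSON request bodies. The field name `url` and the aggregation name `unique_urls` are made up for illustration, and the mapping assumes a recent Elasticsearch version with `text` and `keyword` types; older versions expressed the same idea with `"type": "string"` and `"index": "not_analyzed"` on the sub-field.

```python
import json

# Hypothetical field name "url"; adjust to your own index.
# The top-level field is analysed for full-text search, while the
# "raw" sub-field keeps the exact, un-analysed value.
mapping = {
    "properties": {
        "url": {
            "type": "text",
            "fields": {
                "raw": {"type": "keyword"}
            },
        }
    }
}

# A terms aggregation on the un-analysed sub-field, so each whole URL
# is counted as one term instead of being split into tokens.
aggregation_request = {
    "size": 0,
    "aggs": {
        "unique_urls": {
            "terms": {"field": "url.raw"}
        }
    },
}

mapping_json = json.dumps(mapping, indent=2)
aggregation_json = json.dumps(aggregation_request, indent=2)
print(mapping_json)
print(aggregation_json)
```

The mapping has to be in place before the documents are indexed, which is why existing data needs a reindex.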
Thank you. I looked up multifields and can see how they apply to my first issue, but I am unsure how they apply to my second (counting "user/profile?=1234" or "user/profile/1234" as a hit for just "user/profile"). Could you elaborate on the second problem for me? If I now copy all URLs to an un-analysed "raw" field, how would I run an aggregation in which "user/profile" is a unique term and any extraneous text following it (like ?=1234) simply counts as a hit for that term?
I am using Logstash. Can you elaborate a bit more on what you are proposing? Are you saying to have three URL fields, and to have Logstash take the incoming URL data and copy it to a new field (after removing the extraneous text)? Sorry, I am not following so well.
I was thinking that you could break the "user/profile?=1234" or "user/profile/1234" values out into their own field and then drop everything after the profile part.
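As a rough illustration of that idea, the sketch below shows the kind of normalisation that would be applied before indexing. It is written in Python rather than as a Logstash filter, purely to show the logic. The function name `normalize_url_path` is hypothetical, and the sketch assumes the "extraneous" parts are always a query string or trailing all-numeric path segments; other URL shapes would need their own rules.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def normalize_url_path(url: str) -> str:
    """Return the URL path without its query string or trailing numeric IDs."""
    # urlsplit separates the path from anything after "?".
    path = urlsplit(url).path.rstrip("/")
    # Drop one or more trailing segments made up only of digits.
    return re.sub(r"(/\d+)+$", "", path)


urls = ["user/profile?=1234", "user/profile/1234", "user/settings"]

# Counting the normalised values mimics what a terms aggregation
# would report if this value were stored in its own un-analysed field.
counts = Counter(normalize_url_path(u) for u in urls)
print(counts)
```

Storing the normalised value in its own field alongside the original URL keeps both options open: aggregate on the trimmed field when the trailing part should be ignored, and on the raw field when it should count.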