I've been searching around trying to find the best way to do this, but
haven't really found anything so far. Any help would be appreciated.
Let's say I have simple documents that come in through a custom River:
doc:
title: string
content: string
And I want to add the following fields (numberOfNames, numberOfWords,
numberOfPlaces) and populate them based on a custom parsing of the content
field. Essentially I'm analyzing one field of the original document and
using it to populate additional fields (not part of the original document) so
that I can provide faceted search on these additional fields. Are there
some existing approaches to do this? I would imagine this is pretty common
but haven't been able to find much out there.
I was thinking of making a new Analyzer, something like MetadataAnalyzer
where you could configure a sequence of Tokenizer and Filter objects that
lead to the tokens to be indexed for a given field. For example you could
do something like this:
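(Sketching the idea here, since the concrete example got lost in posting; the analyzer name and the custom filter below are hypothetical, not existing Elasticsearch components:)

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "metadata_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "name_extractor"]
        }
      }
    }
  }
}
```

where "name_extractor" would be a custom token filter (shipped as a plugin) that keeps only the tokens I care about, so the surviving tokens for that field become the facet values.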
If the only difference between each field is the analysis, you could
use multi-fields on the original source field.
Each field can have its own analyzer (custom or not). The primary use
of multi-field is for when you want to define different analyzers on
the same source field.
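For example, a mapping sketch using the multi_field syntax of that era; the "name_analyzer" and "place_analyzer" names are assumed to be custom analyzers already defined in the index settings:

```json
{
  "doc": {
    "properties": {
      "content": {
        "type": "multi_field",
        "fields": {
          "content": { "type": "string", "analyzer": "standard" },
          "names":   { "type": "string", "analyzer": "name_analyzer" },
          "places":  { "type": "string", "analyzer": "place_analyzer" }
        }
      }
    }
  }
}
```

Each sub-field is analyzed independently from the same source value, so facets can target, say, content.names without changing the documents you index.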
Wouldn't using a multi-field for this scenario cause the same field to be tokenized four different times? Once for the normal text-field tokenization, and once more for each of the three metrics you are calculating.
Is there a way to perform all three analyses in one analyzer pipeline and then store the 3 resulting metrics to new fields?
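One alternative that sidesteps repeated analysis entirely (not something from this thread, just a sketch) is to compute the metrics once, client-side, before indexing, and store them as ordinary numeric fields. The name/place detection below is a naive placeholder, not a real implementation:

```python
import re

def enrich(doc):
    """Add derived metric fields to a document before indexing.

    The name/place heuristics here are stand-ins; a real version
    would use an NER library or a gazetteer lookup.
    """
    words = re.findall(r"\w+", doc["content"])
    doc["numberOfWords"] = len(words)
    # Naive heuristic: count capitalized tokens as "names" (placeholder only).
    doc["numberOfNames"] = sum(1 for w in words if w[0].isupper())
    doc["numberOfPlaces"] = 0  # would come from a gazetteer/NER pass
    return doc

doc = enrich({"title": "t", "content": "Alice visited Paris twice"})
```

The document then carries numberOfWords, numberOfNames, and numberOfPlaces as plain fields, so Elasticsearch never has to re-analyze the content field to facet on them.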