Is there a way to apply the following logic to a web page field that I have?
The field appears in many types, so ideally I would add an analyzer and reindex the data.
Here is the Node.js code (the language is not important) that contains the rules I need:
exports.cleanPage = function (page) {
    if (!page || page.length === 0) {
        return page;
    }
    // Remove the query string first, so a "/" right before "?" is trimmed as well.
    page = page.split("?")[0];
    if (!page.startsWith("/")) { // a page must start with "/"
        page = "/" + page;
    }
    if (page.length > 1 && page.endsWith("/")) { // prevent duplicates such as en, /en/, en/
        page = page.slice(0, -1);
    }
    return page.toLowerCase();
};
There is no direct component that removes query parameters and normalizes parts of URLs. There is the uax_url_email tokenizer, which keeps URLs as a single token. You could, however, use the pattern_replace token filter with regular expressions to trim down your path.
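As a rough sketch of the pattern_replace approach, the index settings below chain two pattern_replace token filters and the lowercase filter behind the keyword tokenizer. The names page_path, strip_query and normalize_slashes are made up for illustration, and the regular expressions are only one possible encoding of the rules in the question. The simulatePagePath helper is a local JavaScript approximation of the filter chain for quick checks, not something Elasticsearch provides.

```javascript
// Hypothetical filter definitions; the patterns are an assumption, not a tested mapping.
const stripQuery = { pattern: "\\?.*", replacement: "" };            // drop "?..." and everything after it
const normalizeSlashes = { pattern: "^/?(.*?)/?$", replacement: "/$1" }; // one leading "/", no trailing "/"

// Body for index creation; "page_path" is an illustrative analyzer name.
const indexSettings = {
  settings: {
    analysis: {
      filter: {
        strip_query: { type: "pattern_replace", ...stripQuery },
        normalize_slashes: { type: "pattern_replace", ...normalizeSlashes },
      },
      analyzer: {
        page_path: {
          type: "custom",
          tokenizer: "keyword", // keep the whole value as a single token
          filter: ["strip_query", "normalize_slashes", "lowercase"],
        },
      },
    },
  },
};

// Local approximation of the same chain, useful for trying the patterns out.
function simulatePagePath(page) {
  return page
    .replace(new RegExp(stripQuery.pattern), stripQuery.replacement)
    .replace(new RegExp(normalizeSlashes.pattern), normalizeSlashes.replacement)
    .toLowerCase();
}
```

Keep in mind that an analyzer only changes the terms that are indexed; the stored _source still contains the original value.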
I'd still recommend doing that before indexing the data; it will probably be easier to understand than complex regular expressions.
With 5.0 you could take a look at the new ingest node feature, which can assist you with this and change the field to your needs before indexing.
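For the ingest node route, a pipeline along the following lines might work. It is only a sketch: the pipeline name clean_page and the field name page are assumptions, and the gsub patterns are one possible translation of the rules in the question.

```javascript
// Hypothetical pipeline definition; the field name "page" is an assumption.
const cleanPagePipeline = {
  description: "Normalize the page field before indexing",
  processors: [
    // Remove the query string.
    { gsub: { field: "page", pattern: "\\?.*", replacement: "" } },
    // Force exactly one leading "/" and drop a trailing "/".
    { gsub: { field: "page", pattern: "^/?(.*?)/?$", replacement: "/$1" } },
    { lowercase: { field: "page" } },
  ],
};

// JSON body for: PUT _ingest/pipeline/clean_page
const pipelineBody = JSON.stringify(cleanPagePipeline, null, 2);
```

Documents indexed with the pipeline=clean_page request parameter would then pass through these processors, so the cleaned value ends up in the stored document as well as in the index.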