Analyzer for web page name


(liorg2) #1

Is there a way to apply the following logic to a web page field that I have?
I have it in many types, so ideally, if I could add an analyzer and reindex the data, that would be great.
Here is some Node.js code (the language is not important) that includes the rules I need:

exports.cleanPage = function (page) {

    if (!page || page.length === 0)
        return page;

    if (page.indexOf("?") > -1) {
        // remove the query string first, so "/en/?a=1" is treated like "/en/"
        page = page.split("?")[0];
    }

    if (!page.startsWith("/")) { // a page must start with "/"
        page = "/" + page;
    }

    // prevent duplicates such as en, /en/, en/ (the root page "/" is kept as is)
    if (page.length > 1 && page.endsWith("/")) {
        page = page.slice(0, -1);
    }

    return page.toLowerCase();

};

I'm using 1.7


(Alexander Reelsen) #2

Hey,

there is no direct component to remove query parameters and normalize parts of URLs. There is the uax_url_email tokenizer, which keeps a URL as a single token. You could, however, use the pattern_replace token filter to trim down your path with regular expressions.
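
As a rough sketch of the pattern replace approach, the settings below combine the keyword tokenizer (so the page name stays one token) with three pattern replace token filters and the lowercase filter. The analyzer name page_name, the filter names, and the regular expressions are assumptions made for illustration and are not tested against 1.7. The small simulatePageName helper is hypothetical too: it only approximates the filter chain with JavaScript regular expressions, whereas Elasticsearch uses Java regular expressions.

```javascript
// Sketch of index settings; "page_name" and the filter names are hypothetical.
const settings = {
  analysis: {
    filter: {
      // drop everything from the first "?" onwards
      strip_query: { type: "pattern_replace", pattern: "\\?.*$", replacement: "" },
      // make sure the value starts with exactly the slash it had, or add one
      leading_slash: { type: "pattern_replace", pattern: "^/?", replacement: "/" },
      // drop one trailing slash, but leave the root page "/" alone
      trailing_slash: { type: "pattern_replace", pattern: "(.)/$", replacement: "$1" }
    },
    analyzer: {
      page_name: {
        type: "custom",
        tokenizer: "keyword", // keep the whole page name as a single token
        filter: ["strip_query", "leading_slash", "trailing_slash", "lowercase"]
      }
    }
  }
};

// Local approximation of the filter chain, only meant to sanity-check the patterns.
function simulatePageName(page) {
  const { filter, analyzer } = settings.analysis;
  return analyzer.page_name.filter.reduce((value, name) => {
    if (name === "lowercase") return value.toLowerCase();
    const { pattern, replacement } = filter[name];
    return value.replace(new RegExp(pattern, "g"), replacement);
  }, page);
}

console.log(simulatePageName("en/?a=1")); // prints "/en"
```

Keep in mind that an analyzer only changes the terms that end up in the index. The stored _source still contains the original value, so anything that reads the field back from the document sees the unnormalized page name.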

I'd still recommend doing that before indexing the data. That is likely to be much easier to understand than complex regular expressions.

With 5.0 you could take a look at the new ingest node feature, which can assist you with this by changing the field to your needs before indexing.
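
A minimal sketch of what such an ingest pipeline could look like, using the gsub and lowercase processors. The field name page, the description, and the regular expressions are assumptions for illustration. The simulatePipeline helper is hypothetical and only approximates the processors locally with JavaScript regular expressions, so treat it as a sanity check of the patterns, not as a substitute for testing against a real node.

```javascript
// Sketch of an ingest pipeline body; the field name "page" is an assumption.
const pipeline = {
  description: "Normalize page names before indexing",
  processors: [
    // drop the query string
    { gsub: { field: "page", pattern: "\\?.*$", replacement: "" } },
    // ensure a single leading slash is present
    { gsub: { field: "page", pattern: "^/?", replacement: "/" } },
    // drop one trailing slash, but keep the root page "/"
    { gsub: { field: "page", pattern: "(.)/$", replacement: "$1" } },
    { lowercase: { field: "page" } }
  ]
};

// Local approximation of the processors, applied in order to a copy of the document.
function simulatePipeline(doc) {
  return pipeline.processors.reduce((current, processor) => {
    const [type, config] = Object.entries(processor)[0];
    const value = current[config.field];
    const updated = type === "gsub"
      ? value.replace(new RegExp(config.pattern, "g"), config.replacement)
      : value.toLowerCase();
    return { ...current, [config.field]: updated };
  }, doc);
}

console.log(simulatePipeline({ page: "EN/Products/?ref=home" }).page); // prints "/en/products"
```

Unlike an analyzer, a pipeline like this rewrites the field itself, so the normalized value is what gets stored in the document. JSON.stringify(pipeline) would give the body to register the pipeline with.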

--Alex

