What would be the "best" (optimal?) strategy to strip HTML links (hyperlinks) from string fields?

Hermano_Cabral · October 7, 2014, 9:03pm

Howdy,

What would be the "best" way to strip hyperlinks (e.g. http://google.com,
www.facebook.com, etc.) so that they are not analyzed? So far I've been
using the pattern_replace char filter with reasonable success, but the
regex is getting quite big and complex as it grows to handle all the edge
cases. Even though we're still experimenting with ES, I'm starting to worry
about the performance impact once we start ingesting large volumes of
data into our ES cluster. Would the pattern_replace token filter be a
better option here?

Cheers!
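
For readers following along, here is a minimal sketch of the kind of setup described above, written as a Python dict that could be serialized into index settings. The regex is a deliberately simple stand-in and not the poster's actual pattern, and the names strip_urls and no_links are made up for illustration. Python's re module only approximates the Java regex engine Elasticsearch uses, so the preview helper is a rough local check, not a guarantee of identical behavior.

```python
import json
import re

# Hypothetical, deliberately simple URL pattern (an assumption, not the
# poster's regex): anything starting with http://, https:// or www.
# up to the next whitespace character.
URL_PATTERN = r"(?i)\b(?:https?://|www\.)\S+"

# Index settings using the pattern_replace char filter. The names
# "strip_urls" and "no_links" are invented for this sketch.
settings = {
    "analysis": {
        "char_filter": {
            "strip_urls": {
                "type": "pattern_replace",
                "pattern": URL_PATTERN,
                "replacement": " ",
            }
        },
        "analyzer": {
            "no_links": {
                "type": "custom",
                "char_filter": ["strip_urls"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}


def preview(text):
    """Approximate locally what the char filter would hand to the tokenizer."""
    return re.sub(URL_PATTERN, " ", text)


# Print the JSON body that would go into the index settings.
print(json.dumps(settings, indent=2))
```

One thing to keep in mind when comparing the two filters: a char filter sees the raw text before tokenization, while a token filter only sees individual tokens, so a URL may already have been split into pieces by the time a pattern_replace token filter runs.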

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/e1ba63a2-c8c1-4c65-811d-40c6b70fefd1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

nik9000 · October 7, 2014, 9:05pm

I do all my HTML munging in the application that sends data to
Elasticsearch. I know that isn't much help, but it does work.
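
As a rough illustration of that application-side approach, the sketch below strips link-like text from selected fields before a document is sent for indexing. The regex, the function names strip_links and clean_document, and the field names are all assumptions made for this example; a real pipeline would likely need a more careful pattern or a proper HTML parser.

```python
import re

# Hypothetical, simplified URL pattern (an assumption for this sketch):
# http://, https:// or www. followed by any run of non-whitespace.
_URL_RE = re.compile(r"\b(?:https?://|www\.)\S+", re.IGNORECASE)


def strip_links(text: str) -> str:
    """Remove link-like substrings and collapse the leftover whitespace."""
    without_links = _URL_RE.sub(" ", text)
    return " ".join(without_links.split())


def clean_document(doc: dict, fields: list) -> dict:
    """Return a copy of doc with links stripped from the named string fields."""
    cleaned = dict(doc)
    for name in fields:
        value = cleaned.get(name)
        if isinstance(value, str):  # leave non-string fields untouched
            cleaned[name] = strip_links(value)
    return cleaned
```

The cleaned dict would then be passed to whatever client call the application already uses to index documents.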

Hermano_Cabral · October 7, 2014, 9:48pm

Yeah, thanks for the idea but unfortunately that's not really an option for
me as I have no control over how the data gets sent to ES.

Hermano_Cabral · October 9, 2014, 3:05pm

Anyone?
