Processing HTML content with the Java transport client in JSON files ingested using Logstash and Kafka

I am trying to ingest files (articles with a heading, body, etc.) into Elasticsearch using Logstash and Kafka. The files are JSON (fields: title, body, etc.), and the body field contains HTML. My Elasticsearch version is 5.6. I am using the Java TransportClient to interact with Elasticsearch.

How do I process the documents that Logstash has already ingested into Elasticsearch, so that all the HTML tags and their attributes are replaced appropriately, and then store the result back in the same index?

Q1) Processing HTML content using Java.
Q2) Processing the files ingested by Logstash and storing only the processed versions.

I tried AnalyzeRequest's addCharFilter with html_strip, which I thought was the equivalent of HTMLStripCharFilter (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html). However, the tokens it returns have only the '<' and '>' symbols removed; the tag names and attributes are still there, unlike with HTMLStripCharFilter.

I am attaching my code for this.

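    import java.util.List;
    import org.elasticsearch.action.admin.indices.analyze.AnalyzeRequest;
    import org.elasticsearch.action.admin.indices.analyze.AnalyzeResponse;

    // client is a connected TransportClient; indexName is the target index
    String text = "<p>tag<p>";
    AnalyzeRequest analyzeRequest = new AnalyzeRequest(indexName).text(text).addCharFilter("html_strip");
    List<AnalyzeResponse.AnalyzeToken> tokens = client.admin().indices().analyze(analyzeRequest).actionGet().getTokens();
    // Join the returned terms with spaces to inspect the result
    StringBuilder tmp = new StringBuilder();
    for (AnalyzeResponse.AnalyzeToken token : tokens) {
        tmp.append(token.getTerm()).append(" ");
    }
    System.out.println(tmp);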

Output: I expect "tag", but I get "p tag p".
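
For reference, this is the kind of result I am after for Q1, written as a minimal sketch in plain Java with no Elasticsearch or Lucene classes involved. The class name HtmlStripSketch and the method stripTags are hypothetical names made up for this illustration. It is a rough regex-based stand-in, not HTMLStripCharFilter itself and not a real HTML parser, and it only decodes a handful of entities.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; a crude regex approximation of tag stripping.
class HtmlStripSketch {
    // HTML comments, including multi-line ones
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // script/style elements together with their content
    private static final Pattern SCRIPT_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Any remaining opening or closing tag, with its attributes
    private static final Pattern TAG = Pattern.compile("</?[a-zA-Z][^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        String text = COMMENT.matcher(html).replaceAll(" ");
        text = SCRIPT_STYLE.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a few common entities; &amp; goes last to avoid double decoding
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        // Collapse the whitespace left behind by removed tags
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>tag<p>")); // prints "tag"
        System.out.println(stripTags("<a href=\"x.html\" class=\"c\">read <b>more</b></a>")); // prints "read more"
    }
}
```

This breaks on attribute values that contain '>' and on malformed markup, so I would only treat it as a fallback. It also does not cover Q2: reading the documents back and re-indexing the cleaned body is a separate step that is not shown here.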

I also came across PreBuiltCharFilters (and HtmlStripCharFilterFactory), but I am not sure how to use them, and I could not find any code examples for a similar use case.

So how do I go about this? Thank you.
