Processing HTML content using Java transport client within JSON files ingested using Logstash and Kafka

phoenix9 · September 10, 2018, 10:21am

I am trying to ingest files (articles with a heading, body, etc.) into Elasticsearch using Logstash and Kafka. The files are JSON (fields: title, body, etc.), and the body field contains HTML. My Elasticsearch version is 5.6. I am using the Java TransportClient to interact with Elasticsearch.

How do I process the documents that Logstash has ingested into Elasticsearch, replace all the HTML tags and their attributes appropriately, and store the result back in the same index?

Q1) HTML content processing using Java.
Q2) Processing the Logstash-ingested files and storing only the processed versions.
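
For Q2, the per-document step can be kept separate from how documents are read from and written back to the index. The following is only a minimal sketch, assuming each document's source is available as a Map<String, Object> and that writing the result back under the same document ID is handled elsewhere. BodyFieldRewriter, rewriteField, and the stripper parameter are hypothetical names for illustration and are not part of any Elasticsearch API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class BodyFieldRewriter {

    /**
     * Returns a copy of the document source in which the given field has been
     * passed through the stripper function. Fields that are missing or are not
     * strings are left unchanged, and the input map is not modified.
     */
    public static Map<String, Object> rewriteField(Map<String, Object> source,
                                                   String field,
                                                   UnaryOperator<String> stripper) {
        Map<String, Object> updated = new LinkedHashMap<>(source);
        Object value = updated.get(field);
        if (value instanceof String) {
            updated.put(field, stripper.apply((String) value));
        }
        return updated;
    }
}
```

The stripper is passed in as a function so that whichever HTML-stripping approach answers Q1 can be plugged in without changing this step.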

I tried using AnalyzeRequest's addCharFilter with html_strip, which I thought was the equivalent of HtmlStripCharFilter (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html). However, the tokens returned have only the '<' and '>' symbols removed. The tag names and attributes are still there, unlike with HTMLStripCharFilter.

I am attaching my code for this.

String text = "<p>tag<p>";
AnalyzeRequest analyzeRequest = (new AnalyzeRequest(indexName)).text(text).addCharFilter("html_strip");
List<AnalyzeResponse.AnalyzeToken> tokens = client.admin().indices().analyze(analyzeRequest).actionGet().getTokens();
String tmp = "";
for (AnalyzeResponse.AnalyzeToken token : tokens) {
    tmp += token.getTerm() + " ";
}
System.out.println(tmp);

Output: I expect "tag", but I get "p tag p".
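
For comparison, the markup can also be stripped on the client side, independently of the analyze API. The following is a minimal sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html); it is not Lucene's HTMLStripCharFilter and may treat edge cases differently. The class name HtmlTextExtractor and the method stripHtml are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlTextExtractor {

    /**
     * Parses the input as HTML and returns only its text content.
     * Tags and their attributes are dropped; text chunks reported by the
     * parser are joined with a single space.
     */
    public static String stripHtml(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Calling HtmlTextExtractor.stripHtml(text) on the sample string above should leave only the text between the tags. If the stripping has to go through the analyze API instead, one thing worth checking (an unverified assumption, not confirmed in this thread) is whether the request also needs an explicit tokenizer, for example .tokenizer("keyword") on the AnalyzeRequest, before the char filter is applied rather than the index's default analyzer.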

I also came across PreBuiltCharFilters (and HtmlStripCharFilterFactory), but I am unsure how to use them, and I did not find any code examples for a similar use case.

So how do I go about this? Thank you.

