Loading and indexing dbpedia datasets in ES

Furabio · March 11, 2019, 11:30am

Hello,
I'm new to Elasticsearch and I am trying to load and index dbpedia datasets (RDF triples) into ES. The datasets are available in ttl format from https://wiki.dbpedia.org/downloads-2016-04.

My question is: how do I load and index this into ES? Should I first convert the data to JSON format?

Thanks for your help.

Mark_Harwood · March 11, 2019, 11:38am

See https://www.youtube.com/watch?v=ZzWT-2xdaek
The comments section includes a link to some code
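
On the JSON question: Elasticsearch indexes JSON documents, so each triple has to become one at some point. Below is a minimal sketch of that conversion, assuming the dump holds one triple per line in N-Triples-style syntax. The function name `triple_to_doc` and the field names (`subject`, `predicate`, `object_uri`, `object_value`, `object_lang`, `object_datatype`) are made up for illustration and are not taken from the linked code.

```python
import json
import re

# Assumption: one triple per line, "<subject> <predicate> object ."
TRIPLE_RE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')
# A literal, optionally followed by a language tag or a datatype URI.
LITERAL_RE = re.compile(r'^"(.*)"(?:@([A-Za-z0-9-]+)|\^\^<([^>]*)>)?$')


def triple_to_doc(line):
    """Turn one triple line into a JSON-serialisable dict, or None if it does not parse."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or comment
    match = TRIPLE_RE.match(line)
    if match is None:
        return None  # not in the simple form this sketch handles
    subject, predicate, obj = match.groups()
    doc = {"subject": subject, "predicate": predicate}
    if obj.startswith("<") and obj.endswith(">"):
        doc["object_uri"] = obj[1:-1]
    else:
        literal = LITERAL_RE.match(obj)
        # Escape sequences inside the literal are left as-is in this sketch.
        doc["object_value"] = literal.group(1) if literal else obj
        if literal and literal.group(2):
            doc["object_lang"] = literal.group(2)
        if literal and literal.group(3):
            doc["object_datatype"] = literal.group(3)
    return doc


sample = ('<http://dbpedia.org/resource/Berlin> '
          '<http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@en .')
print(json.dumps(triple_to_doc(sample), indent=2))
```

Lines that do not match the simple pattern return `None`, so a caller can count and inspect them rather than silently indexing bad data.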

Furabio · March 11, 2019, 12:20pm

Hi, thanks for the reply.
I've already seen that tutorial, but I'm missing the step where the dataset is loaded into ES. When I run the Python script against Elasticsearch running in the cloud, I get a connection error. When Elasticsearch is running locally instead, the Python script executes successfully, but in cmd I get a java.io.IOException and I see this error: "[o.e.h.n.Netty4HttpServerTransport] [my_node] caught exception while handling client http traffic, closing connection"

Mark_Harwood · March 11, 2019, 1:34pm

You need to set up the connection details correctly.
This is an example of a Python client connecting to an Elastic Cloud cluster:

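import certifi
from elasticsearch import Elasticsearch

# Replace the endpoint and credentials with your own Elastic Cloud values.
remoteEs = Elasticsearch(
    ["xxxxxMY_CLOUD_ENDPOINT xxxxxx.found.io"],
    port=9243,
    http_auth="MY_USERNAME:MY_PASSWORD",
    use_ssl=True,
    verify_certs=True,
    ca_certs=certifi.where(),
)

# Placeholder query (an assumption for illustration): match every document.
myQuery = {"query": {"match_all": {}}}
response = remoteEs.search(index="MY_INDEX", body=myQuery)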
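
Once a client such as `remoteEs` connects, documents are usually sent in batches rather than one request each. The sketch below shows one way to do that with `elasticsearch.helpers.bulk`. The names `build_actions` and `index_docs`, the index name `dbpedia_triples`, and the document fields are assumptions for illustration. Older 6.x clusters may also expect a `_type` entry in each action.

```python
def build_actions(docs, index_name):
    """Wrap plain dicts in the action format that the bulk helper consumes."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def index_docs(client, docs, index_name="dbpedia_triples"):
    """Send docs to Elasticsearch in batches; returns (success_count, errors)."""
    # Imported here so the rest of the sketch runs without the client installed.
    from elasticsearch import helpers

    return helpers.bulk(client, build_actions(docs, index_name), chunk_size=1000)


# Hypothetical documents, only to show the shape of the generated actions.
example_docs = [
    {"subject": "http://a/s", "predicate": "http://a/p", "object_uri": "http://a/o"},
    {"subject": "http://a/s", "predicate": "http://a/q", "object_value": "x"},
]
example_actions = list(build_actions(example_docs, "dbpedia_triples"))
print(example_actions[0])
```

With a working connection, `index_docs(remoteEs, example_docs)` would send both documents in a single bulk request.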

Furabio · March 11, 2019, 7:03pm

Thanks again.
I solved it: the Python script now runs successfully, and I have verified that the index is created in ES.
However, I have one last problem: the index is empty.
This happens because the script never enters the final loop, which should iterate over all the triples in the dataset (for line in file:, where file is obtained by with os.popen('bzip2 -cd ' + filename) as file:).

Also, I noticed that the import of bz2 is unused. Could that be the cause of the problem?

Mark_Harwood · March 11, 2019, 7:17pm

To be honest the code was probably originally written by searching stackoverflow for “how to read a bz2 file using python”. This probably isn’t the forum to discuss your problems with reading the raw data but I’d start by checking the file name is right.
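
One way to rule out both a wrong file name and a missing external `bzip2` command is to read the archive with Python's standard `bz2` module. If the `bzip2` binary is not on the PATH, `os.popen` may simply produce no lines, so the loop body never runs. The sketch below is an alternative to the script's approach, not the original code. The helper name `iter_lines` and the demo file are made up for illustration.

```python
import bz2
import os
import tempfile


def iter_lines(filename):
    """Yield decoded text lines from a .bz2 file, failing loudly if it is missing."""
    if not os.path.isfile(filename):
        raise FileNotFoundError(f"No such file: {filename}")
    # "rt" opens the compressed file in text mode, so lines arrive as str.
    with bz2.open(filename, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Demo with a tiny throwaway archive (an assumption, not a real DBpedia dump).
demo_path = os.path.join(tempfile.mkdtemp(), "sample.ttl.bz2")
with bz2.open(demo_path, mode="wt", encoding="utf-8") as out:
    out.write("<http://a/s> <http://a/p> <http://a/o> .\n")
    out.write('<http://a/s> <http://a/q> "x"@en .\n')

demo_lines = list(iter_lines(demo_path))
print(len(demo_lines), "lines read")
```

Because `iter_lines` is a generator, the missing-file error is raised when iteration starts, not when the function is called.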

system · April 8, 2019, 7:17pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
