I have a total of around 500,000 documents that I was previously inserting with bulk requests of 1,000 at a time.
Before I had an ingest plugin, I used an external library to transform the input and then collected the results for a bulk insertion.
Now that I have an ingest plugin for my purpose, how can I use it to transform documents in bulk and then send them for a bulk insertion?
I used this command to ingest a document using a pipeline I created:
curl -X PUT 'http://localhost:9200/test_index/review/1?pipeline=apply-vader-review' -H 'Content-Type: application/json' -d '{
  "content": "The plot was good, but the characters are uncompelling and the dialog is not great."
}'
The ingested document looks like this:
{
  "_index": "test_index",
  "_type": "review",
  "_id": "1",
  "_score": 1,
  "_source": {
    "content": "The plot was good, but the characters are uncompelling and the dialog is not great.",
    "polarity": {
      "negative": 0.327,
      "neutral": 0.579,
      "positive": 0.094,
      "compound": -0.7042
    }
  }
}
I see that I have to provide an _id in the request. How can I follow this format if I want the document id to be auto-generated during a bulk ingestion?
I figured it out. This is the _bulk request I tried, and it was successful.
curl -X POST 'http://localhost:9200/test_index/review/_bulk?pipeline=apply-vader-review' -H 'Content-Type: application/x-ndjson' -d '
{ "index" : { "_index" : "test_index", "_type" : "review" } }
{"content": "The plot was good, but the characters are uncompelling and the dialog is not great."}
{ "index" : { "_index" : "test_index", "_type" : "review" } }
{"content": "The plot was good, but the characters are uncompelling and the dialog is not great."}
{ "index" : { "_index" : "test_index", "_type" : "review" } }
{"content": "The plot was good, but the characters are uncompelling and the dialog is not great."}
'
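To scale this to the ~500,000 documents in batches of 1,000, the same pattern can be scripted. The sketch below (assumptions: index `test_index`, type `review`, pipeline `apply-vader-review`, and the hypothetical helper names `build_bulk_body`/`chunked`) builds NDJSON `_bulk` payloads with no `_id` in the action lines, so Elasticsearch auto-generates the ids:

```python
import json

def build_bulk_body(docs, index="test_index", doc_type="review"):
    """Build an NDJSON _bulk payload; omitting "_id" in each action
    line makes Elasticsearch auto-generate the document ids."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    # The _bulk API requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"

def chunked(seq, size=1000):
    """Yield successive batches of `size` documents."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Example: two documents, batched and rendered as one bulk payload.
# Each payload would then be POSTed to
# http://localhost:9200/_bulk?pipeline=apply-vader-review
# with the Content-Type: application/x-ndjson header.
docs = [
    {"content": "The plot was good, but the characters are uncompelling."},
    {"content": "The dialog is not great."},
]
for batch in chunked(docs, size=1000):
    payload = build_bulk_body(batch)
```

Since the pipeline is given as a URL query parameter, every document in the batch goes through the same transform; only the batching and payload formatting need to live in client code.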