Ingest pipeline processing issue in a for loop

I am running an ingest pipeline in a loop over an array of PDF files.

I am creating a new index for each file, but sometimes the pipeline request returns 201 and sometimes 200.

When it returns 201, it just creates the index but does not process the document.
Any idea what the problem is here?

Code:

import base64
import re

import requests
from elasticsearch import Elasticsearch
from office365.sharepoint.files.file import File

# ctx, folder_url and target_files come from the SharePoint client setup (omitted here).
es = Elasticsearch("http://localhost:9200")

for idx, f in enumerate(target_files):
    pdf_filename = f.name
    print(pdf_filename)
    response = File.open_binary(ctx, folder_url + pdf_filename)  # SharePoint read

    # The attachment processor expects the file as a base64-encoded string.
    encoded = base64.b64encode(response.content).decode("utf-8")
    payload = {
        "data": encoded
    }

    # Derive a valid index name from the file name.
    index_name = re.sub(r"[\W_]+", "_", pdf_filename.lower())

    url = f"http://localhost:9200/{index_name}/_doc/1?pipeline=my-pipeline"

    response = requests.post(url, json=payload)

    print(response.status_code)  # sometimes it gives 200, sometimes 201

    if es.indices.exists(index=index_name):
        print("index created by ingest pipeline!")

    result = es.search(index=index_name)
    result = result["hits"]["hits"]  # empty when 201, processed when 200
    print(len(result))

    output_text = result[0]["_source"]["attachment"]["content"]

    es.indices.delete(index=index_name)

Why would you create a new index per file? That sounds crazy to me.

If I overwrite the document in a single index, the results get very weird.

That's why I tried all these approaches.

# Variant: write every file to the same document in a single index.
url = "http://localhost:9200/pdf_index/_doc/1?pipeline=my-pipeline"

for idx, f in enumerate(target_files):
    pdf_filename = f.name
    print(pdf_filename)
    response = File.open_binary(ctx, folder_url + pdf_filename)

    encoded = base64.b64encode(response.content).decode("utf-8")
    payload = {
        "data": encoded
    }

    response = requests.post(url, json=payload)

    result = es.search(index="pdf_index")
    result = result["hits"]["hits"]
    print(len(result))

    output_text = result[0]["_source"]["attachment"]["content"]
    print(output_text[:100])

Output:

test.pdf
1
OIL-INJECTED ROTARY SCREW  COMPRESSORS  GA 90+-160/GA 110-160 VSD (90-160 kW/125-200 hp)

test1.pdf
1
OIL-INJECTED ROTARY SCREW COMPRESSORS GA 90+-160/GA 110-160 VSD (90-160 kW/125-200 hp)

You can see that the output of print(output_text[:100]) is the same for both files, which it shouldn't be.

Indices are like databases, so you should not write a single document to each; that will not scale well at all. Instead, write all documents to a single index. That is how you are supposed to use Elasticsearch, and indices with multiple primary shards can be terabytes in size.
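As a minimal sketch of that approach (assuming the same my-pipeline attachment pipeline and the ctx / folder_url / target_files SharePoint setup from the question, a hypothetical single index called pdf_index, and the 8.x Python client, where the document body is passed as document=):

for f in target_files:
    pdf_filename = f.name
    response = File.open_binary(ctx, folder_url + pdf_filename)
    encoded = base64.b64encode(response.content).decode("utf-8")

    # One document per file, keyed by file name, so later files
    # do not overwrite earlier ones.
    es.index(
        index="pdf_index",         # hypothetical single index for all PDFs
        id=pdf_filename,
        document={"data": encoded},
        pipeline="my-pipeline",
        refresh="wait_for",        # block until the document is searchable
    )

hits = es.search(index="pdf_index", size=100)["hits"]["hits"]
for hit in hits:
    print(hit["_id"], hit["_source"]["attachment"]["content"][:100])

The refresh="wait_for" argument matters if you search immediately after indexing: Elasticsearch search is near-real-time, so without it a search can still return the previous version of a document.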

If you are seeing issues when overwriting documents, describe exactly what those are and try to resolve them.

"the results are getting very weird" is very vague. You need to describe the issues in a lot more detail if someone is going to be able to help.

I have pasted the output above, which should make the weirdness clear to any reader.

The length of the output is 1, and the content is the same for both files after processing by the ingest pipeline. That should not be the case, since the file names are different (test.pdf and test1.pdf) and so are their contents.

The output content

OIL-INJECTED ROTARY SCREW  COMPRESSORS  GA 90+-160/GA 110-160 VSD (90-160 kW/125-200 hp)

is from test.pdf, but test1.pdf has different content, yet it is showing the same as the first file.

To your point about dumping all files into a single index: it may or may not work, I have not tried it, but the code I have written above is also not incorrect.

It should also work.
There can be many solutions to reach the final output, but all of them should work, right?

I find this quite rude. Given that I am volunteering here, I do not think I want to spend any more time on this thread. Good luck.

Thanks!