How to know the _id of a record externally?

I'm developing a log analysis system. The input is log files. I have an external Python program that reads the log files and decides whether a record (a line of a log file) is "normal" or "malicious". I want to use the Elasticsearch Update API to write the detection result ("normal" or "malicious") back into Elasticsearch's index by adding a new field called "result", so I can see the result clearly in the Kibana UI.

Simply put, my Python code and Elasticsearch each take the log files as input. Now I want to push the result from my Python code into Elasticsearch. What's the best way to do it?

I can think of several ways:

  1. Elasticsearch automatically assigns an ID (_id) to each record. If I could find out how Elasticsearch calculates _id, my Python code could compute the same ID for each record and update the record via _id. The problem is that the docs don't describe the algorithm used to generate _id.

  2. Add an ID (like a line number) to each line of the log files, then use this ID to update. But I think I would have to search for this ID every time, because it's only a normal field instead of the built-in _id, so the performance would be poor (see the sketch at the end of this post).

  3. My Python code gets the logs from Elasticsearch instead of reading the log files directly. But this makes the system fragile, because Elasticsearch becomes a critical dependency, and performance would also suffer.

So the first option looks ideal from my current point of view. Any suggestions? Thanks.
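For reference, here is a minimal sketch of what option 2 would look like with the elasticsearch-py 8.x client, assuming a hypothetical index named "logs" and a custom field named "line_id"; the extra search round-trip per record is exactly the overhead I'm worried about:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def update_result_by_line_id(line_id, result):
    """Find the document by the custom line_id field, then update it by _id."""
    # The lookup: one search round-trip per record (the performance concern).
    resp = es.search(index="logs", query={"term": {"line_id": line_id}}, size=1)
    hits = resp["hits"]["hits"]
    if not hits:
        return None
    doc_id = hits[0]["_id"]
    # Partial update: adds/overwrites only the "result" field.
    es.update(index="logs", id=doc_id, doc={"result": result})
    return doc_id

update_result_by_line_id(42, "malicious")
```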

When the indexing request returns, we also return you the _id. There is no deterministic way to recalculate the ID; we take things like the MAC address and wall clock time into account.
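For example, a rough sketch with the Python client (index name and document are just placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The response to an index request already contains the assigned _id.
resp = es.index(index="logs", document={"message": "GET /admin 403"})
assigned_id = resp["_id"]  # keep this if you want to update the document later
print(assigned_id)
```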

Thanks!

I finally used the 2nd way: I add my own ID to each line of the log file and use it in both Elasticsearch and my Python program, so Python can update the document based on that ID (see the sketch below).
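For anyone landing here later, here is roughly how it can be wired up; a minimal sketch, assuming the custom ID (here simply the line number) is used directly as the document _id, and assuming hypothetical names like a "logs" index and an "access.log" file. Using the custom ID as _id avoids the extra search from option 2:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
INDEX = "logs"  # assumed index name

# Ingest side: use the line number (an ID we control) as the document _id.
with open("access.log") as f:
    for line_no, line in enumerate(f, start=1):
        es.index(index=INDEX, id=line_no, document={"message": line.rstrip("\n")})

# Detection side: update by the same _id, no search needed.
def mark(line_no, result):
    es.update(index=INDEX, id=line_no, doc={"result": result})

mark(42, "malicious")
```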
