External libs in Python scripts (lang-python module)

Hello

We are currently running an Elasticsearch cluster with a fairly complex enrichment pipeline.
The enrichment currently runs as a separate Python process that pulls data from ES, processes it, and puts it back.

The process performs various enrichments, such as splitting URLs into their components, IP geolocation using MaxMind's DB, MAC address vendor resolution, etc., all relying on external data sources.
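For concreteness, none of these steps is individually complicated. The URL splitting, for instance, boils down to something like the sketch below; it's written in Java (the language an eventual ingest plugin would use) rather than our actual Python, and the class name is made up for illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative sketch of the URL-splitting enrichment: decompose a URL
// into its components so each can be indexed as a separate field.
public final class UrlSplitDemo {
    public static void main(String[] args) throws URISyntaxException {
        URI uri = new URI("https://example.com:8080/path/page?q=test#frag");
        System.out.println("scheme:   " + uri.getScheme());   // https
        System.out.println("host:     " + uri.getHost());     // example.com
        System.out.println("port:     " + uri.getPort());     // 8080
        System.out.println("path:     " + uri.getPath());     // /path/page
        System.out.println("query:    " + uri.getQuery());    // q=test
        System.out.println("fragment: " + uri.getFragment()); // frag
    }
}
```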

With the ingest node capability coming in ES 5, I would be interested in moving all of the enrichment into an ingest node. However, not all of the functionality we have implemented is currently available in an ingest pipeline.

My question is: if I had to write a script using the lang-python module to implement the missing functionality, for example URL splitting and MAC vendor resolution, would it be possible, from within the script, to:

  • import Python modules (with the usual sys.path.append('/module/path/'))? And where should I put the module code? $ES_HOME/config/script/modules?

  • open and read files on disk? And where should I put those files?

  • access external data sources, for example through an HTTP REST call? Or is the sandboxed code somehow forbidden from opening connections?

Thanks in advance for any advice. I tried to search for this info, but unfortunately the results get mixed in with questions about the Python ES client library.

/V

It might possibly work, but I'd avoid it because Python is deprecated in ES 5. As useful as it is to be able to write stuff in a language you're comfortable with, we just can't give all those languages proper attention. In 5.0 we're really throwing all our support behind the new Painless language because it has a proper sandbox and is quite quick.

I'm sorry for the loss of flexibility, but it makes our life so, so, so much more sane. So, yeah, we want these scripts to be properly sandboxed and unable to reach outside the server, but Python never really had that ability.

I just checked the docs page for the Python plugin in 5.0.0 and it isn't clear that it is deprecated. We'll at least fix that.

MAC vendor resolution sounds like a thing that'd be fairly easy to set up as an ingest plugin, but I don't know if that'd be worth it for you because you already have the enrichment pipeline. It sounds like it'd be similar to geoip, which we already have a plugin for.

Good to know; at least I'll get some peace of mind before trying to fight against a Jython sandbox (been there, done that, it's not pretty).

Reading around, I would assume that Painless can't really connect to external sources, even to perform some basic lookups. Is that right?
Our enrichment is pretty basic; the MAC vendor resolution, for example, is just a lookup of the MAC address's first bytes against a vendor dictionary. It's nothing more complex than the GeoIP resolution.
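To make that concrete, here is a minimal sketch of the lookup in Java (the class name and the sample table entry are made up for illustration): the key is just the OUI, i.e. the first three octets of the address.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the MAC vendor lookup: extract the OUI (the first
// three octets of the MAC address) and look it up in a vendor dictionary.
public final class MacVendorLookup {

    private final Map<String, String> ouiToVendor;

    public MacVendorLookup(Map<String, String> ouiToVendor) {
        this.ouiToVendor = ouiToVendor;
    }

    public String vendorFor(String mac) {
        // "00:1a:2b:3c:4d:5e" -> "00:1A:2B"
        String oui = mac.toUpperCase().substring(0, 8);
        return ouiToVendor.get(oui); // null if unknown
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("00:1A:2B", "Example Vendor Inc."); // made-up entry
        MacVendorLookup lookup = new MacVendorLookup(table);
        System.out.println(lookup.vendorFor("00:1a:2b:3c:4d:5e"));
    }
}
```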

What we do differently, though, is that we perform the enrichment on fields selected by regexes and wildcards, e.g. every field named *_mac_addr gets enriched with the MAC address resolution.

Reading the current documentation, it seems that ingest processors can only target fields with a fixed name.
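The kind of field targeting we have in mind is nothing fancier than this sketch (the wildcard-to-regex conversion and the field names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch: turn a wildcard like "*_mac_addr" into a regex and use it to
// decide which document fields a given enrichment applies to.
public final class FieldPatternDemo {

    static Pattern wildcardToRegex(String wildcard) {
        // Quote the literal parts, re-enabling '*' as ".*".
        return Pattern.compile(Pattern.quote(wildcard).replace("*", "\\E.*\\Q"));
    }

    public static void main(String[] args) {
        Pattern p = wildcardToRegex("*_mac_addr");

        Map<String, Object> doc = new HashMap<>();
        doc.put("src_mac_addr", "00:1a:2b:3c:4d:5e");
        doc.put("dst_mac_addr", "66:77:88:99:aa:bb");
        doc.put("src_ip", "192.0.2.1");

        for (String field : doc.keySet()) {
            if (p.matcher(field).matches()) {
                System.out.println("would enrich: " + field); // the two *_mac_addr fields
            }
        }
    }
}
```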

Anyway, judging from what you say, if there really isn't any chance of doing I/O in Painless, it seems more convenient to just write an ingest plugin ourselves, using something like the GeoIP plugin as a template. I bet it would take the same amount of effort as getting a Python script to work at this point.

Thanks for your answer!

Right. Intentionally. We embed Painless all over the place, and many of those places would perform super badly if they made blocking calls.

Doing ingest actions against a pattern seems like a decent idea to me. Would you like to open an issue on ES's GitHub to talk about adding support for that?

Probably, and it wouldn't be unsupported in the next major version, which is good.

If you end up writing an ingest plugin, I'm fairly sure it wouldn't work super well for you to do I/O there either. You could load the database on startup and maybe use a FileWatcher to pick up any changes to the file while the node is running.
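Something along these lines, sketched here with plain java.nio just to show the idea (in a real plugin you'd more likely hook into Elasticsearch's own file-watching utilities; the paths and the reload hook are made up):

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

// Sketch of "load at startup, reload on change": watch the directory that
// holds the dictionary file and re-read it whenever the file is modified.
public final class DictionaryReloader {

    public static void watch(Path dictFile, Runnable reload)
            throws IOException, InterruptedException {
        WatchService watcher = dictFile.getFileSystem().newWatchService();
        dictFile.getParent().register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);

        while (true) {
            WatchKey key = watcher.take(); // blocks until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                // Only react to changes of our specific file.
                if (dictFile.getFileName().equals(event.context())) {
                    reload.run(); // e.g. re-read the vendor dictionary into a new map
                }
            }
            if (!key.reset()) {
                break; // the watched directory is no longer accessible
            }
        }
    }
}
```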

Sure thing, I'll file a ticket on GitHub describing the approach we are currently implementing in our code. I'm fairly sure it'll be quite powerful, especially when having to deal with lots of different and arbitrary data feeds.

That would be exactly what I need. I'm gonna check the geoip plugin code to see how it's done over there.
BTW, if this approach could be generalized and paired with the 'pattern matching field naming' discussed above, it would make for an even more flexible enrichment process.

Think of it as a generic key/value lookup enrichment module where:

  • One specifies a dictionary file containing the data (e.g. MAC address prefix -> vendor name)
  • One configures the field-naming pattern to apply the enrichment to (e.g. field_name: '*_mac'), using a syntax similar to that of dynamic mapping entries
  • The enricher performs a key/value lookup using the field value as the key, putting the retrieved data (if found) in the same field or in a new one (in the latter case, regex capturing groups from the naming pattern could be used to specify the new field name)

It would save people from having to develop a dedicated plugin for each form of enrichment, when ultimately they would all follow the exact same structure. A rough sketch of what such a processor might look like is below.
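I don't know the plugin API in any detail yet, but based on a quick look at how ES 5.x ingest processors are structured, a skeleton might look roughly like this (the processor type, field handling, and naming are all made up, and the factory/registration plumbing is omitted):

```java
import java.util.Map;
import java.util.regex.Pattern;

import org.elasticsearch.ingest.AbstractProcessor;
import org.elasticsearch.ingest.IngestDocument;

// Hypothetical generic key/value lookup processor, modeled loosely on the
// structure of the geoip processor. Not the API of any shipped plugin.
public final class KvLookupProcessor extends AbstractProcessor {

    public static final String TYPE = "kv_lookup"; // made-up processor type

    private final Pattern fieldPattern;      // compiled from e.g. "*_mac"
    private final Map<String, String> table; // e.g. OUI prefix -> vendor name

    KvLookupProcessor(String tag, Pattern fieldPattern, Map<String, String> table) {
        super(tag);
        this.fieldPattern = fieldPattern;
        this.table = table;
    }

    @Override
    public void execute(IngestDocument doc) {
        // Copy the key set first: we may add fields while iterating.
        String[] fields = doc.getSourceAndMetadata().keySet().toArray(new String[0]);
        for (String field : fields) {
            if (!fieldPattern.matcher(field).matches()) {
                continue;
            }
            Object value = doc.getFieldValue(field, Object.class);
            String resolved = table.get(String.valueOf(value));
            if (resolved != null) {
                // Target field naming is illustrative; capturing groups from
                // the naming pattern could drive this instead.
                doc.setFieldValue(field + "_resolved", resolved);
            }
        }
    }

    @Override
    public String getType() {
        return TYPE;
    }
}
```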

I believe the grok ingest processor has some pattern matching capabilities, but I haven't looked into it too deeply to see if it has all the things you mention. It might be a good start, though.