I have multiple agents integrated with multiple types of applications using multiple methods, such as Node.js, Python, and Java agents, and some of the applications are instrumented using OpenTelemetry.
I want to have transaction.name in my span documents instead of just transaction.id.
Would using OpenTelemetry's attribute mapping work for you? Here is an example of what this could look like:

```python
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("name-of-transaction", kind=SpanKind.SERVER) as span:
    # Attach the transaction name as a span attribute so it appears on the span document.
    span.set_attribute("transaction.name", "name-of-transaction")
```
No, I have not tried OpenTelemetry for this, as I am looking for a single solution that covers all the APM agents from the Elasticsearch side, not at the code level, regardless of whether the application is instrumented with OpenTelemetry, Node.js, Python, or Java.
I will try that on the OpenTelemetry side as well, but there are the other agents too, and I need to know how we would enable this on them.
Thanks, @kishorkumar, for providing that additional context. From an Elasticsearch perspective, have you considered using an ingest pipeline or transforms for this purpose?
Hi,
I have tried via **Transform**, and it looks impossible, as span and transaction are in the same index but are different documents.
Yes, in an ingest pipeline it was possible via an enrich processor and an enrich policy, but since we have a data stream and would use index: apm-traces-* in the enrich policy, executing it throws a large-data exception: circuit_breaking_exception.
If we could include the month or the day in our index pattern, e.g. traces-apm-{{year}}-{{month}}-{{day}}-*, then the enrich policy would execute and we might get the results.
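For reference, this is roughly the shape of the enrich policy and pipeline I experimented with (the policy name, pipeline name, target field, and index pattern below are placeholders, not my exact configuration):

```
PUT _enrich/policy/transaction-name-policy
{
  "match": {
    "indices": "traces-apm*",
    "match_field": "transaction.id",
    "enrich_fields": ["transaction.name"]
  }
}

POST _enrich/policy/transaction-name-policy/_execute

PUT _ingest/pipeline/add-transaction-name
{
  "processors": [
    {
      "enrich": {
        "policy_name": "transaction-name-policy",
        "field": "transaction.id",
        "target_field": "transaction_enriched"
      }
    }
  ]
}
```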
Have you ever come across this kind of use case, where you need one document that merges fields from two different documents in the same data stream index, via a transform or any other method?
Yes, I have tried an ingest pipeline with an enrich processor (without the max_matches attribute). To use the enrich processor we need to create an enrich policy with that name, and when we try to execute that policy via the _execute API it gives us a "data too large" exception.
When we _execute the policy it shows the task as completed=false, and after some time:
```
{
  "error": {
    "root_cause": [
      {
        "type": "resource_not_found_exception",
        "reason": "task [jg:619029220] isn't running and hasn't stored its results"
      }
    ],
    "type": "resource_not_found_exception",
    "reason": "task [jg:619029220] isn't running and hasn't stored its results"
  },
  "status": 404
}
```
When we provide a particular index it runs, but that's not what we need, because we are looking for a general solution. With the month-based index pattern we would face the issue of changing the month every time, and if we go with a specific index, it would also need to be changed after every rollover.
I'm interested in testing this to get to the bottom of it. I tried testing with smaller samples, but I'd like to know your dataset's size. Please provide further context on how your data is structured and how I could recreate it.
It's standard APM data from 20+ applications, similar to the transactions data on demo.elastic.co. The only difference is that I have a much larger data volume, around 200GB to 400GB per day, without any replicas.
I have a 30-day retention policy, where data older than 14 days moves to the frozen tier.
In total, this means: 14 × 400 GB = 5,600 GB ≈ 5.47 TB
The applications include Java, Python, and Node.js. Since this data is highly sensitive, I can't share it, but you can use the given size estimates. The data structure is the same as demo.elastic.co.
I'm a bit confused about how it will work. Specifically:
Can a runtime field fetch or reference data (like the transaction name) from other documents? Spans and transactions are two different documents, not the same record.
Will the runtime field be displayed in these cases?
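For context, this is my understanding of what a runtime field in a search request looks like; the script only sees the current document, so I don't see how it could pull transaction.name from a separate transaction document (the field name transaction_name_runtime is just a placeholder):

```
GET apm-traces*/_search
{
  "runtime_mappings": {
    "transaction_name_runtime": {
      "type": "keyword",
      "script": {
        "source": "if (doc.containsKey('transaction.name') && doc['transaction.name'].size() > 0) { emit(doc['transaction.name'].value) }"
      }
    }
  },
  "fields": ["transaction_name_runtime"]
}
```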
I tried running the GET apm-traces*/_search query that you shared above, but I don't see the transaction name in the returned span documents.
Also, I have around 700,000 to 800,000 documents per minute for spans alone.
But I can't find any solution to do this through Elasticsearch itself.
I'm considering using Logstash with a simple pipeline (Elasticsearch input and filter), but I need help understanding how to properly configure it to handle this scale of data ingestion.
Thanks, @kishorkumar. I was thinking of Logstash as well, but I did wonder about the scale of your data. Do you have a code example of what you were considering?
I have now reduced the number of spans, by filtering out records, to approximately 350,000 per minute.
Given this ingestion rate of 350,000 per minute, I need to estimate the optimal number of Logstash workers, keeping in mind that it should not exceed the number of CPU cores on the server or machine.
In addition, I need guidance on tuning:
Batch size
Queue size
Other related performance parameters
Also, I want to ensure that the JVM heap size for Logstash is configured correctly; ideally, it should be no less than 4 GB and no more than 8 GB. I don't consider this to be a very high load, but when filters are applied, processing time may increase, so optimization is necessary. A sketch of the settings I have in mind is below.
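Something along these lines in logstash.yml and jvm.options is what I mean; all the numbers here are placeholders I would still need to validate against our hardware:

```
# logstash.yml
pipeline.workers: 8        # placeholder; should not exceed the number of CPU cores
pipeline.batch.size: 1000  # events per worker batch, to be tuned
pipeline.batch.delay: 50   # ms to wait before flushing a partial batch
queue.type: persisted      # durable queue to absorb ingestion bursts
queue.max_bytes: 8gb       # upper bound for the persisted queue

# jvm.options
-Xms8g
-Xmx8g                     # min and max heap are usually set to the same value
```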
Finally, I need to match the number of ingested records with what is actually indexed into Elasticsearch via APM, ensuring consistency between what is fetched and what is stored.
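To make the idea concrete, this is roughly the pipeline I'm considering; the hosts, index names, query, and schedule are placeholders rather than my actual configuration:

```
input {
  elasticsearch {
    hosts    => ["https://elasticsearch:9200"]
    index    => "traces-apm*"
    query    => '{ "query": { "term": { "processor.event": "span" } } }'
    schedule => "* * * * *"   # poll every minute
    docinfo  => true          # keep _index/_id metadata for the output stage
  }
}

filter {
  # enrich the span here, e.g. look up transaction.name for the span's transaction.id
}

output {
  elasticsearch {
    hosts => ["https://elasticsearch:9200"]
    index => "enriched-spans"
  }
}
```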