Break down XML into a nested object

Hi,

I am trying to break down PubMed articles with Logstash, but I cannot think of a filter suitable for the following XML structure. (Please also note that the AbstractText XML element contains inline markup that looks like XML, e.g. "<i>T. rubrum</i>", so I cannot use the xml filter here because I want the text as is.)

What I would like to have after parsing with Logstash is a nested object structure that is suitable for indexing in Elasticsearch as a nested type:

    {
      "Abstracts": [
        {
          "Label": "Introduction",
          "NlmCategory": "UNASSIGNED",
          "AbstractText": "...."
        },
        ...
      ]
    }

Below is the original input:

    <Abstract>
      <AbstractText Label="Introduction" NlmCategory="UNASSIGNED">Superficial mycosis is one of the most common diseases worldwide, however its epidemiology is changing over time.</AbstractText>
      <AbstractText Label="Aim" NlmCategory="UNASSIGNED">To present epidemiological data of the skin fungal infections diagnosed in the years 2011-2016 in Lower Silesia.</AbstractText>
      <AbstractText Label="Material and methods" NlmCategory="UNASSIGNED">A total of 11 004 patients with a clinically suspected superficial mycosis were investigated. Skin scrapings, nail clippings and plucked hair were examined with a direct microscopy, Wood's lamp and culture. Particular species were identified via polymerase chain reaction (PCR) examination. The lesions suspected for pityriasis versicolor were screened for <i>Malassezia</i> with Wood's lamp and direct microscopy.</AbstractText>
      <AbstractText Label="Results" NlmCategory="UNASSIGNED">Dermatomycosis was diagnosed in 1653 (15.00%) patients with 1795 fungi identified. 1858 specimens were indicative of fungal infection including dermatophytes, yeasts and moulds. Out of 924 cases of dermatophytic infections (51.48%), <i>Trichophyton rubrum</i> accounted for the majority (71.75%) and was followed by <i>Trichophyton tonsurans</i> (16.77%). Among the yeasts (716; 39.89%), <i>Candida</i> spp. was the most common agent identified (521; 67.66%). The sites affected most often were toenails (956; 51.45%) and fingernails (319; 17.17%). In paediatric population the most common diagnosis was <i>tinea corporis</i> (60, 41.10%).</AbstractText>
      <AbstractText Label="Conclusions" NlmCategory="UNASSIGNED">Our study revealed that toenail onychomycosis remains the most common superficial mycosis and <i>T. rubrum</i> is the most common pathogen. However, in a longer period of observation, a decrease in the number of <i>tinea capitis</i> cases and an increase in infections caused by <i>T. tonsurans</i> were noticed. Observed changes indicate the need for continuing studies to detect the upcoming epidemiological trends.</AbstractText>
      <CopyrightInformation>Copyright: © 2018 Termedia Sp. z o. o.</CopyrightInformation>
    </Abstract>

I would highly appreciate any suggestions on how to deal with this.
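
One direction that might work is a regex-based extraction instead of a full XML parse, since that leaves inline markup such as <i>...</i> untouched. Below is a minimal sketch in plain Ruby, not a tested Logstash pipeline. The helper name `extract_abstracts`, the constants and the shortened sample input are hypothetical, and the sketch assumes that AbstractText elements are never nested or self-closing and that attribute values are double-quoted. XML entities such as `&amp;` are left undecoded.

```ruby
require "json"

# Hypothetical patterns: one for a whole <AbstractText> element, one for a
# single name="value" attribute inside its opening tag.
ABSTRACT_TEXT = %r{<AbstractText\b([^>]*)>(.*?)</AbstractText>}m
ATTRIBUTE = /(\w+)="([^"]*)"/

# Hypothetical helper: returns one hash per <AbstractText> element, with the
# attributes as keys and the raw inner text (inline markup kept) under
# "AbstractText".
def extract_abstracts(xml)
  xml.scan(ABSTRACT_TEXT).map do |attrs, text|
    entry = {}
    attrs.scan(ATTRIBUTE) { |name, value| entry[name] = value }
    entry["AbstractText"] = text
    entry
  end
end

# Shortened, made-up sample in the same shape as the PubMed input.
sample = <<~XML
  <Abstract>
    <AbstractText Label="Aim" NlmCategory="UNASSIGNED">To present data.</AbstractText>
    <AbstractText Label="Conclusions" NlmCategory="UNASSIGNED"><i>T. rubrum</i> is the most common pathogen.</AbstractText>
  </Abstract>
XML

puts JSON.pretty_generate({ "Abstracts" => extract_abstracts(sample) })
```

Inside a Logstash ruby filter, the same logic could presumably read the raw XML with `event.get` and store the resulting array with `event.set`. The field names to use there depend on the rest of the pipeline.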
