Extract Data from a Multilevel XML File
Problem description & analysis
We have an XML file invoice.xml. Below is part of the data:
…
<LIST_G_AMOUNT_DUE>
<G_AMOUNT_DUE>
<SORT_TRX_SEQUENCE>1</SORT_TRX_SEQUENCE>
<SORT_TRX_DATE>28-JUL-14</SORT_TRX_DATE>
<SORT_INVOICE_NUMBER>111820</SORT_INVOICE_NUMBER>
<SORT_DUE_DATE>27-AUG-14</SORT_DUE_DATE>
<PS_SEQUENCE>1493092</PS_SEQUENCE>
<TRX_SEQUENCE>1712368</TRX_SEQUENCE>
<RECEIPT_CURRENCY_CODE></RECEIPT_CURRENCY_CODE>
<AMOUNT_APPLIED_FROM></AMOUNT_APPLIED_FROM>
<LIST_G_LINE_CLUSTER>
<G_LINE_CLUSTER>
<INVOICE_NUMBER>111820</INVOICE_NUMBER>
<TRX_DATE>28-JUL-14</TRX_DATE>
<TRANSACTION>Invoice</TRANSACTION>
<DUE_DATE>27-AUG-14</DUE_DATE>
<REFERENCE>SAMPLE SALE </REFERENCE>
<BILL_TO_LOCATION>WASHINGTON</BILL_TO_LOCATION>
<LINE_CUSTOMER_ID>4382</LINE_CUSTOMER_ID>
<GENERAL_SEQUENCE>9648082</GENERAL_SEQUENCE>
<AMOUNT_DUE></AMOUNT_DUE>
<TRX_AMOUNT>64.4</TRX_AMOUNT>
<CD_TRX_AMOUNT> 64.40 </CD_TRX_AMOUNT>
<DUMMY_REFERENCE>SAMPLE SALE </DUMMY_REFERENCE>
<C_BILL_TO_LOC>0</C_BILL_TO_LOC>
</G_LINE_CLUSTER>
<G_LINE_CLUSTER>
<INVOICE_NUMBER>111820</INVOICE_NUMBER>
<TRX_DATE>19-OCT-15</TRX_DATE>
…
We are trying to extract data from the XML file. The expected result is as follows:
SORT_INVOICE_NUMBER |
TRANSACTION |
SORT_DUE_DATE |
TRX_DATE |
TRX_AMOUNT |
111820 |
Invoice |
27-AUG-14 |
28-JUL-14 |
64.4 |
111820 |
Payment |
27-AUG-14 |
19-OCT-15 |
-64.4 |
1100585 |
Invoice |
30-JUL-15 |
30-JUN-15 |
69.4 |
1100585 |
Payment |
30-JUL-15 |
05-AUG-15 |
-16.73 |
1100585 |
Payment |
30-JUL-15 |
09-SEP-15 |
-52.2 |
1100585 |
Payment |
30-JUL-15 |
19-OCT-15 |
-0.47 |
1101491 |
Invoice |
05-AUG-15 |
06-JUL-15 |
69.4 |
1101491 |
Payment |
05-AUG-15 |
19-OCT-15 |
-69.4 |
… |
… |
… |
… |
… |
Solution
Write the following script p1.dfx in esProc:
A |
|
1 |
=file("invoice.xml").read() |
2 |
=xml(A1,"G_STATEMENT/LIST_G_AMOUNT_DUE/G_AMOUNT_DUE") |
3 |
=A2.news(if(ifa(LIST_G_LINE_CLUSTER.G_LINE_CLUSTER),LIST_G_LINE_CLUSTER.G_LINE_CLUSTER,[LIST_G_LINE_CLUSTER.G_LINE_CLUSTER]);SORT_INVOICE_NUMBER,~.TRANSACTION,SORT_DUE_DATE,~.TRX_DATE,~.TRX_AMOUNT) |
Explanation:
A1 Read the XML file as a string.
A2 Parse A1’s XML string, get data at the level of G_STATEMENT/LIST_G_AMOUNT_DUE/G_AMOUNT_DUE, and return it as a table sequence.
A3 Extend the table sequence by a specific column to get the targe result. Note that LIST_G_LINE_CLUSTER.G_LINE_CLUSTER could be a record sequence, which should first be checked and then converted into a sequence if it is a record sequence.
Refer to How to Call an SPL Script in BIRT to learn about the method of integrating the SPL script with BIRT.
https://www.eclipse.org/forums/index.php/t/1076017/
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL