xml_grep cookbook
iLike frequently receives large XML files from our partners that we need to analyze.
For ad-hoc analysis of XML from the command line, xml_grep is a really valuable tool. Instead of extracting lines that match regular expressions, xml_grep extracts XML nodes that match XPath expressions.
However, it took me a bit of trial and error to figure out how to install and use it successfully. Here's a quick cookbook of how I did it; I'll update this post with additional useful examples as I encounter them.
Installation
In most *nix distributions, you can install it by simply typing
sudo cpan -i "XML::Twig"
Answer 'y' when CPAN asks you if you want to install other packages that xml_grep depends on.
(xml_grep is included as a part of the XML::Twig package in CPAN. XML::Twig also provides a rich Perl API; however the command line examples here do not require any Perl knowledge.)
Usage examples
Suppose you have the following xml:
<?xml version="1.0" ?>
<Events>
  <Event ID="0E0042D2CF2E9E44">
    <ArtistIDs>
      <ArtistID ID="806533" Type="Primary"/>
      <ArtistID ID="1134688" Type="Secondary"/>
    </ArtistIDs>
    <EventDate>2009-07-25</EventDate>
    <PerformanceName>Summer Splash 2009</PerformanceName>
  </Event>
  <!-- etc -->
</Events>
xml_grep allows you to specify both --root (the node to match and print out) and --cond (an XPath expression, relative to root, that filters the results). If more than one --root is provided, the results are combined using OR; likewise with --cond.
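To make the --root/--cond behavior concrete, here is a rough Python model of the selection logic, using only the standard library's xml.etree.ElementTree. It is an illustration of the behavior described above, not how xml_grep is implemented. The `select` and `ids` helpers and the second event are made up for the example, the condition strings use ElementTree's path syntax rather than xml_grep's, and multiple roots are not modeled.

```python
import xml.etree.ElementTree as ET

# Sample data: the first event is from the post, the second is hypothetical.
SAMPLE = """<Events>
  <Event ID="0E0042D2CF2E9E44">
    <ArtistIDs>
      <ArtistID ID="806533" Type="Primary"/>
      <ArtistID ID="1134688" Type="Secondary"/>
    </ArtistIDs>
    <EventDate>2009-07-25</EventDate>
    <PerformanceName>Summer Splash 2009</PerformanceName>
  </Event>
  <Event ID="HYPOTHETICAL-0002">
    <ArtistIDs>
      <ArtistID ID="806532" Type="Primary"/>
    </ArtistIDs>
    <EventDate>2009-08-01</EventDate>
    <PerformanceName>Hypothetical Show</PerformanceName>
  </Event>
</Events>"""


def select(xml_text, root, conds=()):
    """Return elements named `root` that satisfy at least one condition.

    With no conditions, every `root` element is returned.
    """
    doc = ET.fromstring(xml_text)
    return [
        node
        for node in doc.iter(root)
        if not conds or any(node.find(cond) is not None for cond in conds)
    ]


def ids(nodes):
    """Collect the ID attribute of each selected element."""
    return [node.get("ID") for node in nodes]


# No condition: every Event is selected.
all_events = ids(select(SAMPLE, "Event"))

# One condition: only events that contain the given artist.
one_artist = ids(select(SAMPLE, "Event", [".//ArtistID[@ID='806533']"]))

# Two conditions are OR-ed: an event matching either artist is selected.
either_artist = ids(select(SAMPLE, "Event", [
    ".//ArtistID[@ID='806533']",
    ".//ArtistID[@ID='806532']",
]))

print(all_events, one_artist, either_artist)
```

The point to take away is that a condition narrows which root nodes are printed, and adding more conditions widens the result rather than narrowing it.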
1. Here's the simplest possible example. --root is an xpath expression specifying which nodes will be printed. -p causes the results to be pretty printed:
Print all events: xml_grep -p --root="Event" foo.xml
2. The next example shows how we can use --cond to filter the results based on an attribute of a subnode of the root.
Print all events for a given artist: xml_grep -p --root="Event" --cond='ArtistID[@ID="806533"]' foo.xml
3. Here's how we can match against the text content of a subelement:
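Quoting matters in a command like this one. As a hedged illustration, the sketch below uses Python's shlex.split, which follows POSIX shell quoting rules, to show the argument a shell would hand to xml_grep for two ways of quoting the condition. It does not check whether xml_grep accepts an unquoted attribute value; keeping the inner quotes intact simply avoids the question.

```python
import shlex

# Inner single quotes end and reopen the outer single-quoted string,
# so the shell removes them and the attribute value arrives unquoted.
nested_single = shlex.split("xml_grep --cond='ArtistID[@ID='806533']' foo.xml")

# Double quotes inside single quotes are passed through untouched.
double_inside = shlex.split("xml_grep --cond='ArtistID[@ID=\"806533\"]' foo.xml")

print(nested_single[1])  # --cond=ArtistID[@ID=806533]
print(double_inside[1])  # --cond=ArtistID[@ID="806533"]
```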
Print all events on a given date: xml_grep -p --root="Event" --cond='EventDate[string()="2009-07-25"]' foo.xml
4. Multiple --cond flags are combined using OR, so you can print nodes matching any one of several artists:
Print all events for either of two artists: xml_grep -p --root="Event" --cond='ArtistID[@ID="806533"]' --cond='ArtistID[@ID="806532"]' foo.xml
5. The result of xml_grep is itself an XML document and can be piped into another xml_grep to do additional extraction.
Extract the time and performance name of events on a given date: xml_grep -p --root="Event" --cond='EventDate[string()="2009-07-25"]' foo.xml | xml_grep --root='EventTime' --root='PerformanceName'
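The two-stage pipeline can be mimicked in Python to see what each stage contributes. This is a sketch with ElementTree, not xml_grep itself. The `events_on` and `pick` helpers, the EventTime values, and the second event are all invented for the example, since the sample document above has no EventTime element.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: EventTime and the second event are not in the original sample.
SAMPLE = """<Events>
  <Event ID="0E0042D2CF2E9E44">
    <EventDate>2009-07-25</EventDate>
    <EventTime>19:30</EventTime>
    <PerformanceName>Summer Splash 2009</PerformanceName>
  </Event>
  <Event ID="HYPOTHETICAL-0002">
    <EventDate>2009-08-01</EventDate>
    <EventTime>20:00</EventTime>
    <PerformanceName>Hypothetical Show</PerformanceName>
  </Event>
</Events>"""


def events_on(doc, date):
    """Stage 1: keep Event elements whose EventDate text equals `date`."""
    return [e for e in doc.iter("Event") if e.findtext("EventDate") == date]


def pick(events, tags):
    """Stage 2: pull out (tag, text) for the listed child elements, in document order."""
    return [(child.tag, child.text)
            for event in events
            for child in event
            if child.tag in tags]


doc = ET.fromstring(SAMPLE)
stage1 = events_on(doc, "2009-07-25")
stage2 = pick(stage1, {"EventTime", "PerformanceName"})
print(stage2)
```

The first stage plays the role of the --root/--cond filter and the second plays the role of the two --root flags applied to its output.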
6. Match a regular expression: xml_grep --root="EventName[string() =~ /Bil.* El.*/]" foo.xml
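For comparison, the same kind of unanchored regular-expression match can be sketched in Python. The EventName elements and their values are invented for the example (the sample document above does not contain any), and re.search stands in for Perl's =~ operator.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical data: EventName does not appear in the original sample.
SAMPLE = """<Events>
  <Event><EventName>Billy Elliot</EventName></Event>
  <Event><EventName>Summer Splash 2009</EventName></Event>
</Events>"""

# Same pattern as in the command above; search() matches anywhere in the text.
pattern = re.compile(r"Bil.* El.*")

doc = ET.fromstring(SAMPLE)
matching = [name.text for name in doc.iter("EventName")
            if name.text and pattern.search(name.text)]
print(matching)
```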
2 Comments so far
Nice post, I’ll have to link to it.
Note that if the files aren’t too large to fit in memory, you can also use xml_grep2, which you can find for now at http://xmltwig.com/tool/. It is based on XML::LibXML instead of XML::Twig, and it offers more complete XPath support, as well as a set of options more consistent with grep itself.
I am not sure which one is easier to use, though. Having separate roots and conditions in xml_grep might feel more natural than xml_grep2’s pure XPath syntax (you need to combine root and cond in a single XPath expression).
By mirod on 08.18.09 12:01 pm
Added an example of matching regular expressions.
By philbo on 08.18.09 6:03 pm