ClickAider

xml_grep cookbook

iLike frequently receives large XML files from our partners that we need to analyze.

For ad-hoc analysis of XML from the command line, xml_grep is a really valuable tool. Instead of extracting lines that match regular expressions, xml_grep extracts XML nodes that match XPath expressions.

However, it took me a bit of trial and error to figure out how to install and use it successfully. Here's a quick cookbook of how I did it; I'll update this post with additional useful examples as I encounter them.

Installation

In most *nix distributions, you can install it by simply typing

sudo cpan -i "XML::Twig"

Answer 'y' when cpan asks whether you want to install the other packages that xml_grep depends on.

(xml_grep is included as part of the XML::Twig package on CPAN. XML::Twig also provides a rich Perl API; however, the command line examples here do not require any Perl knowledge.)
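After the install finishes, it can be worth confirming that the xml_grep script actually ended up on your PATH. The following is a minimal sketch; the variable name xml_grep_status is just an illustrative choice, and where CPAN puts scripts varies from system to system.

```shell
# Check whether the xml_grep executable is reachable from the current PATH.
if command -v xml_grep >/dev/null 2>&1; then
  xml_grep_status="found at $(command -v xml_grep)"
else
  # CPAN may have installed the script into a directory that is not on PATH.
  xml_grep_status="not found; check that CPAN's script directory is on your PATH"
fi
echo "xml_grep: $xml_grep_status"
```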

Usage examples

Suppose you have the following xml:

<?xml version="1.0" ?>
<Events>
  <Event ID="0E0042D2CF2E9E44">
    <ArtistIDs>
      <ArtistID ID="806533" Type="Primary"/>
      <ArtistID ID="1134688" Type="Secondary"/>
    </ArtistIDs>
    <EventDate>2009-07-25</EventDate>
    <PerformanceName>Summer Splash 2009</PerformanceName>
  </Event>
  <!-- etc -->
</Events>
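To try the commands in this post yourself, one option is to save the sample as foo.xml, the file name the examples use. This sketch writes it to the current directory with a here-document and then counts the Event elements as a quick sanity check.

```shell
# Write the sample document to foo.xml in the current directory.
# The quoted 'EOF' keeps the shell from expanding anything inside the XML.
cat > foo.xml <<'EOF'
<?xml version="1.0" ?>
<Events>
  <Event ID="0E0042D2CF2E9E44">
    <ArtistIDs>
      <ArtistID ID="806533" Type="Primary"/>
      <ArtistID ID="1134688" Type="Secondary"/>
    </ArtistIDs>
    <EventDate>2009-07-25</EventDate>
    <PerformanceName>Summer Splash 2009</PerformanceName>
  </Event>
  <!-- etc -->
</Events>
EOF

# Count the lines that open an Event element (the trailing space skips <Events> and <EventDate>).
grep -c '<Event ' foo.xml
```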

xml_grep allows you to specify both --root (the node to match and print out) and --cond (an XPath expression relative to root that filters the results). If more than one --root is provided, the results are combined using OR; likewise with --cond.

1. Here's the simplest possible example. --root is an XPath expression specifying which nodes will be printed. -p causes the results to be pretty-printed:

Print all events: xml_grep -p --root="Event" foo.xml

2. The next example shows how we can use --cond to filter the results based on an attribute of a subnode of the root.

Print all events for a given artist: xml_grep -p --root="Event" --cond='ArtistID[@ID="806533"]' foo.xml

3. Here's how we can match against the text contents of a subelement:

Print all events on a given date: xml_grep -p --root="Event" --cond='EventDate[string()="2009-07-25"]' foo.xml

4. Multiple --cond flags are combined using OR, so you can print nodes matching any one of several artists:

Print all events for either of two artists: xml_grep -p --root="Event" --cond='ArtistID[@ID="806533"]' --cond='ArtistID[@ID="806532"]' foo.xml

5. The result of xml_grep is itself an XML document and can be piped into another xml_grep to do additional extraction.

Extract the time and performance name of events on a given date: xml_grep -p --root="Event" --cond='EventDate[string()="2009-07-25"]' foo.xml | xml_grep --root='EventTime' --root='PerformanceName'
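Since multiple --cond flags are ORed together, one way to require two conditions at once is to chain two xml_grep calls with the same piping trick. The sketch below assumes the second xml_grep reads from standard input as in the example above; events_matching_both is a hypothetical helper name, not part of xml_grep.

```shell
# Hypothetical helper: print events that satisfy BOTH conditions.
#   $1: XML file   $2: first condition   $3: second condition
events_matching_both() {
  # The first call keeps events matching $2; the second filters that output by $3.
  xml_grep --root="Event" --cond="$2" "$1" \
    | xml_grep -p --root="Event" --cond="$3"
}

# Example: events for artist 806533 that also fall on 2009-07-25.
# Only runs when xml_grep is installed and foo.xml exists.
if command -v xml_grep >/dev/null 2>&1 && [ -f foo.xml ]; then
  events_matching_both foo.xml 'ArtistID[@ID="806533"]' 'EventDate[string()="2009-07-25"]'
fi
```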

6. Match a regular expression: xml_grep --root="EventName[string() =~ /Bil.* El.*/]" foo.xml

2 Comments so far

Nice post, I’ll have to link to it.

Note that if the files aren’t too large to fit in memory, you can also use xml_grep2, which you can find for now at http://xmltwig.com/tool/ . It is based on XML::LibXML instead of XML::Twig and it offers a more complete XPath support, as well as a set of options more consistent with grep itself.

I am not sure which one is easier to use though, the fact that you have separate roots and conditions in xml_grep might feel more natural than xml_grep2’s pure XPath syntax (you need to combine root and cond in a single XPath expression).

Added an example of matching regular expressions.

