You are currently browsing the Phil Bogle’s Blog weblog archives.

xml_grep cookbook

iLike frequently receives large XML files from our partners that we need to analyze.

For ad-hoc analysis of XML from the command line, xml_grep is a really valuable tool. Instead of extracting lines that match regular expressions, xml_grep extracts XML nodes that match XPath expressions.

However, it took me a bit of trial and error to figure out how to install and use it successfully. Here's a quick cookbook of how I did it; I'll update this post with additional useful examples as I encounter them.

Installation

In most *nix distributions, you can install it by simply typing

sudo cpan -i "XML::Twig"

Answer 'y' when CPAN asks you if you want to install other packages that xml_grep depends on.

(xml_grep is included as a part of the XML::Twig package in CPAN. XML::Twig also provides a rich Perl API; however the command line examples here do not require any Perl knowledge.)

Usage examples

Suppose you have the following xml:

XML:
  <?xml version="1.0" ?>
  <Events>
    <Event ID="0E0042D2CF2E9E44">
      <ArtistIDs>
        <ArtistID ID="806533" Type="Primary"/>
        <ArtistID ID="1134688" Type="Secondary"/>
      </ArtistIDs>
      <EventDate>2009-07-25</EventDate>
      <PerformanceName>Summer Splash 2009</PerformanceName>
    </Event>
    <!-- etc -->
  </Events>

xml_grep allows you to specify both --root (the node to match and print out) and --cond (an XPath expression relative to root that filters the results). If more than one root is provided, the results are combined using OR; likewise with cond.
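To make the --root/--cond semantics concrete, here is a rough Ruby sketch of the same selection logic using REXML. It only illustrates how the two options interact and is not how xml_grep is implemented; the helper name grep_nodes, the second event, and its ID are made up for this example.

```ruby
require 'rexml/document'

# Sample document: the event from the XML above plus one invented event.
XML_SAMPLE = <<~XML
  <Events>
    <Event ID="0E0042D2CF2E9E44">
      <ArtistIDs>
        <ArtistID ID="806533" Type="Primary"/>
        <ArtistID ID="1134688" Type="Secondary"/>
      </ArtistIDs>
      <EventDate>2009-07-25</EventDate>
      <PerformanceName>Summer Splash 2009</PerformanceName>
    </Event>
    <Event ID="HYPOTHETICAL-2">
      <ArtistIDs>
        <ArtistID ID="42" Type="Primary"/>
      </ArtistIDs>
      <EventDate>2009-08-01</EventDate>
      <PerformanceName>Another Show</PerformanceName>
    </Event>
  </Events>
XML

# Hypothetical helper: select every element named `root`, then keep only
# those containing a match for at least one of `conds` (OR semantics).
def grep_nodes(xml, root, conds = [])
  doc = REXML::Document.new(xml)
  nodes = REXML::XPath.match(doc, "//#{root}")
  return nodes if conds.empty?

  nodes.select do |node|
    conds.any? { |cond| REXML::XPath.first(node, ".//#{cond}") }
  end
end

conds = ['ArtistID[@ID="806533"]', 'ArtistID[@ID="806532"]']
MATCHED_IDS = grep_nodes(XML_SAMPLE, 'Event', conds).map { |e| e.attributes['ID'] }
puts MATCHED_IDS.inspect
```

With no conditions every Event is returned; with several, an Event is kept as soon as any one of them matches, which mirrors the OR behaviour described above.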

1. Here's the simplest possible example. --root is an XPath expression specifying which nodes will be printed. -p causes the results to be pretty-printed:

Print all events : xml_grep -p --root="Event" foo.xml

2. The next example shows how we can use --cond to filter the results based on an attribute of a subnode of the root.

Print all events for a given artist: xml_grep -p --root="Event" --cond='ArtistID[@ID="806533"]' foo.xml

3. Here's how we can match against the text contents of a subelement:

Print all events on a given date: xml_grep -p --root="Event" --cond='EventDate[string()="2009-07-25"]' foo.xml

4. Multiple --cond flags are combined using OR. To print events matching any one of several artists:

Print all events for either of two artists: xml_grep -p --root="Event" --cond='ArtistID[@ID="806533"]' --cond='ArtistID[@ID="806532"]' foo.xml

5. The result of xml_grep is itself an XML document and can be piped into another xml_grep to do additional extraction.

Extract the time and performance name of events on a given date: xml_grep -p --root="Event" --cond='EventDate[string()="2009-07-25"]' foo.xml | xml_grep --root='EventTime' --root='PerformanceName'

6. Match a regular expression: xml_grep --root="EventName[string() =~ /Bil.* El.*/]" foo.xml

Tutorial: OpenSocial Data Pipelining and Templates in the iLike Profile view on Orkut

Early this year, the Orkut team became aware that "a small subset of OpenSocial applications [were] being used to spread phishing attacks to Orkut users."

To solve this, the Orkut team phased out support for Flash and JavaScript in profile views and required that app developers use OpenSocial data pipelining and templates to specify profile markup.

Data pipelining is a declarative way to specify the set of REST calls that fetch the JSON data for a page; these can include both OpenSocial calls and arbitrary REST calls. Templates are an HTML markup language (akin to JSTL or RHTML) that is interpreted on the OpenSocial servers to generate the page markup. The template language includes simple expressions, conditionals, and looping constructs, all bound against the JSON data from data pipelining.

Orkut is currently the only container that has implemented these features, which are expected to be part of the OpenSocial 0.9 spec.

The specs for these features are still evolving, and some of the online tutorials and wiki entries are inconsistent or no longer work. I thought it might be helpful to document a working example (as of August 2009) based on the iLike profile view.

The first step is to require the opensocial-data and opensocial-templates features in the ModulePrefs.

XML:
  <Module>
    <ModulePrefs>
      <Require feature="opensocial-data"/>
      <Require feature="opensocial-templates">
        <Param name="process-on-server">true</Param>
      </Require>

Next we declare the profile content...

XML:
  <Content type="html" view="profile">

...and specify the JSON requests required by the profile view using the os:HttpRequest declaration:

XML:
  <script type="text/os-data" xmlns:os="http://ns.opensocial.org/2008/markup">
    <os:HttpRequest key="cache" authz="signed" href="http://philbo.dev.ilike.com/gadget/ilike_async_get_cache_key" format="json" signViewer="false" params="orkut_profile=true"/>
    <os:HttpRequest key="profile" authz="signed" href="http://philbo.dev.ilike.com/gadget/profile_tracks_json" params="synd=orkut&key=${cache.content.key}" format="json" signViewer="false"/>
  </script>

The ilike_async_get_cache_key request returns JSON like this: {"key": 139222}.

The profile_tracks_json request returns JSON like this:
{"tracks":[{"artist_name":"Bouncing Souls","name":"The Pizza Song"},{"artist_name":"The Aquabats","name":"Pizza Day"}],"track_count":2}

Note how the second request includes a query string parameter (${cache.content.key}) derived from the JSON data returned by the first request. In this case, the profile_tracks_json response is set to have a very long cache lifetime; the key is a profile timestamp used to force a fresh version of the profile to be fetched when the profile changes. (The timestamp is maintained in memcached and can be returned very quickly by the server.)

One poorly documented detail: Orkut's os:HttpRequest wraps each response in a "content" hash, which is why we have to say "cache.content.key" rather than "cache.key".
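The dependency between the two requests can be pictured with a small Ruby sketch. This is not Orkut's implementation, just a model of the behaviour described above: each response is stored under its key inside a "content" wrapper, and ${...} expressions in later requests are resolved against what has already been fetched. The names resolve and expand are invented for this illustration.

```ruby
require 'json'

# Look up a dotted path such as "cache.content.key" in nested hashes.
def resolve(context, path)
  path.split('.').reduce(context) { |node, part| node.fetch(part) }
end

# Replace each ${dotted.path} in a string with its value from the context.
def expand(template, context)
  template.gsub(/\$\{([^}]+)\}/) { resolve(context, Regexp.last_match(1)).to_s }
end

# First response, stored under its key and wrapped in "content".
context = { 'cache' => { 'content' => JSON.parse('{"key": 139222}') } }

# Parameters of the second request, expanded against the first response.
puts expand('synd=orkut&key=${cache.content.key}', context)
# prints synd=orkut&key=139222
```

Without the "content" level in the path, the lookup would fail, which matches the behaviour seen on Orkut.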

Now that we have the data, we declare the template that binds to the JSON data to generate the markup. Note the use of "if" and "repeat" attributes for conditionals and loops and the "${}" expressions; also note the required use of osx:NavigateToApp to link to canvas pages.

XML:
  <script type="text/os-template" xmlns:os="http://ns.opensocial.org/2008/markup" xmlns:osx="http://ns.opensocial.org/2009/extensions">
    <div id="profile_songs">
      <div class="gadget_profile_header">
        <osx:NavigateToApp params="{path:&quot;songs_ilike&quot;}">
          See All
        </osx:NavigateToApp>
        <b>Songs iLike</b>
      </div>

      <div if="${profile.content.track_count == 0}">
        No songs
      </div>

      <div if="${profile.content.track_count > 0}">
        <ul>
          <li repeat="${profile.content.tracks}" var="track">
            <osx:NavigateToApp params="{&quot;track_name&quot;:&quot;${track.name}&quot;,&quot;artist_name&quot;:&quot;${track.artist_name}&quot;}">
              <span class="song_title">
                ${track.name}
              </span>
            </osx:NavigateToApp>
          </li>
        </ul>
      </div>
    </div>
  </script>
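For readers more at home in Ruby than in template attributes, the branching in the template is roughly equivalent to the following. Treat it only as a mental model of what the server-side template processor does with the "if" and "repeat" attributes: render_profile is a made-up name, and the markup is simplified (no header, no NavigateToApp links, no HTML escaping).

```ruby
require 'json'

# Hypothetical stand-in for the template. `profile` is the parsed JSON as
# the template sees it, i.e. wrapped in a "content" hash.
def render_profile(profile)
  content = profile.fetch('content')

  # <div if="${profile.content.track_count == 0}">
  return '<div>No songs</div>' if content['track_count'] == 0

  # <li repeat="${profile.content.tracks}" var="track">
  items = content['tracks'].map do |track|
    "<li><span class=\"song_title\">#{track['name']}</span></li>"
  end
  "<div><ul>#{items.join}</ul></div>"
end

json = '{"tracks":[{"artist_name":"Bouncing Souls","name":"The Pizza Song"},' \
       '{"artist_name":"The Aquabats","name":"Pizza Day"}],"track_count":2}'
puts render_profile('content' => JSON.parse(json))
```

The two "if" attributes act as mutually exclusive branches here because the conditions (== 0 and > 0) cannot both hold; the template language itself has no "else".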

In our case, the template is itself generated using an RHTML template in the Rails framework. We can use Ruby expressions to help simplify generation of the static template, but not, of course, any of the dynamic content or conditionals that depend upon the JSON data.

For example, the grungy escaped NavigateToApp parameters are actually generated by this more readable RHTML code:

XML:
  <osx:NavigateToApp params="<%= escape_json('path' => 'track_page', 'autoplay' => true, 'artist_name' => '${track.artist_name}', 'track_name' => '${track.name}') %>">
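The post does not show escape_json itself. A plausible minimal version, assuming it simply serializes the hash to JSON and HTML-escapes the result so that it can sit inside a double-quoted attribute, might look like this (the real iLike helper may well differ):

```ruby
require 'json'
require 'erb'

# Assumed behaviour: hash -> JSON text -> HTML-escaped text, so that the
# double quotes in the JSON become &quot; inside the attribute value.
def escape_json(hash)
  ERB::Util.html_escape(hash.to_json)
end

puts escape_json('path' => 'track_page', 'track_name' => '${track.name}')
# prints {&quot;path&quot;:&quot;track_page&quot;,&quot;track_name&quot;:&quot;${track.name}&quot;}
```

The ${...} expressions pass through untouched because they sit in single-quoted Ruby strings, so Rails emits them literally and the OpenSocial template processor fills them in later.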