ScrubyT 0.2.6 released
The authors of scrubyt just announced the release of scrubyt 0.2.6.
scrubyT is a Ruby scraping framework built on top of Hpricot and Mechanize; its most interesting feature is the ability to automatically derive XPath extraction expressions from a “training” parser that includes specific examples of phrases to extract from pages. The new version includes some valuable improvements, including automatic crawling of detail pages and regex-based specification of example data.
We have just released the new version, 0.2.6, with some great new features, tons of bugfixes and a lot of changes overall which should greatly improve the reliability of the system.
A lot of long-awaited features have been added: most notably, automatic crawling to the detail pages, which was the most requested feature in scRUBYt!’s history. I will add a tutorial and detailed example on how to use this feature, which enables you to easily crawl a whole site.
Another great addition is the improved example generation - you don’t have to use the whole text of the element you would like to match anymore - it is enough to specify a substring, and the first element that contains the string will be returned. Moreover, you can also use regular expressions, in which case the first element with text content matching the regexp will be returned. If this still isn’t enough, it is possible to create a compound example like this:
flight :begins_with => 'Arrival', :contains => /\d{4}/, :ends_with => '20:00'
I guess it’s quite intuitive how this should work.
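To make the matching rules concrete, here is a minimal sketch in plain Ruby of how a substring, regexp, or compound example could select the first matching text. It is not scRUBYt!'s actual implementation and does not use its API: the helper names first_match and example_matches?, and the sample strings, are hypothetical and exist only for illustration.

```ruby
# Hypothetical helpers, NOT part of scRUBYt!: they only illustrate the
# "first element whose text satisfies the example" rule described above.

# Returns true if text satisfies the example, which may be a String
# (substring match), a Regexp, or a Hash of compound constraints.
def example_matches?(text, example)
  case example
  when String then text.include?(example)
  when Regexp then !(text =~ example).nil?
  when Hash
    # Every constraint in a compound example must hold.
    example.all? do |kind, pattern|
      case kind
      when :begins_with then text.start_with?(pattern)
      when :ends_with   then text.end_with?(pattern)
      when :contains    then example_matches?(text, pattern)
      else raise ArgumentError, "unknown constraint: #{kind.inspect}"
      end
    end
  else
    raise ArgumentError, "unsupported example: #{example.inspect}"
  end
end

# Returns the first text that satisfies the example, or nil if none does.
def first_match(texts, example)
  texts.find { |text| example_matches?(text, example) }
end

# Made-up element texts standing in for a scraped page.
texts = [
  'Departure 2007-05-01 18:30',
  'Arrival 2007-05-01 19:15',
  'Arrival 2007-05-01 20:00'
]

puts first_match(texts, 'Arrival')   # substring example
puts first_match(texts, /\d{2}:00/)  # regexp example
puts first_match(texts, :begins_with => 'Arrival',
                        :contains    => /\d{4}/,
                        :ends_with   => '20:00') # compound example
```

With these sample strings, the substring example picks the 19:15 arrival because it is the first text containing 'Arrival', while the regexp and compound examples both pick the 20:00 arrival.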
We have fixed an enormous number of bugs and tested the whole system thoroughly, so the overall reliability should be much improved compared to the previous releases.
If you have any comments, questions, suggestions, please visit the brand new forum!