Go Back   CORTEX Forums > Best Practices > Tips and Techniques > Data Warehousing Tips and Techniques > DWH Tip Feeds
Register Blogs FAQ Members List Calendar Search Today's Posts Mark Forums Read

Using Tidy to clean webpages with Pentaho Data Integration

This is a discussion on Using Tidy to clean webpages with Pentaho Data Integration within the DWH Tip Feeds forums, part of the Data Warehousing Tips and Techniques category; Here's an issue I've come across multiple times: I need to scrape HTML websites to extract data. Pentaho Data Integration ( kettle ) has lots of functionality on-board to make ...


Reply
 
LinkBack Thread Tools Search this Thread Display Modes
Old 31st May 2011, 04:17 AM   #1
News Bot
 
Join Date: Nov 2007
Posts: 15,067
Latest News Headlines is on a distinguished road
Post Using Tidy to clean webpages with Pentaho Data Integration

Here's an issue I've come across multiple times: I need to scrape HTML websites to extract data. Pentaho Data Integration (kettle) has lots of functionality on-board to make this an easy process, except one: it does not support reading data directly from HTML.

In this short post, I provide a simple tip to clean HTML pages and convert them to XML so you can extract its data using the conventional "Get data from XML" step. The solution hinges on two ingredients: Standard Kettle tools for Webservices

Kettle is really good at fetching data from the web and extracting data from webservices, be they in a SOAP/XML, REST/JSON or RSS flavor. (There is an extensive chapter on this subject in Pentaho Kettle Solutions). But when you're dealing with plain old HTML, things can get pretty hairy.

If you're lucky, the page may be in XHTML, and in that case it's worth trying the Get Data from XML step. However, quite often a webpage that claims to be XHTML is not well-formed XML, and even if it is, Kettle does not understand things like entities, which are valid in XHTML, but not in plain XML. And of course, more often than not, you're not lucky, and XHTML represents only a minor fraction of all the web pages out there.Workaround: JavaScript string manipulation

In the past, I usually worked around these issues. In practice, some quick and dirty string manipulation using the Modified Javascript Value step and some built-in indexOf(), substring and replace() functions go a long way.

In most cases I don't really need the entire web page, but only a ,
    or
      element in the . Excising only the interesting sections out of the page using plain string manipulation will often get rid of most of the cruft that prevents the data from being treated as XML. For example, if we only need to get the rows from a table with a particular id attribute, we can use a JavaScript snippet like this:
      //table we're looking for
      var startHandle = "";
      var startPosition= html.indexOf(startHandle);
      //look beyond the start tag to lose the invalid unquoted attributes
      startPosition += startHandle.length;

      //find where this table ends (lucky us, no nested table elements :)
      var endHandle = "";
      var endPosition = html.indexOf(endHandle, startPosition);

      //make a complete table fragment out of it again
      var table = "" + html.substring(startPosition, endPosition + endHandle.length);

      //replace nbsp entities, empty unclosed img elements, and value-less nowrap attributes
      table = table.replace(/ |]>|nowrap/ig, "");


      There are of course no guarantees that the sections you cut out like that are in fact well-formed XML, but in my experience it's often worth a try.A better way: using JTidy

      While the JavaScript workaround may just work for your particular case, it certainly has disadvantages. Sometimes it may just be not so simple to clean the HTML with plain string manipulation. And of course there is a performance issue too - the JavaScript step can be quite slow.

      Fortunately, there is a better way.

      Using a user-defined Java Class step we can have JTidy do the dirty work of cleaning the HTML and converting it to XML, which we can then process in a sane way with Kettle's Get Data from XML step.

      We need to do two things to make this work: first, you have to download JTidy, unzip it, and place the jtidy-r938.jar in the libext directory, which resides immediately in your kettle installation directory. (note that if you were running spoon, you need to restart it before it will be picked up). Second, you need a little bit of glue code for the User-defined Java class step so Kettle can use the Tidy class inside the jar. I came up with the following Java snippet to make it work:
      import org.w3c.tidy.Tidy;
      import java.io.StringReader;
      import java.io.StringWriter;

      protected Tidy tidy;

      public boolean init(StepMetaInterface stepMetaInterface, StepDataInterface stepDataInterface)
      {
      //create and configure a Tidy instance
      tidy = new Tidy();
      tidy.setXmlOut(true);
      return parent.initImpl(stepMetaInterface, stepDataInterface);
      }

      public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
      Object[] r;
      //Get row from incoming stream.
      //Bail out if its not there.
      if ((r = getRow()) == null) {
      setOutputDone();
      return false;
      }

      //read the value of the html input field
      //the html field happens to be the 5th field in the stream,
      //because java arrays start at 0, we use index 4 to reference it
      StringReader html = new StringReader((String)r[4]);

      //use tidy to parse html to xml
      StringWriter xml = new StringWriter();
      tidy.parse(html, xml);

      //assign the xml to the output row
      //note we simply overwrite the original html field from the input row.
      r[4] = xml.toString();

      //push the output row to the outgoing stream.
      putRow(data.outputRowMeta, r);
      return true;
      }

      The big advantage of using Tidy is that you can be sure that the result is well-formed XML. In addition, you can have JTidy report on any errors or warnings, which makes it much more robust than any ad-hoc string manipulation you can come up with.


      More from Roland Bouman's Blog ...
Latest News Headlines is offline  
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiTweet this Post!
Reply With Quote
Reply

Bookmarks

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is On
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
Excellent book ? Pentaho 3.2 Data Integration : Beginner’s Guide Latest News Headlines Open Source News and Opinion 0 8th May 2010 10:43 AM
Book Review : Pentaho 3.2 Data Integration Latest News Headlines Data Integration News Feeds 0 7th May 2010 09:23 AM
Pentaho Dominates Open Source BI and Data Integration Adoption Latest News Headlines Other International Vendors 0 23rd December 2009 03:49 AM
Pentaho Dominates Open Source Business Intelligence and Data Integration Adoption Latest News Headlines Other International Vendors 0 9th December 2009 07:31 AM
Pentaho Data Integration: Javascript Step Performance Latest News Headlines DWH Tip Feeds 0 25th November 2009 11:03 AM


All times are GMT +11. The time now is 06:50 PM.

© The Business Intelligence Group

Search Engine Optimization by vBSEO