JavaScript EditorFreeware JavaScript Editor     Perl Tutorials 



Main Page Previous Section Next Section

Recipe 20.19 Extracting Table Data

20.19.1 Problem

You have data in an HTML table, and you would like to turn that into a Perl data structure. For example, you want to monitor changes to an author's CPAN module list.

20.19.2 Solution

Use the HTML::TableContentParser module from CPAN:

use HTML::TableContentParser;

$tcp = HTML::TableContentParser->new;
$tables = $tcp->parse($HTML);

foreach $table (@$tables) {
  @headers = map { $_->{data} } @{ $table->{headers} };
  # attributes of table tag available as keys in hash
  $table_width = $table->{width};

  foreach $row (@{ $tables->{rows} }) {
    # attributes of tr tag available as keys in hash
    foreach $col (@{ $row->{cols} }) {
      # attributes of td tag available as keys in hash
      $data = $col->{data};
    }
  }
}

20.19.3 Discussion

The HTML::TableContentParser module converts all tables in the HTML document into a Perl data structure. As with HTML tables, there are three layers of nesting in the data structure: the table, the row, and the data in that row.

Each table, row, and data tag is represented as a hash reference. The hash keys correspond to attributes of the tag that defined that table, row, or cell. In addition, the value for a special key gives the contents of the table, row, or cell. In a table, the value for the rows key is a reference to an array of rows. In a row, the cols key points to an array of cells. In a cell, the data key holds the HTML contents of the data tag.

For example, take the following table:

<table width="100%" bgcolor="#ffffff">
  <tr>
    <td>Larry &amp; Gloria</td>
    <td>Mountain View</td>
    <td>California</td>
  </tr>
  <tr>
    <td><b>Tom</b></td>
    <td>Boulder</td>
    <td>Colorado</td>
  </tr>
  <tr>
    <td>Nathan &amp; Jenine</td>
    <td>Fort Collins</td>
    <td>Colorado</td>
  </tr>
</table>

The parse method returns this data structure:

[
  {
    'width' => '100%',
    'bgcolor' => '#ffffff',
    'rows' => [
               {
                'cells' => [
                            { 'data' => 'Larry &amp; Gloria' },
                            { 'data' => 'Mountain View' },
                            { 'data' => 'California' },
                           ],
                'data' => "\n      "
               },
               {
                'cells' => [
                            { 'data' => '<b>Tom</b>' },
                            { 'data' => 'Boulder' },
                            { 'data' => 'Colorado' },
                           ],
                'data' => "\n      "
               },
               {
                'cells' => [
                            { 'data' => 'Nathan &amp; Jenine' },
                            { 'data' => 'Fort Collins' },
                            { 'data' => 'Colorado' },
                           ],
                'data' => "\n      "
               }
              ]
  }
]

The data tags still contain tags and entities. If you don't want the tags and entities, remove them by hand using techniques from Recipe 20.6.

Example 20-11 fetches a particular CPAN author's page and displays in plain text the modules they own. You could use this as part of a system that notifies you when your favorite CPAN authors do something new.

Example 20-11. Dump modules for a particular CPAN author
  #!/usr/bin/perl -w
  # dump-cpan-modules-for-author - display modules a CPAN author owns
  use LWP::Simple;
  use URI;
  use HTML::TableContentParser;
  use HTML::Entities;
  use strict;
  our $URL = shift || 'http://search.cpan.org/author/TOMC/';
  my $tables = get_tables($URL);
  my $modules = $tables->[4];    # 5th table holds module data
  foreach my $r (@{ $modules->{rows} }) {
    my ($module_name, $module_link, $status, $description) = 
        parse_module_row($r, $URL);
    print "$module_name <$module_link>\n\t$status\n\t$description\n\n";
  } 
  sub get_tables {
    my $URL = shift;
    my $page = get($URL);
    my $tcp = new HTML::TableContentParser;
    return $tcp->parse($page);
  }
  sub parse_module_row {
    my ($row, $URL) = @_;
    my ($module_html, $module_link, $module_name, $status, $description);
    # extract cells
    $module_html = $row->{cells}[0]{data};  # link and name in HTML
    $status      = $row->{cells}[1]{data};  # status string and link
    $description = $row->{cells}[2]{data};  # description only
    $status =~ s{<.*?>}{  }g; # naive link removal, works on this simple HTML
    # separate module link and name from html
    ($module_link, $module_name) = $module_html =~ m{href="(.*?)".*?>(.*)<}i;
    $module_link = URI->new_abs($module_link, $URL); # resolve relative links
    # clean up entities and tags
    decode_entities($module_name);
    decode_entities($description);
    return ($module_name, $module_link, $status, $description);
  }

20.19.4 See Also

The documentation for the CPAN module HTML::TableContentParser; http://search.cpan.org

    Main Page Previous Section Next Section
    


    JavaScript EditorJavaScript Verifier     Perl Tutorials


    ©