2015-09-01 17:27:08 8 Comments

I'm trying to automate the extraction of a transcript from a website. The entire transcript sits between dl tags, since the site formats the interview as a description list. The script below lets me search the site and extract the transcript as plain text, but what I actually want is everything between the dl tags, markup included: the dt's, dd's, and so on. That would let us write our own CSS for the interview.

One thing to note about the page: line breaks (br tags) are inserted at various points in the interview. Some tools we've tried that extract content by matching tag pairs stumble on this, because they only grab the content up to the first break. Just something to keep in mind if you point me in a different direction. Here's what I have so far.

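#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;

# The page URL is not given in this post; here it is read from the command line
my $url = shift @ARGV or die "Usage: $0 URL\n";

my $mech = WWW::Mechanize->new();
# Apply the TreeBuilder role so $mech gains HTML::Element methods such as find()
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get($url);

# find all <dl> tags
my @list = $mech->find('dl');

foreach my $dl (@list) {
    print $dl->as_text();
}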

If there is a tool that essentially prints what I have, only this time as HTML, please let me know of it!


@Tim 2015-09-01 17:46:36

Your code is fine. Just change the as_text() call to as_HTML(), and the output will include the HTML tags along with the content.
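As a rough illustration of that change, here is a minimal sketch that runs offline against an inline HTML string instead of a fetched page. The sample markup and the variable names ($html, @fragments) are made up for the example, and it uses HTML::TreeBuilder directly rather than WWW::Mechanize. One detail worth knowing: by default as_HTML() leaves out optional end tags such as </dt> and </dd>, so the sketch passes an empty hash reference as the third argument to have them written out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for the interview page; the real markup is not shown above.
my $html = <<'END_HTML';
<html><body>
<dl>
<dt>Interviewer</dt>
<dd>Where did you grow up?<br>And when did you leave?</dd>
<dt>Guest</dt>
<dd>In a small town.</dd>
</dl>
</body></html>
END_HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @fragments;
foreach my $dl ( $tree->find('dl') ) {
    # '<>&' : encode only these three characters as entities
    # '  '  : indent the output with two spaces per level
    # {}    : treat no end tag as optional, so </dt> and </dd> are emitted
    push @fragments, $dl->as_HTML( '<>&', '  ', {} );
}

print @fragments;

# Free the tree explicitly once the fragments have been serialized
$tree->delete;
```

The br element inside the dd is serialized in place like any other child, so the content after the break is kept. In the original script the same three arguments should work on the elements returned by $mech->find('dl').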
