Friday, August 10, 2007

Column-oriented data extraction

Perl is a terrific language for text processing, but several readers have written wondering about how to extract columns of data from text files with Perl.

For instance, when you have a text database that looks like this:

      simpson bart springfield
      flinstonefred bedrock
      rubble barney bedrock
what's the best way to extract the last name from each database record?

When every record is separated by a delimiter (such as a ',', ':', or '|' symbol) it's easy to use the split function. But in a case like this, where data is stored in fixed-width fields, and no field delimiter is used, how do you extract your data?

It's still not too hard

Although you can't jump right in and easily use the split function, the solution is almost as easy, though maybe not as obvious. To solve the "fixed-width field" problem, just use Perl's substr function.

Using our three line database above as an example, it appears that the fields of the database are defined like this:

    Field Number Field Description Field Width
    1 Last name 9
    2 First name 9
    3 City 11
Given these definitions, here's how I'd grab the data from each record from a file named database:
    open (IN, 'database');

    while () {

    chomp;

    $last_name = substr $_, 0, 9; # extract the last name field
    $first_name = substr $_, 9, 9; # extract the first name field
    $city = substr $_, 18, 11; # extract the city field

    # do something useful here with each value ...

    print "$first_name $last_name lives in $city\n";

    }

    close (IN);

The substr() function

As you can see from this code fragment, the substr function is an important part of the solution. The substr function lets you extract fields of information from strings. All you have to do is define three items:
    1. The name of the string
    2. The offset, or starting point
    3. The length of the substring to extract
In the example above, the name of the string we're extracting information from is $_, the default string when you're reading from a data file like this. This string holds the value of each record that we're reading from the database file.

The offset is really just the starting point of each field. In our sample database, the last_name field begins in column 1, the first_name field begins in column 10, and the city field begins in column 19. Given these values, the offset for each field is 0, 9, and 18, respectively. (Remember that Perl, like C, C++, Java, and other languages, begin counting at 0. Therefore, if a field starts in column 10, it's really going to be the 9th element of a string; columns 1-9 of the database will be stored in the string elements numbered 0 through 8.)

The length of each substring is also defined by the database. In our example, the lengths of the three fields are the same as the field widths, 9, 9, and 11, respectively.

2 comments:

Web Data Extraction said...

Data Extraction can be defined as the process of retrieving Data from an unstructured source in order to further process it or store it. It is extremely useful for large organizations who deal with huge amounts of data, on a daily basis, that need to be processed into meaningful information and stored for later use. Thanks a lot.

Carly Fiorina said...

Hello Friends,

Data extraction is the act or process of retrieving data out of data sources for further data processing or data storage. The import into the intermediate extracting system is thus usually followed by data transformation and possibly the addition of meta data prior to export to another stage in the data workflow. Usually, the term data extraction is applied when data is first imported into a computer from primary sources, like measuring or recording devices. Thanks.....
Web Page Scraping