Quick XML Stripping Script
by: Justin Klein Keane
June 8, 2005
I wrote this little Perl script so I could strip elements out of an XML file without having to work with an XML parser. For instance, I have an XML file resembling:

<collection>
  <member>
    <id>21</id>
    <name>foo</name>
  </member>
  <member>
    <id>35</id>
    <name>bar</name>
  </member>
</collection>
And I just want to axe all the <member> records with an id of 35. In reality my file was over 200,000 lines long, so doing this sort of thing by hand was out of the question. The following script reads the file line by line, buffers each <member> record, and checks its lines against a search string (which can be a regular expression), here the <id> element. If a record matches, the entire member record is written to 'failureFile.xml'; otherwise the good record is written to 'output.xml'. Both output files are appended to if they already exist and are not empty.
#! /usr/bin/perl
#
# Filename: shredXML.pl
# Purpose: scan through elements and grab elements with certain property
#          value pairs and axe out the entire containing element
#
# Author: Justin C. Klein Keane
#
use strict;

my $fileToShred = "good_member4import.xml"; #filename
my $outputFile = "output.xml"; #output file
my $failureFile = "failureFile.xml"; #stripped xml
my $container = "<member>"; #open tag
my $containerCloser = "</member>"; #close tag
my $searcher = "(<id>35</id>)"; #search for this

my $fileToRead = checkfile($fileToShred);
my $fileToWrite = checkOutputFile($outputFile);
my $failFile = checkOutputFile($failureFile);
my @holderVar; #just some empty space to hold strings
my $starter = 0;
my $dogCatcher = 0;
my $i = 0;
my $x = 0;

#read the input file
while ( <$fileToRead> ) {
  my $thisLine = $_;
  chomp($thisLine);
  if ( $thisLine =~ m/$container/ ) {
    $starter = 1; #got a start (tag opened)
  }
  if ( $thisLine =~ m/$containerCloser/) {
    $starter = 2; #tag closed
  }
  #write the element into memory
  if ( $starter == 0 ) {
    $thisLine .= "\n";
    print $fileToWrite $thisLine;
  }
  else {
    $holderVar[$i] = $thisLine . "\n";
    $i++;
  }
  #does the element contain a 'hit' code? if so mark $holderVar to dump it
  if ( $thisLine =~ m/$searcher/ ) {
    $dogCatcher = 1;
    # print "got a hit\n";
  }
  if ( $starter == 2 && $dogCatcher == 1) {
    #put this dog down
    my $endValue = scalar(@holderVar);
    for ($x=0;$x<$endValue;$x++) {
      # uncomment to debug:
      # my $tagWriter = "<!--- " . $holderVar[$x] . " --->";
      # print $fileToWrite $tagWriter;
      print $failFile $holderVar[$x];
      $holderVar[$x] = "";
    }
    $dogCatcher = 0;
    $i=0;
    $starter = 0;
  }
  elsif ( $starter == 2 && $dogCatcher == 0) {
    #legit doggie, let him roam
    my $endValue = scalar(@holderVar);
    for ($x=0;$x<$endValue;$x++) {
      print $fileToWrite $holderVar[$x];
      $holderVar[$x] = "";
    }
    $i=0;
    $starter = 0;
  }
}

#subroutines
sub checkfile {
  #checks the input file to make sure it's valid and can be opened
  my $file = $_[0];
  if (length($file) == 0) {print "No input file specified.\n"; return 0;}
  my $theFile;
  #three-argument open keeps odd characters in the filename from being read as a mode
  if (! open($theFile, "<", $file)) {
    logError("failed to open file '" . $file . "'. Check to see if it exists.");
    return 0;
  }
  else {
    return $theFile;
  }
}

sub checkOutputFile {
  #checks the output files to make sure they're valid
  my $file = $_[0];
  my $openFile;
  my $status = (stat($file))[7]; #file size in bytes, undefined if the file is missing
  if (! $status) { $status = 0;}
  if ( $status != 0) {
    #existing, non-empty file: append to it
    open($openFile, ">>", $file) or logError("Couldn't open output file for appending " . $file);
    return $openFile;
  }
  else {
    #missing or empty file: create it
    open($openFile, ">", $file) or logError("Couldn't create new output file " . $file);
    return $openFile;
  }
}

sub logError {
  print $_[0] . "\n";
}