html - Match any character (including newlines) in sed

Question

asked Oct 24, 2021 in Technique by 深蓝

I have a sed command that I want to run on a huge, terrible, ugly HTML file that was created from a Microsoft Word document. All it should do is remove any instance of the string

style='text-align:center; color:blue;
exampleStyle:exampleValue'

The sed command that I am trying to modify is

sed "s/ style='[^']*'//" fileA > fileB

It works great, except that whenever there is a new line inside of the matching text, it doesn't match. Is there a modifier for sed, or something I can do to force matching of any character, including newlines?

I understand that regexps are terrible at XML and HTML, blah blah blah, but in this case, the string patterns are well-formed in that the style attributes always start with a single quote and end with a single quote. So if I could just solve the newline problem, I could cut down the size of the HTML by over 50% with just that one command.

In the end, it turned out that Sinan Ünür's Perl script worked best. It was almost instantaneous, and it reduced the file size from 2.3 MB to 850 KB. Good ol' Perl...

1 Answer

深蓝 · Answer 1 · 2021-10-23T21:38:03+0000

sed processes its input one line at a time, which means, as I understand it, that a plain substitution never sees the newline inside the attribute, so what you want is not straightforward in sed.

You could use the following Perl script (untested), though:

#!/usr/bin/perl

use strict;
use warnings;

{
    local $/; # slurp mode
    my $html = <>;
    $html =~ s/ style='[^']*'//g;
    print $html;
}

__END__

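A one-liner would be (\047 is the octal escape for a single quote, which avoids ending the shell's single-quoted string):

$ perl -e 'local $/; $_ = <>; s/ style=\047[^\047]*\047//g; print' fileA > fileB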
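
For comparison, here is a minimal sketch of how the same substitution might be done in sed itself by first collecting the whole file into the pattern space. It is not part of the original answer. The sample content written to fileA is made up for illustration, and the approach assumes the file fits comfortably in memory (which a 2.3 MB file should).

```shell
# Work in a scratch directory so the sample files do not clobber anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input: a style attribute split across two lines.
printf '%s\n' "<p style='text-align:center; color:blue;" \
              "exampleStyle:exampleValue'>Hello</p>" > fileA

# :a    define a label
# $!N   unless on the last line, append the next line to the pattern space
# $!ba  unless on the last line, jump back to the label
# The substitution then runs once over the whole file, embedded newlines included.
sed -e ':a' -e '$!N' -e '$!ba' -e "s/ style='[^']*'//g" fileA > fileB

cat fileB   # prints <p>Hello</p>
```

GNU sed also has a -z option that separates records on NUL bytes instead of newlines, which has a similar effect for ordinary text files, but it is not available in every sed.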
