Delete HTML Tags in Sed or Similar

Delete html tags in sed or similar

sed 's/<[^>]\+>//g' will strip all tags out, but you might want to replace them with a space so that text from adjacent tags doesn't run together: <td>one</td><td>two</td> would otherwise become onetwo. So you could use sed 's/<[^>]\+>/ /g', which outputs one two (well, actually " one  two ", with one space for every tag removed). Note that \+ is a GNU extension to basic regular expressions.
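As a rough sketch of the difference, the snippet below runs both substitutions on the sample row and then adds a third variant that squeezes the extra spaces. The variable names and the tidy-up step are illustrative assumptions, and -E with + is used here in place of \+ so the same command also runs on non-GNU sed.

```shell
# Sample input from the answer above.
html='<td>one</td><td>two</td>'

# Delete every tag: adjacent cell contents run together.
stripped=$(printf '%s\n' "$html" | sed -E 's/<[^>]+>//g')

# Replace every tag with a space: words stay apart, but spaces pile up.
spaced=$(printf '%s\n' "$html" | sed -E 's/<[^>]+>/ /g')

# Assumed tidy-up: squeeze runs of spaces, then trim both ends.
tidy=$(printf '%s\n' "$html" | sed -E 's/<[^>]+>/ /g; s/ +/ /g; s/^ //; s/ $//')

printf '[%s]\n' "$stripped" "$spaced" "$tidy"
```

The brackets in the output make the leading, doubled, and trailing spaces of the second variant visible.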

That said, unless all you need is the raw text (and it sounds like you want to transform the data after stripping the tags), a scripting language like Perl might be a better fit for this.

As mu is too short mentioned, scraping HTML can be a bit dicey. Using something that actually parses the HTML for you would be the best way to do this. PHP's DOM API is pretty good for these kinds of things.

Sed remove tags from html file

You can use one of the many HTML-to-text converters, a Perl regex such as <.+?> if Perl is available, or, if it must be sed, <[^>]*>:

sed -e 's/<[^>]*>//g' file.html

If there's no room for error, use an HTML parser instead.
For example, when an element is spread over two lines

<div
>Lorem ipsum</div>

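this regular expression will not work, because sed applies it to one line at a time and the opening <div ... > is never complete on a single line.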
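To make the failure concrete, here is a small sketch that runs the same substitution on the two-line example, first line by line and then after reading the whole input into the pattern space. The slurping loop is a common sed idiom offered as a possible workaround, not something the answer above proposes, and it still does not make sed a real HTML parser.

```shell
# The two-line element from the example above.
input='<div
>Lorem ipsum</div>'

# Line by line: "<div" has no closing ">", so only </div> is removed.
per_line=$(printf '%s\n' "$input" | sed 's/<[^>]*>//g')

# Assumed workaround: append every following line (N) until the last one,
# then substitute once; [^>] also matches the embedded newlines.
slurped=$(printf '%s\n' "$input" |
  sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/<[^>]*>//g')

printf 'per line:\n%s\n\nslurped:\n%s\n' "$per_line" "$slurped"
```

Reading the whole file into memory is fine for small inputs; for large or messy documents a parser remains the safer choice.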


This regular expression consists of three parts: <, [^>]*, and >

  • search for an opening <
  • followed by zero or more (*) characters that are not the closing >;
    [...] is a character class, and when it starts with ^ it matches characters not in the class
  • and finally look for the closing >

The simpler regular expression <.*> will not work, because it searches for the longest possible match, i.e. up to the last closing > in an input line. For example, when you have more than one tag in an input line,

<name>Olaf</name> answers questions.

will result in

answers questions.

instead of

Olaf answers questions.

See also "Repetition with Star and Plus", especially the section "Watch Out for The Greediness!" and following, for a detailed explanation.
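The greedy and the negated-class patterns can be compared side by side on the sample sentence. This is only an illustrative sketch; the variable names are made up for the demonstration.

```shell
# Sample line from the explanation above.
line='<name>Olaf</name> answers questions.'

# Greedy: <.*> runs from the first "<" to the last ">" on the line.
greedy=$(printf '%s\n' "$line" | sed 's/<.*>//g')

# Negated class: each match stops at the nearest ">".
negated=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf 'greedy:  [%s]\nnegated: [%s]\n' "$greedy" "$negated"
```

The greedy result keeps the space that followed the last tag, so it starts with a blank.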

sed or other - remove specific html tag text from file

Should I be using the sed command for desired results?

Actually, grep suits this better:

grep -Ev '</(body|html)>' file

Hello World

If you want to remove only the specific string </body>\n</html>\n, then use this sed command, which should work with any version of sed:

sed '/<\/body>/{N; /<\/html>/ {N; s~</body>\n</html>\n~~;};}' file

Hello World
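For a quick check of the grep approach, the sketch below builds a throwaway file with assumed contents (the original question's file is not shown here) and filters it. Keep in mind that grep -v drops every line containing one of the closing tags, including any other text on that line.

```shell
# Hypothetical sample file; the real input from the question is not available.
tmp=$(mktemp)
printf '%s\n' 'Hello World' '</body>' '</html>' > "$tmp"

# -E enables alternation, -v inverts the match: keep lines without the tags.
kept=$(grep -Ev '</(body|html)>' "$tmp")

printf '%s\n' "$kept"
rm -f "$tmp"
```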

How to remove the html tags using sed

sed -n 's/<[^>]*>//gp' test.csv | sed '/^$/d'

You are almost there. The dot (.) you used can also match a ">" character, so remove it from your command.

The command after the pipe removes all blank lines.
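A small sketch of how the pipeline behaves, using made-up input in place of test.csv. One detail worth knowing: with -n and the p flag, sed prints only lines where a substitution actually happened, so lines without any tag are dropped as well.

```shell
# Hypothetical stand-in for test.csv.
input='<td>a,b</td>
<br>
no tags here'

# First sed strips tags and prints only changed lines;
# second sed deletes the lines that became empty.
result=$(printf '%s\n' "$input" | sed -n 's/<[^>]*>//gp' | sed '/^$/d')

printf '%s\n' "$result"
```

If untagged lines should survive, dropping -n and the p flag is one option, assuming that matches what the original question wanted.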

Remove specific tag with its contents using sed

I suggest using a sed separator other than / when / appears in the text you want to match. Also, prefer -E over -r for extended regular expressions to stay POSIX compatible, and note that the first span in your regex contains a / that doesn't belong there.
In addition, .* is overly greedy and will eat up any </span> that follows the first </span> on the line. It's better to match on [^<]*, that is, any character that is not <.

sed -E 's,<span class="the_class_name">[^<]*</span>,,g'

A better option is of course to use an HTML parser for this.
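The effect of [^<]* versus .* shows up as soon as a line holds two such spans. The input line below is invented for illustration.

```shell
# Hypothetical line with two spans of the same class.
line='a<span class="the_class_name">x</span>b<span class="the_class_name">y</span>c'

# Greedy: one match from the first opening span to the last </span>.
greedy=$(printf '%s\n' "$line" | sed -E 's,<span class="the_class_name">.*</span>,,g')

# Bounded: each span is matched and removed on its own.
bounded=$(printf '%s\n' "$line" | sed -E 's,<span class="the_class_name">[^<]*</span>,,g')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

The bounded pattern has its own limit: it will not match a span whose content contains another tag, since [^<]* stops at the first <.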

Remove empty HTML tags from a file using sed

You could use the below sed command to remove only the empty tags.

sed 's/<[^\/][^<>]*> *<\/[^<>]*>//g' file

Through Perl,

perl -pe 's/<([^<>]*)>\s*<\/\1>//g' file
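As a sketch of what the sed version removes, the invented line below mixes empty pairs, a pair with content, and a mismatched pair. The sed pattern does not compare tag names, so it also removes <a></b>, whereas the Perl version with \1 requires the closing name to repeat the opening tag's text.

```shell
# Hypothetical input: empty pair, pair with content, pair holding only a
# space, and a mismatched empty pair.
line='<p></p><b>keep</b><i> </i><a></b>'

# Same command as above, fed from a pipe instead of a file.
cleaned=$(printf '%s\n' "$line" | sed 's/<[^\/][^<>]*> *<\/[^<>]*>//g')

printf '%s\n' "$cleaned"
```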

Remove html tag using regex with sed

All of sed's regular expressions look for the leftmost, longest match. Perl and others may support the form .*? for non-greedy matching, but sed doesn't.

If you want to delete those lines, try:

sed '\|<p lang="en-US" class="western c31"></p>|d' hasil.html

d is sed's delete command.

If you want to use a substitute command to remove only those tags, leaving behind whatever else, if anything, was on the line:

sed 's|<p lang="en-US" class="western c31"></p>||g' hasil.html
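The two commands can be compared on a few made-up lines standing in for hasil.html. Note that d removes every line containing the tag pair, even when other text shares the line, while s leaves the rest of the line (possibly an empty line) behind.

```shell
# Hypothetical stand-in for hasil.html.
input='keep me
<p lang="en-US" class="western c31"></p>
before <p lang="en-US" class="western c31"></p> after'

# Delete whole lines that contain the empty paragraph.
deleted=$(printf '%s\n' "$input" | sed '\|<p lang="en-US" class="western c31"></p>|d')

# Remove only the tag pair and keep whatever else is on the line.
substituted=$(printf '%s\n' "$input" | sed 's|<p lang="en-US" class="western c31"></p>||g')

printf 'delete:\n%s\n\nsubstitute:\n%s\n' "$deleted" "$substituted"
```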

