Delete html tags in sed or similar
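sed 's/<[^>]\+>//g'
will strip all tags out, but you might want to replace them with a space so that text from adjacent tags doesn't run together, e.g. <td>one</td><td>two</td> becoming onetwo. So you could do
sed 's/<[^>]\+>/ /g'
so it would output one two (well, actually " one  two ", with a leading space, a doubled space in the middle, and a trailing space).
Note that \+ is a GNU extension to basic regular expressions; [^>][^>]* is the portable spelling.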
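A minimal sketch of the two variants side by side, assuming a sed that accepts -E (so + can replace the GNU-only \+). The function names and the extra whitespace clean-up step are illustrative assumptions, not part of the original commands.

```shell
#!/bin/sh
# Strip tags outright: text from adjacent cells runs together.
strip_tags() {
    sed -E 's/<[^>]+>//g'
}

# Replace each tag with a space, then squeeze runs of spaces
# and trim one leading and one trailing space.
strip_tags_spaced() {
    sed -E -e 's/<[^>]+>/ /g' -e 's/ +/ /g' -e 's/^ //' -e 's/ $//'
}

printf '%s\n' '<td>one</td><td>two</td>' | strip_tags
printf '%s\n' '<td>one</td><td>two</td>' | strip_tags_spaced
```

The first call prints onetwo, the second prints one two with the stray spaces cleaned up.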
That said, unless you need just the raw text (and it sounds like you are trying to do some transformations on the data after stripping the tags), a scripting language like Perl might be a more fitting tool for this.
As mu is too short mentioned, scraping HTML can be a bit dicey; using something that actually parses the HTML for you would be the best way to do this. PHP's DOM API is pretty good for these kinds of things.
Sed remove tags from html file
You can either use one of the many HTML to text converters, use a Perl regex if possible (<.+?>), or, if it must be sed, use <[^>]*>:
sed -e 's/<[^>]*>//g' file.html
If there's no room for errors, use an HTML parser instead.
E.g. when an element is spread over two lines
<div
>Lorem ipsum</div>
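this regular expression will not work, because sed processes one line at a time and never sees the opening <div together with its closing >.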
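If you still want to stay in sed for the split-tag case, one possible workaround is to keep appending lines while a < is still unclosed and only then strip. This is a sketch under assumptions: the helper name strip_tags_multiline is made up, and it still breaks on a literal > inside attribute values or comments, so it is no substitute for a parser.

```shell
#!/bin/sh
# Hypothetical helper: while the pattern space ends inside an unclosed "<",
# append the next input line (N) and re-check; then strip complete tags.
# [^>] also matches the embedded newline, so a tag may span lines.
strip_tags_multiline() {
    sed -e ':a' \
        -e '/<[^>]*$/{' \
        -e 'N' \
        -e 'ba' \
        -e '}' \
        -e 's/<[^>]*>//g'
}

printf '<div\n>Lorem ipsum</div>\n' | strip_tags_multiline
```

With the two-line <div example as input this prints Lorem ipsum.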
This regular expression consists of three parts: <, [^>]*, and >
- search for an opening <
- followed by zero or more characters (*) which are not the closing >; [...] is a character class, and when it starts with ^ it matches characters not in the class
- and finally look for the closing >
The simpler regular expression <.*> will not work, because it searches for the longest possible match, i.e. up to the last closing > in an input line. E.g., when you have more than one tag in an input line,
<name>Olaf</name> answers questions.
will result in
 answers questions.
instead of
Olaf answers questions.
See also Repetition with Star and Plus, especially section Watch Out for The Greediness! and following, for a detailed explanation.
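The difference between the greedy and the bounded pattern can be checked directly. A small sketch using the example line from above; the variable names are only for illustration.

```shell
#!/bin/sh
line='<name>Olaf</name> answers questions.'

# Greedy: .* runs to the last ">" on the line, so "Olaf" is deleted too.
greedy=$(printf '%s\n' "$line" | sed 's/<.*>//g')

# Negated class: each match stops at the first ">" after its "<".
bounded=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

# Brackets make the leftover leading space in the greedy result visible.
printf 'greedy:  [%s]\n' "$greedy"
printf 'bounded: [%s]\n' "$bounded"
```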
sed or other - remove specific html tag text from file
Should I be using the sed command for desired results?
Actually, grep suits it better:
grep -Ev '</(body|html)>' file
Hello World
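To try the grep variant without touching a real file, here is a small sketch with a made-up three-line input (the actual file from the question may look different). Keep in mind that grep -v drops whole lines, so any text sharing a line with </body> or </html> would be lost as well.

```shell
#!/bin/sh
# Hypothetical sample input standing in for "file".
sample=$(printf 'Hello World\n</body>\n</html>\n')

# -v inverts the match, -E enables the (body|html) alternation.
filtered=$(printf '%s\n' "$sample" | grep -Ev '</(body|html)>')

printf '%s\n' "$filtered"
```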
If you want to remove only the specific </body>\n</html>\n string, then use this sed command, which should work with any version of sed:
sed '/<\/body>/{N; /<\/html>/ {N; s~</body>\n</html>\n~~;};}' file
Hello World
How to remove the html tags using sed
sed -n 's/<[^>]*>//gp' test.csv | sed '/^$/d'
You are almost there. The dot (.) you used could match a ">" character, so remove it from your command.
The command after the pipe clears all blank lines.
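One thing worth knowing about the -n ... p combination: only lines on which a substitution actually happened are printed, so lines without any tag disappear from the output. A short sketch with invented input (not the real test.csv) that compares both forms:

```shell
#!/bin/sh
# Hypothetical input: a tagged line, a tag-only line, and a plain line.
input=$(printf '<b>a,b</b>\n<br>\nplain,row\n')

# As in the answer: -n plus the p flag prints only lines that were changed.
with_n=$(printf '%s\n' "$input" | sed -n 's/<[^>]*>//gp' | sed '/^$/d')

# Without -n and p, untagged lines pass through as well.
without_n=$(printf '%s\n' "$input" | sed 's/<[^>]*>//g' | sed '/^$/d')

printf '%s\n' "$with_n"
printf '%s\n' '--'
printf '%s\n' "$without_n"
```

If untagged rows should survive, the second form is probably what you want.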
Remove specific tag with its contents using sed
I suggest using a different sed separator than / when / is contained within the thing you want to match on. Also, prefer -E instead of -r for extended regex to be POSIX compatible. Also note that you have a / in the first span in your regex that doesn't belong there.
Also, .* will make it overly greedy and eat up any </span> that follows the first </span> on the line. It's better to match on [^<]*, that is, any run of characters that are not <.
sed -E 's,<span class="the_class_name">[^<]*</span>,,g'
A better option is, of course, to use an HTML parser for this.
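To see the effect of [^<]* versus .* on a line with several spans, here is a sketch with a made-up input line. The function names are illustrative; the_class_name is the placeholder from the command above. Note that [^<]* will not match a span that itself contains nested tags.

```shell
#!/bin/sh
# Bounded: the contents may not contain "<", so each span is matched separately.
remove_span() {
    sed -E 's,<span class="the_class_name">[^<]*</span>,,g'
}

# Greedy, for contrast: runs from the first matching span to the last </span>.
remove_span_greedy() {
    sed -E 's,<span class="the_class_name">.*</span>,,g'
}

line='a<span class="the_class_name">x</span>b<span>keep</span>c<span class="the_class_name">y</span>d'
printf '%s\n' "$line" | remove_span
printf '%s\n' "$line" | remove_span_greedy
```

The bounded version leaves ab<span>keep</span>cd, while the greedy one leaves only ad.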
Remove empty HTML tags from a file using sed
You could use the below sed command to remove only the empty tags.
sed 's/<[^\/][^<>]*> *<\/[^<>]*>//g' file
Through Perl,
perl -pe 's/<([^<>]*)>\s*<\/\1>//g' file
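The sed command above does not compare the opening and closing names, while the Perl one does via \1. A backreference also works in sed's basic regular expressions, so a possible sketch of a name-checking variant follows; the function names and sample line are assumptions for illustration. As with the Perl version, a tag with attributes such as <div class="x"></div> is not matched, because the captured text includes the attributes.

```shell
#!/bin/sh
# The sed command from the answer: any empty pair, names are not compared.
remove_empty_any() {
    sed 's/<[^\/][^<>]*> *<\/[^<>]*>//g'
}

# Sketch: capture the opening tag text and require the same text after "</".
# "|" is used as the delimiter so "/" needs no escaping.
remove_empty_matched() {
    sed 's|<\([^<>/][^<>]*\)> *</\1>||g'
}

line='<p>text</p><b></b><i> </i><b></i>'
printf '%s\n' "$line" | remove_empty_any
printf '%s\n' "$line" | remove_empty_matched
```

The first call also removes the mismatched <b></i> pair; the second leaves it in place.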
Remove html tag using regex with sed
All of sed's regular expressions look for the (leftmost) longest match. Perl and others may support the form .*? for non-greedy regexes, but sed doesn't.
If you want to delete those lines, try:
sed '\|<p lang="en-US" class="western c31"></p>|d' hasil.html
d is sed's delete command.
If you want to use a substitute command to remove only those tags, leaving behind whatever else, if anything, was on the line:
sed 's|<p lang="en-US" class="western c31"></p>||g' hasil.html
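The difference between the two commands shows up when the tag shares a line with other text. A small sketch with invented input standing in for hasil.html:

```shell
#!/bin/sh
tag='<p lang="en-US" class="western c31"></p>'

# Hypothetical stand-in for hasil.html: a plain line, a tag-only line,
# and a line where the tag sits between other text.
input=$(printf 'keep me\n%s\nbefore %s after\n' "$tag" "$tag")

# d removes every line that contains the tag, including the surrounding text.
deleted=$(printf '%s\n' "$input" | sed '\|<p lang="en-US" class="western c31"></p>|d')

# s removes only the tag itself; a tag-only line is left behind as an empty line.
stripped=$(printf '%s\n' "$input" | sed 's|<p lang="en-US" class="western c31"></p>||g')

printf '%s\n' "$deleted"
printf '%s\n' '--'
printf '%s\n' "$stripped"
```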