Regex Select All Text Between Tags

Regex select all text between tags

You can use "<pre>(.*?)</pre>" (replacing pre with whichever tag name you want) and extract the first capture group; for more specific instructions, specify a language. This assumes that the HTML is very simple and valid.

As other commenters have suggested, if you're doing something complex, use an HTML parser.
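As a rough JavaScript sketch of the "extract the first group" step: firstBetween is a hypothetical helper name, the sample string is made up, and the tag is assumed to be a plain name with no attributes. It uses [\s\S] instead of . so the content may span several lines, which is a small variation on the pattern above.

```javascript
// Hypothetical helper: returns the text inside the first <tag>...</tag> pair, or null.
// Assumes `tag` is a plain tag name (letters/digits only), so it needs no regex escaping.
function firstBetween(tag, html) {
  // [\s\S] matches any character including newlines; *? keeps the match lazy.
  const re = new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`);
  const m = re.exec(html);
  return m ? m[1] : null; // group 1 is the text between the tags
}

// Made-up sample input with two <pre> elements.
const sample = "<div><pre>line 1\nline 2</pre><pre>other</pre></div>";
console.log(firstBetween("pre", sample)); // logs the two lines of the first <pre>
console.log(firstBetween("pre", "<p>no pre here</p>")); // null
```

Only the first pair is returned; add the g flag and loop if you need every occurrence.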

Regex that extracts text between tags, but not the tags

You can use the following regex:

>([^<]*)<

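or, without a capture group:

>[^<]*<

With the first pattern the text is in capture group 1; with the second, strip the leading '>' and trailing '<' from each match.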
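Here is how both variants might look in JavaScript; this is only a sketch and the sample string is invented. Note that adjacent tags such as "</li><li>" produce empty matches, which are filtered out here.

```javascript
// Hypothetical sample input.
const html = "<li>one</li><li>two</li>";

// Variant 1: read capture group 1; drop the empty strings found between adjacent tags.
const fromGroup = Array.from(html.matchAll(/>([^<]*)</g), m => m[1])
  .filter(text => text !== "");

// Variant 2: no group, so trim the leading '>' and trailing '<' from each match.
const fromSlice = (html.match(/>[^<]*</g) || [])
  .map(whole => whole.slice(1, -1))
  .filter(text => text !== "");

console.log(fromGroup); // [ 'one', 'two' ]
console.log(fromSlice); // [ 'one', 'two' ]
```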

Regex match text between tags

/<b>(.*?)<\/b>/g

Add the g (global) flag after the closing delimiter:

/<b>(.*?)<\/b>/g.exec(str)
//             ^-- here it is

However, if you want all matched elements, you need something like this:

var str = "<b>Bob</b>, I'm <b>20</b> years old, I like <b>programming</b>.";

// With the g flag, match() returns the whole matches (tags included), or null if there are none.
var result = (str.match(/<b>(.*?)<\/b>/g) || []).map(function(val){
  return val.replace(/<\/?b>/g,'');
});
//result -> ["Bob", "20", "programming"]

If the element has attributes, the regexp becomes:

/<b [^>]+>(.*?)<\/b>/g.exec(str)
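If some <b> tags have attributes and some don't, one option is a slightly looser pattern, <b\b[^>]*>, together with matchAll to read the capture group directly. This is a sketch with a made-up sample string, not the pattern from the answer above; the \b is there so that tags like <body> are not picked up.

```javascript
// Hypothetical sample: <b> tags with and without attributes.
const str = '<b class="name">Bob</b>, I\'m <b>20</b> years old, I like <b id="x">programming</b>.';

// <b\b[^>]*> accepts "<b>" and "<b ...attributes>", but not "<body>".
const boldRe = /<b\b[^>]*>(.*?)<\/b>/g;

// matchAll needs the g flag; m[1] is the text between the tags.
const result = Array.from(str.matchAll(boldRe), m => m[1]);
console.log(result); // [ 'Bob', '20', 'programming' ]
```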

Select text between 2 complete span tags using regex

Regex is not a good way to find HTML tags, but this should work for you:

<\s*span[^>]*>(.*?)<\s*\/\s*span>

DEMO: https://regex101.com/r/vbLN9L/6
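For illustration, the same pattern dropped into JavaScript might be used like this (the sample string is made up, and the g flag is added so every span is found):

```javascript
// The pattern from above, with the g flag so matchAll finds every span.
const spanRe = /<\s*span[^>]*>(.*?)<\s*\/\s*span>/g;

// Hypothetical sample input.
const html = '<span class="a">first</span> and <span>second</span>';

// Capture group 1 holds the text between each opening and closing span tag.
const contents = Array.from(html.matchAll(spanRe), m => m[1]);
console.log(contents); // [ 'first', 'second' ]
```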

Regex to extract pure text within specific HTML tag

How real software engineers solve this problem: use the right tool for the job, i.e. don't use regexes to parse HTML.

The most straightforward way is to use an HTML parsing library, since parsing even purely conforming XML with regex is extremely non-trivial, and handling all HTML edge cases is an inhumanly difficult task.


If your requirements are "you must use a regex library to pull innerHTML from a <p> element", I'd much prefer to split it into two tasks:

1) using regex to pull out the container element with its innerHTML. (I'm showing an example that only works for getting the outermost element of a known tag. To extract an arbitrary nested item you'd have to use some trick like https://blogs.msdn.microsoft.com/bclteam/2005/03/15/net-regular-expressions-regex-and-balanced-matching-ryan-byington/ to match the balanced expression)

2) using a simple Regex.Replace to strip out all the tags, leaving only the text

open System.Text.RegularExpressions

let html = @"<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>"

// Step 1: capture the contents of each <p>; step 2: strip any tags inside.
for m in Regex.Matches(html, @"<p>(.*?)</p>") do
    printfn "(%O)" (Regex.Replace(m.Groups.[1].Value, "<.*?>", ""))

(This is some strong text)
(This is some reallystrong text)

If you are constrained to a single Regex.Matches call, and you're okay with ignoring the possibility of nested <p> tags, you should be able to do it with a non-greedy match of alternating text parts and tag parts wrapped inside a <p>...</p> pattern. (As luck would have it, conformant HTML can't nest <p> elements, but this approach wouldn't work for a containing element like <div>.)

Note 1: this is F#, but it should be trivial to convert to C#.
Note 2: this relies on .NET-flavored regex features such as reusing a group name and keeping multiple captures per group.

let rx = @"
<p>
(?<p_text>
  (?:
    (?<text>[^<>]+)
    (?:<.*?>)+
  )*?
  (?<text>[^<>]+)?
)</p>
"
let regex = new Regex(rx, RegexOptions.IgnorePatternWhitespace)
for m in regex.Matches(@"
<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
") do
    printfn "p content: %O" m
    for capture in m.Groups.["text"].Captures do
        printfn "text: %O" capture

p content: <p>This is some <strong>strong</strong> text</p>
text: This is some
text: strong
text: text
p content: <p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
text: This is some
text: really
text: strong
text: text


Remember that neither of the above examples works well on malformed HTML or in cases where the same tag is nested in itself.
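To see why same-tag nesting is a problem, here is a small JavaScript sketch with made-up <div> input. A lazy match stops at the first closing tag, and a greedy one runs on to the last, so neither recovers the real element boundaries.

```javascript
// Hypothetical input: a <div> nested inside another <div>.
const nested = "<div>outer <div>inner</div> tail</div>";

// Lazy: stops at the FIRST </div>, cutting the outer element short.
const lazy = /<div>(.*?)<\/div>/.exec(nested)[1];
console.log(lazy); // outer <div>inner

// Hypothetical input: two sibling <div> elements.
const siblings = "<div>a</div><div>b</div>";

// Greedy: runs to the LAST </div>, swallowing both siblings.
const greedy = /<div>(.*)<\/div>/.exec(siblings)[1];
console.log(greedy); // a</div><div>b
```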

Regex - How can I select the text between some HTML tags right after a specific tag?

Use the g and m flags with the following regex pattern:

^<dt\s+class="prd_name">\s*<strong>\K.*?(?=<\/strong>)

https://regex101.com/r/fakRAE/1
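Note that \K is a PCRE-style feature and is not available in JavaScript's regex engine. If you need the same result there, one possible adaptation is to drop \K and the lookahead and read a capture group instead. The sample markup below is invented for the sketch.

```javascript
// Hypothetical sample: one product name per <dt class="prd_name"> line.
const html = [
  '<dt class="prd_name"><strong>Widget A</strong></dt>',
  '<dd><strong>ignored</strong></dd>',
  '<dt class="prd_name"> <strong>Widget B</strong></dt>',
].join("\n");

// Same structure as the pattern above, but with a capture group instead of \K.
// g finds every match; m lets ^ match at the start of each line.
const nameRe = /^<dt\s+class="prd_name">\s*<strong>(.*?)<\/strong>/gm;

const names = Array.from(html.matchAll(nameRe), m => m[1]);
console.log(names); // [ 'Widget A', 'Widget B' ]
```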

Regex just keep the content between tags but select everything

One idea is to match what you don't want, but capture what you need in group 1 (\1):

<script>[\s\S]*?<\/script>|((?:<(?!script)|[^<])[\s\S]*?)(?=<script|$)

So that the alternation does not skip over an opening <script, the second branch starts with either a character that is not < or a < that is not followed by script (checked with a negative lookahead). It then matches lazily until the next <script occurs or the end ($) is reached.
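A sketch of how the "match what you don't want, capture what you need" idea could be used from JavaScript, with an invented input string. Matches where group 1 is undefined are the <script> blocks and are skipped.

```javascript
// The pattern from above: script blocks match without a capture; everything else lands in group 1.
const keepRe = /<script>[\s\S]*?<\/script>|((?:<(?!script)|[^<])[\s\S]*?)(?=<script|$)/g;

// Hypothetical input with one script block in the middle.
const page = "<p>before</p><script>var x = 1;</script><p>after</p>";

const kept = [];
for (const m of page.matchAll(keepRe)) {
  // Group 1 is undefined when the first alternative (a script block) matched.
  if (m[1] !== undefined) kept.push(m[1]);
}
console.log(kept); // [ '<p>before</p>', '<p>after</p>' ]
```

As written, the first alternative only recognizes a bare <script> tag; an opening tag with attributes would need the pattern adjusted.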

RegEx for matching between any two HTML tags

A regex that matches a given string (here TEST-TEXT) only where it appears between tags, not inside a tag itself:

(?![^<>]*>)(TEST\-TEXT)
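The negative lookahead rejects any position from which a > can be reached without crossing another < or >, i.e. a position inside a tag. A small JavaScript sketch with a made-up string shows the effect: the occurrence inside the title attribute is left alone, while the one in the text content is wrapped.

```javascript
// Same pattern as above (the hyphen needs no escaping outside a character class).
const textRe = /(?![^<>]*>)(TEST-TEXT)/g;

// Hypothetical input: the string appears once inside a tag and once as text content.
const html = '<a title="TEST-TEXT">TEST-TEXT</a>';

// Only the occurrence between the tags is replaced.
const marked = html.replace(textRe, "<mark>$1</mark>");
console.log(marked); // <a title="TEST-TEXT"><mark>TEST-TEXT</mark></a>
```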


