Regex select all text between tags
You can use "<pre>(.*?)</pre>" (replacing pre with whatever tag name you want) and extract the first group (for more specific instructions, specify a language), but this assumes that you have very simple and valid HTML.
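Since no language was specified, here is a minimal sketch in Python of extracting the first group. The sample string and variable names are made up for illustration, and it assumes the simple, valid HTML described above.

```python
import re

# Hypothetical sample input; real pages are rarely this tidy.
html = "<p>intro</p><pre>first block</pre><pre>second block</pre>"

# The non-greedy (.*?) stops at the first closing tag instead of the last one.
pattern = re.compile(r"<pre>(.*?)</pre>")

first = pattern.search(html)
first_text = first.group(1) if first else None  # group(1) is the text between the tags

# With exactly one capture group, findall returns the group contents directly.
all_texts = pattern.findall(html)

print(first_text)
print(all_texts)
```

Without the `?` the pattern would run from the first opening tag to the last closing tag and return one long match.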
As other commenters have suggested, if you're doing something complex, use an HTML parser.
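For comparison, this is roughly what the parser route can look like using only Python's standard library `html.parser`. The class name `TagTextCollector` and the sample markup are hypothetical; a dedicated library may be more convenient for real pages.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collects the text inside each outermost occurrence of one tag (illustrative helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many target tags are currently open
        self.results = []   # one string per outermost target element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        # Only keep text while we are inside the target tag.
        if self.depth > 0:
            self._buffer.append(data)


collector = TagTextCollector("div")
collector.feed('<div class="a">outer <div>inner</div> tail</div><p>skip</p>')
collector.close()
print(collector.results)
```

Unlike the regex, this copes with attributes and with the same tag nested inside itself.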
RegEx to extract text between a HTML tag
Your comment shows that you have neglected to escape the backslashes in your regex string. And if you want to match lowercase letters, add a-z to the character classes, use Pattern.CASE_INSENSITIVE, or add (?i) to the beginning of the regex:
"<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>"
If the tag contents may contain newlines, then use Pattern.DOTALL or add (?s) to the beginning of the regex to turn on dotall/singleline mode.
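The same pattern can be tried out in Python, where a raw string avoids the double-backslash problem and `re.IGNORECASE` / `re.DOTALL` play the role of the Java flags. This is only a sketch; the sample input is invented.

```python
import re

# Raw string: \b and \1 reach the regex engine unchanged, no doubling needed.
# re.IGNORECASE ~ Pattern.CASE_INSENSITIVE, re.DOTALL ~ Pattern.DOTALL.
pattern = re.compile(
    r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)</\1>",
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical input: mixed-case tag names and a newline inside the first element.
html = '<DIV class="x">line one\nline two</div><span>short</span>'

# group(1) is the tag name as written, group(2) is the content between the tags.
pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(html)]
print(pairs)
```

The backreference `\1` forces the closing tag to have the same name as the opening tag, compared case-insensitively here because of the flag.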
retrieve string between html tags using regex
It's hacky as sin, but it works if you want to be able to find potentially different entries across a web page, not just a single one. Here @search is the class name of the div and @toSearch is the HTML content:
-- @search: class name identifying the target div; @toSearch: the HTML to scan
DECLARE @search nvarchar(200) = 'ExternalClass849E0BFE74914F8BB79B2C64E3D07AE6'
DECLARE @toSearch nvarchar(4000)
    = 'Lorem ipsum dolor sit amet.<div class="ExternalClass849E0BFE74914F8BB79B2C64E3D07AE6"><p>Just help me!</p></div>Lorem ipsum dolor sit amet.'
DECLARE @start int, @end int
-- Position just past the first <p> that follows the class name
SELECT @start = CHARINDEX('<p>', @toSearch, CHARINDEX(@search, @toSearch)) + 3
-- First </p> after that opening tag
SELECT @end = CHARINDEX('</p>', @toSearch, @start)
SELECT SUBSTRING(@toSearch, @start, @end - @start)
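The same index-based idea, sketched outside SQL for readers who want to test it quickly. The function name `text_in_first_p_after` is hypothetical, and unlike the query above it returns None when any of the three markers is missing.

```python
def text_in_first_p_after(marker, html):
    """Return the text of the first <p>...</p> that follows `marker`, or None."""
    anchor = html.find(marker)
    if anchor == -1:
        return None
    start = html.find("<p>", anchor)
    if start == -1:
        return None
    start += len("<p>")              # skip past the opening tag itself
    end = html.find("</p>", start)   # first closing tag after the opening one
    if end == -1:
        return None
    return html[start:end]


search = "ExternalClass849E0BFE74914F8BB79B2C64E3D07AE6"
to_search = (
    "Lorem ipsum dolor sit amet."
    '<div class="ExternalClass849E0BFE74914F8BB79B2C64E3D07AE6"><p>Just help me!</p></div>'
    "Lorem ipsum dolor sit amet."
)
result = text_in_first_p_after(search, to_search)
print(result)
```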
Regex that extracts text between tags, but not the tags
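You can use the following regex:
>([^<]*)<
or, >[^<]*<
With the first form, capture group 1 already holds the text without the angle brackets. With the second form, strip the leading '>' and trailing '<' from each match.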
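A short sketch of the capture-group form in Python, using an invented sample string. Note that adjacent tags produce empty matches, which usually need filtering out.

```python
import re

# Hypothetical input with adjacent tags (<li><b>) that yield an empty match.
html = "<li><b>bold</b> plain</li>"

# Group 1 is everything between a '>' and the next '<', brackets excluded.
raw = re.findall(r">([^<]*)<", html)

# Drop empty or whitespace-only pieces and trim the rest.
texts = [piece.strip() for piece in raw if piece.strip()]

print(raw)
print(texts)
```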
use RegEx to extract text between html tags
Parsing HTML with regex is not ideal. Others have suggested the HTML Agility Pack. However, if you can guarantee that your input is well-defined and you always know what to expect then using a regex is possible.
If you can make that guarantee, read on. Otherwise you need to consider the other suggestions or define your input better. In fact, you should define your input better regardless because my answer makes a few assumptions. Some questions to consider:
- Will the HTML be on one line or multiple lines, separated by newline characters?
- Will the HTML always be in the form of <div>...<h2...>...</h2><h3...>...</h3></div>? Or can there be h1-h6 tags?
- On top of the hN tags, will the date and number always be between the tags with id-date and nr values for the id attribute?
Depending on the answers to these questions the pattern can change. The following code assumes each HTML fragment follows the structure you shared, that it will have an h2 and h3 with date and number, respectively, and that each tag will be on a new line. If you feed it different input it will likely break until you adjust the pattern to match your input's structure.
' Requires Imports System.Text.RegularExpressions at the top of the file.
Dim input As String = "<div id=""div"">" & Environment.Newline & _
"<h2 id=""id-date"">09.09.2010</h2>" & Environment.Newline & _
"<h3 id=""nr"">000</h3>" & Environment.Newline & _
"</div>"
Dim pattern As String = "<div[^>]+>.*?" & _
"<h2\sid=""id-date"">(?<Date>\d{2}\.\d{2}\.\d{4})</h2>.*?" & _
"<h3\sid=""nr"">(?<Number>\d+)</h3>.*?</div>"
Dim m As Match = Regex.Match(input, pattern, RegexOptions.Singleline)
If m.Success Then
    ' DateTime.Parse uses the current culture; use DateTime.ParseExact if the format is fixed.
    Dim actualDate As DateTime = DateTime.Parse(m.Groups("Date").Value)
    Dim actualNumber As Integer = Int32.Parse(m.Groups("Number").Value)
    Console.WriteLine("Parsed Date: " & m.Groups("Date").Value)
    Console.WriteLine("Actual Date: " & actualDate)
    Console.WriteLine("Parsed Number: " & m.Groups("Number").Value)
    Console.WriteLine("Actual Number: " & actualNumber)
Else
    Console.WriteLine("No match!")
End If
The pattern can be on one line but I broke it up for clarity. RegexOptions.Singleline is used to allow the . metacharacter to match \n (newlines).
You also said: "Also and this will be in loop, meaning there are more div block needed to be parsed."
Are you looping over separate strings? Or are you expecting multiple occurrences of the above HTML structure in a single string? If the former, the above code should be applied to each string. For the latter you'll want to use Regex.Matches and treat each Match result similarly to the above piece of code.
EDIT: here is some sample code to demonstrate parsing multiple occurrences.
Dim input As String = "<div id=""div"">" & Environment.Newline & _
"<h2 id=""id-date"">09.09.2010</h2>" & Environment.Newline & _
"<h3 id=""nr"">000</h3>" & Environment.Newline & _
"</div>" & _
"<div id=""div"">" & Environment.Newline & _
"<h2 id=""id-date"">09.14.2010</h2>" & Environment.Newline & _
"<h3 id=""nr"">123</h3>" & Environment.Newline & _
"</div>"
Dim pattern As String = "<div[^>]+>.*?" & _
"<h2\sid=""id-date"">(?<Date>\d{2}\.\d{2}\.\d{4})</h2>.*?" & _
"<h3\sid=""nr"">(?<Number>\d+)</h3>.*?</div>"
For Each m As Match In Regex.Matches(input, pattern, RegexOptions.Singleline)
    Dim actualDate As DateTime = DateTime.Parse(m.Groups("Date").Value)
    Dim actualNumber As Integer = Int32.Parse(m.Groups("Number").Value)
    Console.WriteLine("Parsed Date: " & m.Groups("Date").Value)
    Console.WriteLine("Actual Date: " & actualDate)
    Console.WriteLine("Parsed Number: " & m.Groups("Number").Value)
    Console.WriteLine("Actual Number: " & actualNumber)
Next
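For readers outside .NET, the multiple-occurrence loop can be sketched in Python with `re.finditer` and named groups. One assumption to flag: the date format is taken to be month.day.year because the sample value 09.14.2010 only makes sense that way; adjust the format string if your data differs.

```python
import re
from datetime import datetime

# Same two fragments as the sample above, joined into one string.
html = (
    '<div id="div">\n<h2 id="id-date">09.09.2010</h2>\n<h3 id="nr">000</h3>\n</div>'
    '<div id="div">\n<h2 id="id-date">09.14.2010</h2>\n<h3 id="nr">123</h3>\n</div>'
)

# re.DOTALL mirrors RegexOptions.Singleline: '.' also matches newlines.
pattern = re.compile(
    r'<div[^>]+>.*?'
    r'<h2\sid="id-date">(?P<date>\d{2}\.\d{2}\.\d{4})</h2>.*?'
    r'<h3\sid="nr">(?P<number>\d+)</h3>.*?</div>',
    re.DOTALL,
)

records = []
for m in pattern.finditer(html):
    # Assumed format: MM.DD.YYYY (an explicit format avoids locale surprises).
    parsed_date = datetime.strptime(m.group("date"), "%m.%d.%Y").date()
    records.append((parsed_date, int(m.group("number"))))

print(records)
```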
Regex to get string between html tags: stop selection at the first match of closing tag
I found the solution.
<ul[^>]*id(\s)*=('|")navi_list(\s)*('|")>((.|\n|\r|(\n\r))*?)</ul>
The following regex will select all HTML between a tag and its closing counterpart (replace TAG with the tag name).
<TAG\b[^>]*>((.|\n|\r|(\n\r))*?)</TAG>
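A possible way to parameterize that pattern, sketched in Python. The helper name `inner_html` and the sample markup are made up; `re.DOTALL` stands in for the `(.|\n|\r|(\n\r))` alternation, and the non-greedy quantifier is what stops the selection at the first closing tag.

```python
import re


def inner_html(tag, html):
    """Return the content of every <tag ...>...</tag> pair (illustrative helper).

    Stops at the first closing tag, so it does not handle a tag nested in itself.
    """
    name = re.escape(tag)
    pattern = re.compile(
        rf"<{name}\b[^>]*>(.*?)</{name}>",
        re.DOTALL | re.IGNORECASE,
    )
    return pattern.findall(html)


sample = "<ul id='navi_list'>\n<li>one</li>\n</ul><ul><li>two</li></ul>"
lists = inner_html("ul", sample)
print(lists)
```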
Extract string between html tags
Try m.Groups[1].Value (documentation for Groups), or m.Result("$1") (documentation for Result); either should work.
The object m, which was returned by Regex.Match, contains various pieces of information about what was matched. This includes both the entire string that was matched (in this case including the title tags themselves) and the parts of the string matched by each group of parentheses. m.Value gives the entire string; m.Groups[1].Value gives the part matched by the first group, m.Groups[2].Value gives the part matched by the second group, etc. This has to be done outside the regular expression because a program might want more than one group; for instance, if you're matching a time of day, like (\d+):(\d+), then you might want to assign the hours (m.Groups[1].Value) to one variable and the minutes (m.Groups[2].Value) to a different variable.
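The time-of-day example translates almost directly to other regex APIs. Here is a small sketch in Python, where `group(0)` corresponds to m.Value and `group(1)`, `group(2)` to the numbered groups; the input sentence is invented.

```python
import re

m = re.search(r"(\d+):(\d+)", "Meeting at 14:35 today")

whole = m.group(0)         # the entire matched text
hours = int(m.group(1))    # first parenthesized group
minutes = int(m.group(2))  # second parenthesized group

print(whole, hours, minutes)
```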