Regex to find words between two tags
You can use BeautifulSoup for this HTML parsing.
from bs4 import BeautifulSoup
text = """<person>John</person>went to<location>London</location>"""
soup = BeautifulSoup(text, "html.parser")
print(soup.find_all("person")[0].get_text())
print(soup.find_all("location")[0].get_text())
Also, it's not good practice to use str as a variable name in Python, because it shadows the built-in str() type.
By the way, the regex can be:
import re
print(re.findall(r"<person>(.*?)</person>", text, re.DOTALL))
print(re.findall(r"<location>(.*?)</location>", text, re.DOTALL))
How can I use regex in python to find words between tags?
You may use re.findall in dot all mode:
import re
var2 = """<tag> variable
number
two </tag>"""
matches = re.findall(r'<tag>(.+?)</tag>', var2, flags=re.DOTALL)
print(matches)
This prints:
[' variable\nnumber\ntwo ']
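To see why the flag matters, here is a small sketch that runs the same pattern with and without re.DOTALL. The sample string mirrors the one above; the names without_flag and with_flag are only illustrative.

```python
import re

var2 = "<tag> variable\nnumber\ntwo </tag>"
pattern = r"<tag>(.+?)</tag>"

# Without DOTALL, "." stops at newlines, so a multi-line body never matches.
without_flag = re.findall(pattern, var2)

# With DOTALL, "." also matches "\n", so the whole body is captured.
with_flag = re.findall(pattern, var2, flags=re.DOTALL)

print(without_flag)
print(with_flag)
```

If the content between the tags always sits on one line, the flag makes no difference.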
By the way, if you expect things like nested tags or other nested content, you should consider learning how to work with Python's Beautiful Soup library, which is more geared towards parsing HTML than regular expressions. That is, in general you should not use regex to parse HTML.
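As a rough illustration of the nesting problem, the sketch below runs the lazy regex on a nested snippet and compares it with a small parser. It uses the standard library's html.parser as a stand-in for Beautiful Soup so that it runs without extra installs; the DivText class and the sample markup are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

html = "<div>outer <div>inner</div> tail</div>"

# The lazy regex stops at the first closing tag, cutting the outer element short.
regex_result = re.findall(r"<div>(.*?)</div>", html, flags=re.DOTALL)


class DivText(HTMLParser):
    """Collects all text that appears inside at least one <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div> elements are currently open
        self.chunks = []    # text pieces seen while depth > 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


parser = DivText()
parser.feed(html)
parser.close()
parser_result = "".join(parser.chunks)

print(regex_result)   # the regex returns a truncated fragment with a stray tag
print(parser_result)  # the parser returns the full text of the outer element
```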
Regex select all text between tags
You can use "<pre>(.*?)</pre>" (replacing pre with whatever tag name you want) and extract the first group (for more specific instructions, specify a language), but this assumes that you have very simple and valid HTML.
As other commenters have suggested, if you're doing something complex, use an HTML parser.
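Since no language was specified, here is one possible sketch in Python. The helper name text_between is hypothetical, and allowing attributes on the opening tag via [^>]* is an addition that goes beyond the pattern above. It still assumes simple, valid, non-nested HTML.

```python
import re


def text_between(tag, html):
    """Return the contents of every <tag>...</tag> pair (hypothetical helper)."""
    name = re.escape(tag)
    # [^>]* tolerates attributes on the opening tag; (.*?) is lazy so each
    # match ends at the nearest closing tag.
    pattern = rf"<{name}\b[^>]*>(.*?)</{name}>"
    return re.findall(pattern, html, flags=re.DOTALL | re.IGNORECASE)


sample = "<pre>first</pre> text <pre class='x'>second\nblock</pre>"
result = text_between("pre", sample)
print(result)
```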
Regex that matches a specific word between tags
If the word Task can occur multiple times inside the tag, you can use a lookbehind:
(?<=<t[12]:ExecuteTask[^><]+)\bTask
[^><]+ is a negated class that matches one or more characters that are not > or <, for staying inside the tag. \b matches a word boundary.
See demo at regexstorm
The lookbehind is a zero-width assertion that looks behind, right before the word Task, to check whether it is preceded by one or more characters that are not > or <, which are in turn preceded by <t[12]:ExecuteTask.
If you expect the word to occur only once inside the tag, it's more efficient to use a capturing group
(<t[12]:ExecuteTask[^><]+?)\bTask
and replace with $1TaskID (where $1 refers to what's captured by the first group).
See another demo at regexstorm
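The patterns above target .NET. If you need the same thing in Python, note that the built-in re module rejects variable-length lookbehinds, so the capturing-group variant is the one that carries over. A minimal sketch, assuming a made-up input string (the attribute names are invented for illustration):

```python
import re

# Hypothetical input; the attribute names are made up for illustration.
xml = 'Task <t1:ExecuteTask Name="a" Task="5"/> Task'

# Group 1 keeps everything from the tag start up to the word "Task";
# \g<1> puts it back, so only the word itself is replaced.
pattern = r'(<t[12]:ExecuteTask[^><]+?)\bTask'
result = re.sub(pattern, r'\g<1>TaskID', xml)

print(result)
```

The occurrences of Task outside the tag are left alone because the match must start at <t1:ExecuteTask or <t2:ExecuteTask.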
Regex: match multiple times between tags
You may try this:
def(?=[^<>]*?<\/)
Explanation:
def matches def
(?=[^<>]*?<\/) is a positive lookahead that looks for a </ (i.e. an end tag) with no < or > before it; that is the job of [^<>]*?
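Here is a quick way to try the lookahead in Python. The sample text is an assumption; it puts def both inside and outside an element to show which occurrences are picked up.

```python
import re

# Hypothetical sample: "def" appears outside the element (twice) and inside it (twice).
text = "def <a>abc def def</a> def"

# The lookahead only succeeds when a closing tag follows with no other
# "<" or ">" in between, i.e. when "def" sits in the text of an element.
pattern = r"def(?=[^<>]*?</)"
matches = re.findall(pattern, text)
result = re.sub(pattern, "XYZ", text)

print(matches)
print(result)
```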
Find exact word between two HTML tags
Assuming no nesting of tags, here are three options depending on your regex flavor.
Option 1: Capture Group (works everywhere)
<span[^>]*>(?:(?!</span).)*( and )[^<>]*</span>
The match is in Group 1
Option 2: \K in Perl, PCRE (PHP, R...), Ruby 2+
<span[^>]*>(?:(?!</span).)*\K and (?=[^<>]*</span>)
Option 3: Infinite Lookbehind (.NET, regex module for Python)
(?<=<span[^>]*>(?:(?!</span).)*) and (?=[^<>]*</span>)
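For Python's built-in re module, Option 1 is the one that applies directly, since re supports neither \K nor infinite lookbehind. A minimal sketch with made-up sample strings:

```python
import re

pattern = re.compile(r"<span[^>]*>(?:(?!</span).)*( and )[^<>]*</span>")

# Hypothetical inputs: " and " occurs inside the span in one, only outside in the other.
inside = 'salt and pepper <span class="x">bread and butter</span>'
outside = "salt and pepper <span>bread</span>"

m = pattern.search(inside)
if m:
    # Group 1 is the word itself; start(1)/end(1) give its position in the string.
    print(m.group(1), m.start(1), m.end(1))

no_match = pattern.search(outside)
print(no_match)
```

Because the part before the group is greedy, this finds the last " and " inside the span when there are several.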
Find text between two tags
Use split on the string, like this:
line.trim.split("<hr>").dropWhile(_.isEmpty).take(1)
Array("this is find1 the line 1 ")
Update: to find the partition that contains a string, consider this:
line.split("<hr>").find( _.contains("find1"))
Some(this is find1 the line 1 )
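The snippets above are Scala. For comparison, a roughly equivalent sketch in Python follows; the input line is an assumption, since the original data is not shown here, and the list comprehension drops all empty pieces rather than only the leading ones.

```python
# Hypothetical input line; the original question's data is not shown here.
line = "<hr>this is find1 the line 1 <hr>this is find2 the line 2 <hr>"

# Split on the separator and drop empty pieces.
parts = [p for p in line.strip().split("<hr>") if p]
first = parts[0] if parts else None

# First piece containing the search string, or None if there is none.
hit = next((p for p in line.split("<hr>") if "find1" in p), None)

print(first)
print(hit)
```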
Regex that extracts text between tags, but not the tags
You can use the following regex:
>([^<]*)<
or, >[^<]*<
With the second form, strip the unwanted '<' and '>' characters from each match afterwards.
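A short sketch of both variants in Python, using a made-up sample string. With the capturing group, re.findall returns only the group, so no cleanup is needed; without it, the angle brackets have to be removed by hand.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample

# Variant 1: the group captures only the text between ">" and "<".
captured = re.findall(r">([^<]*)<", html)

# Variant 2: the whole match includes the delimiters, so slice them off.
whole = re.findall(r">[^<]*<", html)
stripped = [m[1:-1] for m in whole]

print(captured)
print(stripped)
```

Note that this also returns text sitting between two elements (here " and "), and adjacent tags produce empty strings that you may want to filter out.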