Regex to Find Words Between Two Tags

Regex to find words between two tags

You can use BeautifulSoup for this HTML parsing.

from bs4 import BeautifulSoup

text = """<person>John</person>went to<location>London</location>"""
soup = BeautifulSoup(text, "html.parser")
# get_text() returns the text inside the first matching tag
print(soup.find_all("person")[0].get_text())
print(soup.find_all("location")[0].get_text())

Also, it's not good practice to use str as a variable name in Python, because it shadows the built-in str type.

By the way, the regex can be:

import re
print(re.findall("<person>(.*?)</person>", text, re.DOTALL))
print(re.findall("<location>(.*?)</location>", text, re.DOTALL))

How can I use regex in python to find words between tags?

You may use re.findall in DOTALL mode, where . also matches newlines:

import re

var2 = """<tag> variable
number
two </tag>"""

matches = re.findall(r'<tag>(.+?)</tag>', var2, flags=re.DOTALL)
print(matches)

This prints:

[' variable\nnumber\ntwo ']
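
To see why the flag matters, here is a small comparison that reuses var2 from the snippet above. It is only a sketch: by default . does not match a newline, so the same pattern finds nothing in multi-line content. The names without_flag and with_flag are made up for illustration.

```python
import re

var2 = """<tag> variable
number
two </tag>"""

pattern = r'<tag>(.+?)</tag>'

# Without re.DOTALL, '.' stops at newlines, so the multi-line body cannot match.
without_flag = re.findall(pattern, var2)

# With re.DOTALL, '.' also matches '\n', so the whole body is captured.
with_flag = re.findall(pattern, var2, flags=re.DOTALL)

print(without_flag)
print(with_flag)
```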

By the way, if you expect things like nested tags or other nested content, you should consider learning how to work with Python's Beautiful Soup library, which is more geared towards parsing HTML than regular expressions. That is, in general you should not use regex to parse HTML.
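
A rough illustration of the nesting problem, assuming a made-up input with one div inside another. The lazy regex stops at the first closing tag, while a parser can track depth. To keep the sketch dependency-free it uses the standard library's html.parser as a stand-in for Beautiful Soup; the class name DivText is hypothetical.

```python
import re
from html.parser import HTMLParser

html = "<div>outer <div>inner</div> tail</div>"

# The lazy pattern stops at the first closing tag, so the nested element is cut in half.
regex_result = re.findall(r"<div>(.*?)</div>", html, flags=re.DOTALL)


class DivText(HTMLParser):
    """Collects the text found inside <div> elements, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that sits inside at least one <div>.
        if self.depth > 0:
            self.chunks.append(data)


parser = DivText()
parser.feed(html)
parser.close()
parser_result = "".join(parser.chunks)

print(regex_result)
print(parser_result)
```

The regex result contains a stray opening tag and loses the trailing text, whereas the parser recovers all of the text inside the outer div.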

Regex select all text between tags

You can use "<pre>(.*?)</pre>" (replacing pre with whatever tag name you want) and extract the first group. For more specific instructions, specify a language. Note that this only works if your HTML is very simple and valid: no attributes on the tag, no nesting, and no line breaks inside it unless your engine's dot-matches-newline option is on.

As other commenters have suggested, if you're doing something complex, use an HTML parser.
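
Since no language was specified, here is one possible Python version of the pattern above, wrapped in a small helper. The function name text_between and the sample string are made up for illustration, and the same caveats apply (no attributes, no nesting).

```python
import re


def text_between(tag, text):
    """Return the contents of every <tag>...</tag> pair in text (no nesting, no attributes)."""
    name = re.escape(tag)  # guard against regex metacharacters in the tag name
    pattern = rf"<{name}>(.*?)</{name}>"
    return re.findall(pattern, text, flags=re.DOTALL)


sample = "<pre>first</pre> skipped <pre>second\nblock</pre>"
print(text_between("pre", sample))
```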

Regex that matches a specific word between tags

If the word Task can occur multiple times inside the tag, you can use a lookbehind (this needs an engine that supports variable-length lookbehind, such as .NET):

(?<=<t[12]:ExecuteTask[^><]+)\bTask
  • [^><]+ is a negated class that matches one or more characters other than > and <, so the match stays inside the tag.
  • \b matches a word boundary.

See demo at regexstorm

The lookbehind is a zero-width assertion: it checks that the text immediately before the word Task is one or more characters that are not > or <, preceded by <t[12]:ExecuteTask.


If you expect the word to occur only once inside the tag, it's more efficient to use a capturing group

(<t[12]:ExecuteTask[^><]+?)\bTask

and replace with $1TaskID (where $1 refers to the text captured by the first group).

See another demo at regexstorm
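
For Python users, the capturing-group variant carries over directly, whereas the built-in re module rejects the variable-length lookbehind. A minimal sketch, assuming a made-up input string (the attribute names are not from the original question):

```python
import re

# Hypothetical input; the attribute names are invented for illustration.
xml = '<t1:ExecuteTask Task="42"> Task outside <t2:ExecuteTask Name="n" Task="7">'

pattern = r'(<t[12]:ExecuteTask[^><]+?)\bTask'

# \g<1> is Python's spelling of $1: it re-inserts what group 1 captured.
result = re.sub(pattern, r'\g<1>TaskID', xml)
print(result)
```

Only the first Task inside each opening tag is renamed; the word Task outside the tags is left alone.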

Regex: match multiple times between tags

You may try this:

def(?=[^<>]*?<\/)

Explanation:

  1. def matches the literal text def
  2. (?=[^<>]*?<\/) is a positive lookahead that requires a </ (the start
    of an end tag) to follow, with no < or > before it, which is what [^<>]*? enforces
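
A quick way to check this in Python, as a sketch with a made-up sample string: only the occurrences of def that are followed by a closing tag, with no other tag in between, are replaced.

```python
import re

# Hypothetical sample: two "def" inside <b>...</b>, two outside.
s = "def <b>abc def ghi def</b> def"

pattern = r"def(?=[^<>]*?</)"

# Mark every matched occurrence so the effect is visible.
marked = re.sub(pattern, "DEF", s)
starts = [m.start() for m in re.finditer(pattern, s)]

print(marked)
print(starts)
```

Note that the lookahead only checks that a </ follows; it does not verify which tag is being closed.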

Find exact word between two HTML tags

Assuming no nesting of tags, here are three options depending on your regex flavor.

Option 1: Capture Group (works everywhere)

<span[^>]*>(?:(?!</span).)*( and )[^<>]*</span>

The match is in Group 1

Option 2: \K in Perl, PCRE (PHP, R...), Ruby 2+

<span[^>]*>(?:(?!</span).)*\K and (?=[^<>]*</span>)

Option 3: Infinite Lookbehind (.NET, regex module for Python)

(?<=<span[^>]*>(?:(?!</span).)*) and (?=[^<>]*</span>)
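
Option 1 works with Python's built-in re module as is. A small sketch, assuming a made-up sample string in which " and " appears both inside and outside a span:

```python
import re

# Hypothetical sample; only the " and " inside the span should be reported.
s = '<span class="a">salt and pepper</span> and more'

pattern = re.compile(r'<span[^>]*>(?:(?!</span).)*( and )[^<>]*</span>')

m = pattern.search(s)
if m:
    # Group 1 holds the word itself; span(1) gives its position in the input.
    start, end = m.span(1)
    print(repr(s[start:end]), "found at index", start)
```

Because the overall match consumes the whole span, this reports at most one occurrence per span element.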

Find text between two tags

Use split on the string, like this:

line.trim.split("<hr>").dropWhile(_.isEmpty).take(1)
// Array("this is find1 the line 1 ")

Update: to find the partition that contains a given string, consider this:

line.split("<hr>").find(_.contains("find1"))
// Some(this is find1 the line 1 )
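
The same split-based idea in Python, as a rough equivalent of the Scala above. The input line is hypothetical, guessed from the outputs shown, and the names first and hit are made up for illustration.

```python
# Hypothetical input, guessed from the outputs shown above.
line = "<hr>this is find1 the line 1 <hr>this is find2 the line 2 <hr>"

parts = line.strip().split("<hr>")

# First non-empty chunk, like dropWhile(_.isEmpty).take(1).
first = next((p for p in parts if p), None)

# First chunk containing a given substring, like find(_.contains(...)).
hit = next((p for p in parts if "find2" in p), None)

print(repr(first))
print(repr(hit))
```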

Regex that extracts text between tags, but not the tags

You can use the following regex, which captures the text in group 1:

>([^<]*)<

Alternatively, use >[^<]*< without a group, then strip the leading '>' and trailing '<' from each match.
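
Both variants can be tried in Python. This is a sketch with a made-up sample string: the first pattern returns the captured text directly, the second returns the delimiters too, so they are sliced off.

```python
import re

# Hypothetical sample with text inside and between elements.
s = "<b>bold</b> and <i>italic</i>"

# Variant 1: the capture group already excludes '>' and '<'.
captured = re.findall(r">([^<]*)<", s)

# Variant 2: no group, so drop the first and last character of each match.
stripped = [m[1:-1] for m in re.findall(r">[^<]*<", s)]

print(captured)
print(stripped)
```

Note that this also picks up text sitting between two separate elements (here " and "), since the pattern only looks at the nearest > and <.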


