Regular Expression For Extracting Tag Attributes

Update 2021: Radon8472 proposes the following regex in the comments: https://regex101.com/r/tOF6eA/1 (note that regex101.com did not exist when I originally wrote this answer)

<a[^>]*?href=(["\'])?((?:.(?!\1|>))*.?)\1?

Update 2021 bis: Dave proposes in the comments a variant that handles an attribute value containing an equals sign, such as <img src="test.png?test=val" />, as in his regex101 example:

(\w+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?

Update (2020): Gyum Fox proposes https://regex101.com/r/U9Yqqg/2 (again, note that regex101.com did not exist when I originally wrote this answer)

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?

Applied to:

<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">
<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>
<img src="test.png">
<img src="a test.png">
<img src=test.png />
<img src=a test.png />
<img src=test.png >
<img src=a test.png >
<img src=test.png alt=crap >
<img src=a test.png alt=crap >
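
As a rough illustration, the regex from the "2021 bis" update can be run over a few of the inputs above. The sketch below uses JavaScript, and extractAttributes is a hypothetical helper name that does not appear in the original answer. It assumes JavaScript's regex engine treats this pattern the same way as the flavour used on regex101, which should hold here because the pattern relies only on a plain lookahead.

```javascript
// Regex from the "2021 bis" update, with the global flag added so that
// every attribute in the tag is visited in turn.
const attrRegex = /(\w+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?/g;

// Hypothetical helper: returns an object mapping attribute names to values.
function extractAttributes(tag) {
  const attributes = {};
  for (const match of tag.matchAll(attrRegex)) {
    attributes[match[1]] = match[2]; // group 1 = name, group 2 = value
  }
  return attributes;
}

console.log(extractAttributes('<a href="test.html" class="xyz">'));
console.log(extractAttributes('<img src=a test.png alt=crap >'));
console.log(extractAttributes('<img src="test.png?test=val" />'));
```

Because the pattern requires an equals sign after the name, attributes without a value (such as defer or async) are skipped by this version.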

Original answer (2008):
If you have an element like

<name attribute=value attribute="value" attribute='value'>

this regex can be used to find each attribute name and value in turn:

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

Applied to:

<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">

it would yield:

'href' => 'test.html'
'class' => 'xyz'

Note: This does not work with single-character attribute values, because the value group needs at least two characters; e.g. <div id="1"> is not captured correctly.

Edited: Improved regex that also handles attributes with no value and values with a quote character (') inside:

([^\r\n\t\f\v= '"]+)(?:=(["'])?((?:.(?!\2?\s+(?:\S+)=|\2))+.)\2?)?

Applied to:

<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>

it would yield:

'type' => 'text/javascript'
'defer' => ''
'async' => ''
'id' => 'something'
'onload' => 'alert(\'hello\');'
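
The improved pattern can be tried out with a short sketch. This one is in JavaScript, and parseAttributeString is a hypothetical helper name. Two assumptions are baked in: the regex is applied to the attribute portion of the tag only (run against the whole tag, the name group would also pick up fragments such as <script), and only quoted values are exercised, because a backreference to an unset group matches the empty string in JavaScript but fails in PCRE, so unquoted values behave differently between the two engines.

```javascript
// Improved regex from the edit above, with the global flag added.
const improvedAttrRegex =
  /([^\r\n\t\f\v= '"]+)(?:=(["'])?((?:.(?!\2?\s+(?:\S+)=|\2))+.)\2?)?/g;

// Hypothetical helper: maps each attribute name to its value,
// using '' for attributes that have no "=value" part.
function parseAttributeString(attributeString) {
  const attributes = {};
  for (const match of attributeString.matchAll(improvedAttrRegex)) {
    // group 1 = name, group 3 = value (undefined when no value is present)
    attributes[match[1]] = match[3] === undefined ? '' : match[3];
  }
  return attributes;
}

const attrs = parseAttributeString(
  'type="text/javascript" defer async id="something" onload="alert(\'hello\');"'
);
console.log(attrs);
```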

Use regular expression to extract attribute value for custom tag

If the tag you're looking for is always going to be QUOTE, then perhaps something a little simpler is possible:

$s = '[QUOTE="name: Max-Fischer, post: 486662533, member: 123"]I don\'t so much dance as rhythmically convulse.[/QUOTE]';

$r = '/\[QUOTE="(.*?)"\](.*)\[\/QUOTE\]/';

$m = array();
$arr = array();
preg_match($r, $s, $m);
// $m[0] = the full matched text
// $m[1] = the string of attributes
// $m[2] = the quote itself
foreach (explode(',', $m[1]) as $valuepair) { // split the attributes on the comma
    preg_match('/\s*(.*): (.*)/', $valuepair, $mm);
    // $mm[0] = the attribute pairing
    // $mm[1] = the attribute name
    // $mm[2] = the attribute value
    $arr[$mm[1]] = $mm[2];
}
print_r($arr);
print $m[2] . "\n";

This gives the following output:

Array
(
    [name] => Max-Fischer
    [post] => 486662533
    [member] => 123
)
I don't so much dance as rhythmically convulse.

To handle the case where there is more than one quote in the string, make the regex slightly less greedy and use preg_match_all instead of preg_match:

$s  = '[QUOTE="name: Max-Fischer, post: 486662533, member: 123"]I don\'t so much dance as rhythmically convulse.[/QUOTE]';
$s .= '[QUOTE="name: Some-Guy, post: 486562533, member: 1234"]Quidquid latine dictum sit, altum videtur[/QUOTE]';

$r = '/\[QUOTE="(.*?)"\](.*?)\[\/QUOTE\]/';
// the second group is now (.*?): the added ? makes it less greedy
$m = array();
$arr = array();
preg_match_all($r, $s, $m, PREG_SET_ORDER);
// $m[0] = the first quote
// $m[1] = the second quote
// $m[0][0] = the full text of the first match
// $m[0][1] = its string of attributes
// $m[0][2] = the quote itself
// $m holds one element for each quote found in the string
foreach ($m as $match) { // there is more than one quote, so loop and handle each individually
    $quote = array();
    foreach (explode(',', $match[1]) as $valuepair) { // split the attributes on the comma
        preg_match('/\s*(.*): (.*)/', $valuepair, $mm);
        // $mm[0] = the attribute pairing
        // $mm[1] = the attribute name
        // $mm[2] = the attribute value
        $quote[$mm[1]] = $mm[2];
    }
    $arr[] = $quote; // build a parent array holding each individual quote
}
print_r($arr);

This gives output like:

Array
(
    [0] => Array
        (
            [name] => Max-Fischer
            [post] => 486662533
            [member] => 123
        )

    [1] => Array
        (
            [name] => Some-Guy
            [post] => 486562533
            [member] => 1234
        )

)
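
To make the effect of the lazy quantifier concrete, here is a small sketch in JavaScript (the answer's own code is PHP; the pattern syntax is the same for this case). The input string and the helper name quoteBodies are made up for illustration.

```javascript
const input =
  '[QUOTE="name: A"]first[/QUOTE]' +
  '[QUOTE="name: B"]second[/QUOTE]';

// Greedy body: (.*) runs to the LAST [/QUOTE] in the string.
const greedy = /\[QUOTE="(.*?)"\](.*)\[\/QUOTE\]/g;
// Lazy body: (.*?) stops at the FIRST [/QUOTE] after each opening tag.
const lazy = /\[QUOTE="(.*?)"\](.*?)\[\/QUOTE\]/g;

// Hypothetical helper: collects the quote body (group 2) of every match.
function quoteBodies(text, regex) {
  return Array.from(text.matchAll(regex), (match) => match[2]);
}

console.log(quoteBodies(input, greedy)); // a single match spanning both quotes
console.log(quoteBodies(input, lazy));   // [ 'first', 'second' ]
```

With the greedy pattern the whole string is consumed as one quote, which is why the second version of the regex is needed before switching to preg_match_all.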

How to get html tag attribute values using JavaScript Regular Expressions?

You were so close! All that needs to be done now is a simple loop:

var htmlString = '<meta http-equiv="Set-Cookie" content="COOKIE1_VALUE_HERE">\n' +
    '<meta http-equiv="Set-Cookie" content="COOKIE2_VALUE_HERE">\n' +
    '<meta http-equiv="Set-Cookie" content="COOKIE3_VALUE_HERE">\n';

// (.*?) is lazy, so the capture stops at the first closing quote
var setCookieMetaRegExp = /<meta http-equiv=["']?set-cookie["']? content=["'](.*?)["'].*>/ig;

var matches = [];
var match;
// with the g flag, exec() resumes after the previous match and returns null when done
while ((match = setCookieMetaRegExp.exec(htmlString)) !== null) {
    matches.push(match[1]); // first capture group: the cookie value
}

// contains all cookie values
console.log(matches);

JSBIN: http://jsbin.com/OpepUjeW/1/edit?js,console

regex to extract HTML attribute value

In JMeter, use the Regular Expression Extractor to achieve this.

Reference Name: mynum
Regular Expression: value="(.+?)"
Template: $1$
Match No.: 1

If you specify a Match No., the rules are as follows:

0 = Random Match
1 = First Match
2 = Second Match
etc....

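You can then access the match through the corresponding variable, ${mynum}. Indexed variables such as ${mynum_1} are only set when Match No. is negative, which extracts all matches.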
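
The Match No. rules can be pictured with a short sketch. This is plain JavaScript, not JMeter code: pickMatch is a hypothetical helper that only mimics how a match number selects one of the values found by value="(.+?)", and it does not cover negative match numbers.

```javascript
// Hypothetical helper, not part of JMeter: selects a captured value
// the way the Match No. field does for 0 and positive numbers.
function pickMatch(text, matchNo) {
  // Collect group 1 of every match of value="(.+?)"
  const values = Array.from(text.matchAll(/value="(.+?)"/g), (m) => m[1]);
  if (values.length === 0) {
    return undefined; // nothing matched
  }
  if (matchNo === 0) {
    // 0 = random match
    return values[Math.floor(Math.random() * values.length)];
  }
  // 1 = first match, 2 = second match, and so on
  return values[matchNo - 1];
}

const page = '<input name="a" value="10"><input name="b" value="20">';
console.log(pickMatch(page, 1)); // 10
console.log(pickMatch(page, 2)); // 20
```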

Extract HTML attributes in PHP with regex

HTML is not a regular language and cannot be correctly parsed with a regex. Use a DOM parser instead. Here's a solution using PHP's built-in DOMDocument class:

$string = '<ul id="value" name="Bob" custom-tag="customData">';

$dom = new DOMDocument();
$dom->loadHTML($string);

$result = array();

$ul = $dom->getElementsByTagName('ul')->item(0);
if ($ul->hasAttributes()) {
    foreach ($ul->attributes as $attr) {
        $name = $attr->nodeName;
        $value = $attr->nodeValue;
        $result[$name] = $value;
    }
}

print_r($result);

Output:

Array
(
    [id] => value
    [name] => Bob
    [custom-tag] => customData
)

Regex to extract attribute value

This C# regex will find all title values:

(?<=\btitle=")[^"]*

In C#, the first match can be retrieved like this:

Regex regex = new Regex(@"(?<=\btitle="")[^""]*");
Match match = regex.Match(input);
string title = match.Value;

The regex uses positive lookbehind to find the position where the title value starts. It then matches everything up to the ending double quote.
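
The C# snippet above returns only the first match. As a sketch of collecting every title value with the same lookbehind pattern, here is a JavaScript version, assuming an engine with lookbehind support such as a current Node.js; findTitles is a hypothetical helper name.

```javascript
// Same pattern as above, with the global flag so all matches are visited.
const titleRegex = /(?<=\btitle=")[^"]*/g;

// Hypothetical helper: returns every title value found in the markup.
function findTitles(html) {
  return Array.from(html.matchAll(titleRegex), (match) => match[0]);
}

console.log(findTitles('<a title="First" href="#">x</a> <img title="Second" alt="">'));
// [ 'First', 'Second' ]
```

The \b word boundary keeps an attribute such as subtitle="..." from being picked up, since there is no boundary between the "b" and the "t".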


