Get JS var value in HTML source using BeautifulSoup in Python
The simplest approach is to use a regular expression pattern to both locate the element via BeautifulSoup
and extract the desired substring:
import re
from bs4 import BeautifulSoup
data = """
<script>
[other code]
var my = 'hello';
var name = 'hi';
var is = 'halo';
[other code]
</script>
"""
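soup = BeautifulSoup(data, "html.parser")
# Capture whatever sits between the quotes in: var my = '...';
pattern = re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL)
# string= is the current name for the older text= argument (bs4 4.4.0+).
script = soup.find("script", string=pattern)
print(pattern.search(script.text).group(1))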
Prints hello.
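Once the script text has been located, the same idea extends to collecting several simple string assignments in one pass. This is only a minimal sketch: script_text is a hypothetical stand-in for the script's text from the example, and it assumes each assignment sits on its own line and uses single quotes.

```python
import re

# Hypothetical stand-in for the text of the <script> tag in the example.
script_text = """
var my = 'hello';
var name = 'hi';
var is = 'halo';
"""

# Assumes one single-quoted assignment per line.
assignment = re.compile(r"^\s*var\s+(\w+)\s*=\s*'(.*?)';\s*$", re.MULTILINE)

# findall() returns (name, value) tuples, which dict() turns into a mapping.
js_vars = dict(assignment.findall(script_text))
print(js_vars["my"])
```

Values that contain escaped quotes or span several lines would need a more careful pattern than this one.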
beautifulsoup get URL from javascript variable
It's possible that the whitespace is the problem, and the // isn't needed. This may fix it (sorry, I don't have Python available right now to try it):
p = re.compile(r'var\s+abc_url\s+=\s+(.*);')
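A quick way to check how that whitespace-tolerant pattern behaves is sketched below. Only the name abc_url comes from the question; the sample script text and URL are made up for illustration, and stripping the quotes assumes the value was written as a quoted string.

```python
import re

# Hypothetical script contents; only the name abc_url comes from the question.
script_text = 'var   abc_url =  "https://example.com/page";'

# \s+ accepts any run of whitespace around the name and the equals sign.
p = re.compile(r'var\s+abc_url\s+=\s+(.*);')

m = p.search(script_text)
if m:
    # The capture includes the surrounding quotes, so strip them off.
    url = m.group(1).strip('\'"')
    print(url)
```

Note that \s+ requires at least one whitespace character on each side of the equals sign; \s* would also accept a compact form such as var abc_url="...".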
How to get data from <script> with var using beautifulsoup?
You have several issues with your approach still:
To pass a string to json.loads(), it needs to be valid JSON; otherwise, you'll get exceptions. For what you're attempting to capture, you need to include the leading { token as part of your capture group. Consolidate your two separate patterns as such: var thumbdata = ({\n.*?);
You'll notice that even with that change to grab the leading curly brace, the string you've extracted still isn't valid JSON. Unlike plain-old JavaScript object literals, JSON requires every key name to be enclosed in double quotes, and the text you're extracting does not do this. As such, you'll need to swap out the built-in JSON parser (which is strictly spec-compliant and will not parse this data as-is) for something like hjson, which doesn't impose this restriction.
re.match() doesn't behave as you seem to think it does. A dive into the documentation for this method is illuminating in this specific circumstance (emphasis mine): "Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line." This is important as the string data in script9 does not begin with any data that would be considered "matching" per your pattern. Instead, swap the invocation of re.match() for re.search().
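Both points can be checked with the standard library alone. The sketch below uses a shortened, hypothetical stand-in for script9 and shows that match() finds nothing where search() succeeds, and that the captured text is rejected by json.loads().

```python
import json
import re

# Shortened, hypothetical stand-in for script9.
script9 = ''' sometext;
var thumbdata = {
thumbs: [{username: "IslandGirlSearching", age: "21"}] };
'''

pattern = re.compile(r"var thumbdata = ({\n.*?);", re.MULTILINE)

# match() only tries position 0, where the text is " sometext;".
matched = pattern.match(script9)
# search() scans forward until the pattern fits.
found = pattern.search(script9)

print(matched)
print(found.group(1))

# The captured object literal has unquoted keys, so strict JSON rejects it.
try:
    json.loads(found.group(1))
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False
print(is_valid_json)
```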
Making a few more adjustments for the changes described above, your code would look something more like the following:
import re
import hjson
script9 = ''' sometext;
var thumbdata = {
thumbs: [{avatar: "/i/nophoto.jpg", username: "IslandGirlSearching",la:"0 second ",chatid: "0",userid: "2088789", age:"21",city:"Cebu"},{avatar: "/p/2021-08/Cristina266/ava-1629535964.jpg", username: "Cristina266",la:"0 second ",chatid: "0",userid: "2095868", age:"26",city:"Pasig City"} ] };
var source = sometext;
'''
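# Capture the whole object literal, including its opening brace.
pattern = re.compile(r"var thumbdata = ({\n.*?);")
# search() scans the whole string; match() would only try the very start.
m = pattern.search(script9)
# hjson accepts the unquoted keys that json.loads() rejects.
thumbs = list(hjson.loads(m.group(1)).items())
print(thumbs)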
outputs:
[('thumbs', [OrderedDict([('avatar', '/i/nophoto.jpg'), ('username', 'IslandGirlSearching'), ('la', '0 second '), ('chatid', '0'), ('userid', '2088789'), ('age', '21'), ('city', 'Cebu')]), OrderedDict([('avatar', '/p/2021-08/Cristina266/ava-1629535964.jpg'), ('username', 'Cristina266'), ('la', '0 second '), ('chatid', '0'), ('userid', '2095868'), ('age', '26'), ('city', 'Pasig City')])])]
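If installing hjson is not an option, one rough alternative is to quote the bare keys with a regular expression and then hand the result to json.loads(). This is a swapped-in technique, not what the answer above does, and it is only a sketch: it assumes keys are plain identifiers and that no string value contains text that looks like a comma or opening brace followed by a word and a colon. The sample text is a shortened, hypothetical version of the captured object.

```python
import json
import re

# Shortened, hypothetical version of the text captured from script9.
raw = '''{
thumbs: [{avatar: "/i/nophoto.jpg", username: "IslandGirlSearching",la:"0 second ", age:"21"}] }'''

# Wrap every bare identifier that follows "{" or "," and precedes ":" in quotes.
quoted = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', raw)

thumbdata = json.loads(quoted)
print(thumbdata["thumbs"][0]["username"])
```

For anything less regular than this sample, a tolerant parser such as hjson remains the safer choice.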
How to get JavaScript variables from a script tag using Python and Beautifulsoup
If you are using Selenium, there's no need to parse the HTML to get the JS variable; just use the webdriver's execute_script() to bring it into Python:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://whatever.com/')
meta = driver.execute_script('return meta')
And that's it: meta now holds the JS variable, and it keeps its type (a JS object arrives as a dict, an array as a list).
Access to javascript variable with BeautifulSoup
You can't. BeautifulSoup is just a parser for DOM elements; it doesn't evaluate any code inside the page.
You need to "run" the page and access the variable while it's still "on", using, for example, Selenium.