Pull Variable Value from JavaScript Source Using Beautifulsoup4 Python

Get JS var value in HTML source using BeautifulSoup in Python

The simplest approach is to use a regular expression pattern to both locate the element via BeautifulSoup and extract the desired substring:

import re

from bs4 import BeautifulSoup

data = """
<script>
[other code]
var my = 'hello';
var name = 'hi';
var is = 'halo';
[other code]
</script>
"""

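soup = BeautifulSoup(data, "html.parser")

# Capture whatever sits between the quotes in the "var my = '...';" statement.
pattern = re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL)
# string=pattern selects the <script> whose text matches the pattern
# (string= replaces the older text= argument as of Beautiful Soup 4.4.0).
script = soup.find("script", string=pattern)

print(pattern.search(script.string).group(1))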

Prints hello.
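
If several variables have to be read from the same script, the pattern can be built from the variable name instead of being hard-coded. A minimal sketch, assuming the values are single-quoted string literals without escaped quotes; the helper name get_js_string_var is made up for illustration and works on the script text alone:

```python
import re


def get_js_string_var(script_text, name):
    """Return the single-quoted string assigned to `var <name>`, or None."""
    # re.escape guards against names containing regex metacharacters such as "$".
    pattern = re.compile(r"var\s+" + re.escape(name) + r"\s*=\s*'(.*?)'\s*;")
    match = pattern.search(script_text)
    return match.group(1) if match else None


script_text = """
var my = 'hello';
var name = 'hi';
var is = 'halo';
"""

print(get_js_string_var(script_text, "my"))       # hello
print(get_js_string_var(script_text, "name"))     # hi
print(get_js_string_var(script_text, "missing"))  # None
```

With BeautifulSoup, the script_text argument would be the .string of the script tag found above.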

beautifulsoup get URL from javascript variable

It's possible that the whitespace is the problem and that the // isn't needed. Something like this may work (untested, as I don't have Python at hand right now):

p = re.compile(r'var\s+abc_url\s+=\s+(.*);')
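
A variant that also tolerates no whitespace at all around the equals sign uses \s* there and a non-greedy capture. This is only a sketch: the script text and URL below are invented sample data, and a URL that itself contains a semicolon would be cut short.

```python
import re

# Hypothetical script text; the spacing around "=" is deliberately irregular.
script_text = 'var   abc_url="https://example.com/feed";\nvar other = 1;'

# \s+ after "var" requires at least one space; \s* around "=" allows none.
pattern = re.compile(r"var\s+abc_url\s*=\s*(.*?);")
match = pattern.search(script_text)

# The capture still includes the surrounding quotes, so strip them off.
url = match.group(1).strip("\"'") if match else None
print(url)  # https://example.com/feed
```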

How to get data from <script> with var using beautifulsoup?

You have several issues with your approach still:

  1. To pass a string to json.loads(), it needs to be valid JSON; otherwise, you'll get exceptions. For what you're attempting to capture, you need to include the leading { token as part of your capture group. Consolidate your two separate patterns as such:

    var thumbdata = ({\n.*?);

  2. Even with the leading curly brace included, the string you extract still isn't valid JSON. Unlike plain JavaScript object literals, JSON requires every key name to be enclosed in double quotes, and the text you are extracting does not do that. You'll therefore need to swap the built-in JSON parser (which is strictly spec-compliant and will not parse this data as-is) for something like hjson, which does not impose this restriction.

  3. re.match() doesn't behave as you seem to think it does. A dive into the documentation for this method is illuminating in this specific circumstance (emphasis mine):

    Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

    This matters because the string in script9 does not begin with text that matches your pattern. Use re.search() instead of re.match().
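
Points 2 and 3 can be checked with the standard library alone. A small sketch, using a shortened stand-in for the real script text (the variable names text, captured and parsed_ok are illustrative):

```python
import json
import re

text = "sometext;\nvar thumbdata = {\nthumbs: [] };"
pattern = re.compile(r"^var thumbdata = (\{\n.*?);", re.MULTILINE)

# match() only tries position 0 of the whole string, even with MULTILINE.
print(pattern.match(text))  # None

# search() scans forward, so "^" can match at the start of the second line.
captured = pattern.search(text).group(1)
print(repr(captured))  # '{\nthumbs: [] }'

# The capture is a JavaScript object literal, not JSON: the key is unquoted.
try:
    json.loads(captured)
    parsed_ok = True
except json.JSONDecodeError as exc:
    parsed_ok = False
    print("json.loads failed:", exc.msg)
```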

Making a few more adjustments for the changes described above, your code would look something more like the following:

import re
import hjson

script9 = ''' sometext;
var thumbdata = {
thumbs: [{avatar: "/i/nophoto.jpg", username: "IslandGirlSearching",la:"0 second ",chatid: "0",userid: "2088789", age:"21",city:"Cebu"},{avatar: "/p/2021-08/Cristina266/ava-1629535964.jpg", username: "Cristina266",la:"0 second ",chatid: "0",userid: "2095868", age:"26",city:"Pasig City"} ] };
var source = sometext;
'''

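# The capture group includes the opening "{" so the result is a complete object literal.
pattern = re.compile(r"var thumbdata = ({\n.*?);")

# search(), not match(): the statement does not sit at the very start of script9.
m = pattern.search(script9)
thumbs = list(hjson.loads(m.group(1)).items())
print(thumbs)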

outputs:

[('thumbs', [OrderedDict([('avatar', '/i/nophoto.jpg'), ('username', 'IslandGirlSearching'), ('la', '0 second '), ('chatid', '0'), ('userid', '2088789'), ('age', '21'), ('city', 'Cebu')]), OrderedDict([('avatar', '/p/2021-08/Cristina266/ava-1629535964.jpg'), ('username', 'Cristina266'), ('la', '0 second '), ('chatid', '0'), ('userid', '2095868'), ('age', '26'), ('city', 'Pasig City')])])]

How to get JavaScript variables from a script tag using Python and Beautifulsoup

If you are using Selenium, there is no need to parse the HTML to get the JS variable. Use the WebDriver's execute_script() method to bring it into Python:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://whatever.com/')
meta = driver.execute_script('return meta')

And that's it: meta now holds the value of the JS variable, converted to the corresponding Python type (a JS object becomes a dict, an array becomes a list, and so on).

Access to javascript variable with BeautifulSoup

You can't. BeautifulSoup only parses the markup into a tree of elements; it does not execute any JavaScript in the page.

To read a variable's runtime value, you need to load the page in something that actually runs its scripts, such as a browser driven by Selenium, and query it while the page is live.
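
The limitation is easy to see with any static parser. The sketch below uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages; the class name ScriptCollector and the sample markup are made up for illustration. The parser hands back the script's source text, and the value 5 is never computed.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the raw text found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script content arrives as plain text; nothing is evaluated.
        if self.in_script:
            self.scripts.append(data)


collector = ScriptCollector()
collector.feed("<script>var total = 2 + 3;</script>")
collector.close()

source = "".join(collector.scripts)
print(source)  # var total = 2 + 3;
```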
