How to Retrieve the Values of Dynamic HTML Content Using Python

Assuming you are trying to get values from a page that is rendered using JavaScript templates (for instance, something like Handlebars), then this is what you will get with any of the standard solutions (i.e. BeautifulSoup or requests).

This is because the browser uses JavaScript to alter what it received and to create new DOM elements. urllib does the requesting part like a browser, but not the template-rendering part. A good description of the issues can be found in this article, which discusses three main solutions:

  1. parse the AJAX JSON directly
  2. use an offline JavaScript interpreter to process the request (SpiderMonkey, Crowbar)
  3. use a browser-automation tool (Splinter)

That answer provides a few more suggestions for option 3, such as Selenium or Watir. I've used Selenium for automated web testing and it's pretty handy.
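For option 1, the idea is to open the browser's network tab, find the XHR request the page makes, and call that endpoint yourself; the response is usually plain JSON you can parse directly. A minimal sketch (the payload and field names below are made up for illustration):

```python
import json

# Hypothetical JSON payload of the kind an AJAX endpoint returns;
# in practice you would fetch it with requests.get(api_url).json().
payload = '{"products": [{"name": "Tritanium", "price": 4.5}]}'

data = json.loads(payload)
for product in data["products"]:
    print(product["name"], product["price"])
```

The advantage over browser automation is speed and simplicity: no rendering at all, just one HTTP request per page of data.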


EDIT

From your comments it looks like it is a Handlebars-driven site. I'd recommend Selenium and Beautiful Soup. This answer gives a good code example which may be useful:

from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
    print(tag.text)

Basically, Selenium gets the rendered HTML from your browser, and then you can parse it with BeautifulSoup via the page_source property. Good luck :)

How to extract total values in a dynamic html with python?

The data is actually loaded dynamically by JavaScript from an API call's JSON response, which is why BeautifulSoup can't grab it. A minimal working solution that hits the API using only requests is as follows:

import requests
import pandas as pd

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}

params = {
    'items_per_page': '20',
    'null': '',
    'page': '1',
    'sortBy': 'default',
    'sortDir': 'desc',
    'store_id': '0670386'
}
output = []

#url = "https://m.pcone.com.tw/store/0670386?ref=d_item_store"

api_url='https://www.pcone.com.tw/api/filterSearchTP'

for i in range(1, 14):
    params['total_pages'] = i
    resp = requests.get(api_url, headers=headers, params=params).json()
    for item in resp['products']:
        productname = item['name']
        productprice = item['msrp']
        ordercount = item['order_count']
        output.append([productname, productprice, ordercount])


df = pd.DataFrame(output, columns=['商品名稱', '價格', '購買人數'])
df.to_excel('松果-瑞昌.xlsx', index=False)

Extract data from dynamic HTML Table with Python 3

As you mentioned in your question, this table is changed dynamically by JavaScript. To get around this you actually have to render the JavaScript, using either:

  • A web driver like Selenium, which simulates the website the same way it would look to a user (by rendering the JavaScript)
  • requests-html, a relatively new module that lets you render the JavaScript on a webpage and has a lot of other nice features for web scraping

This is one way to solve your problem using requests-html:

from requests_html import HTMLSession

ISSN = "0897-4756"
address = "https://www.journalguide.com/journals/search?type=journal-name&journal-name={}".format(ISSN)

hdr = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}

ses = HTMLSession()
response = ses.get(address, headers=hdr)
response.html.render() # render the javascript to load the elements in the table
tree = response.html.lxml # no need to import lxml.html because requests-html can do this for you

print(tree.xpath('//table[@id="journal-search-results-table"]/text()'))
# >> ['\n', '\n']

print(tree.xpath('//table[@id="journal-search-results-table"]//td/text()'))
# >> ['ACS Publications', '1.905', 'No', '\n', '\n', '\n']
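Once the JavaScript has been rendered, another option is to hand the table HTML to pandas, which parses <table> elements into DataFrames. A sketch with a stand-in table (assuming pandas and one of its HTML parsers, e.g. lxml, are installed; the markup below mimics the rendered results table):

```python
from io import StringIO

import pandas as pd

# Stand-in for the rendered page source; any HTML containing a <table> works.
html = """
<table id="journal-search-results-table">
  <tr><th>Publisher</th><th>SJR</th><th>Open Access</th></tr>
  <tr><td>ACS Publications</td><td>1.905</td><td>No</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
df = pd.read_html(StringIO(html))[0]
print(df)
```

This saves you from writing XPath for each cell and gives you typed columns for free.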

How to extract dynamic html content using python

You could try something like this:

from bs4 import BeautifulSoup

html = """<section>
<div class="columns">
<div class="column">
<div class="message is-primary">
<header class="message-header">
<h4>Technical Characteristics</h4>
</header>
<div class="message-body">
<dl class="dl-horizontal">
<dt>END ITEM IDENTIFICATION</dt>
<dd>(AGAV) END ITEM 6675014301965</dd>
<dt>BODY STYLE</dt>
<dd>(AAQL) TUBE TYPE</dd>
<dt>CONTINUOUS CURRENT RATING IN AMPS</dt>
<dd>(AEBJ) 1.600</dd>
<dt>III END ITEM IDENTIFICATION</dt>
<dd>(AGAV) END ITEM 6675014301965</dd>
</dl>
</div>
</div>
</div>
</div>
</section>"""

soup = BeautifulSoup(html, 'html.parser')
dts = soup.find_all("dt")
outs = {i.string: i.find_next("dd").string for i in dts}
print(outs)
#> {'END ITEM IDENTIFICATION': '(AGAV) END ITEM 6675014301965', 'BODY STYLE': '(AAQL) TUBE TYPE', 'CONTINUOUS CURRENT RATING IN AMPS': '(AEBJ) 1.600', 'III END ITEM IDENTIFICATION': '(AGAV) END ITEM 6675014301965'}

Created on 2018-09-28 by the reprexpy package

import reprexpy
print(reprexpy.SessionInfo())
#> Session info --------------------------------------------------------------------
#> Platform: Darwin-17.7.0-x86_64-i386-64bit (64-bit)
#> Python: 3.6
#> Date: 2018-09-28
#> Packages ------------------------------------------------------------------------
#> beautifulsoup4==4.6.3
#> reprexpy==0.1.1

Construct a dynamic HTML content using Database values

The best way to approach this would be to read the data using psycopg2, and then use a templating engine (jinja2, for instance) to generate the HTML from the data returned by the database.

Update:

A sketch of a solution might look like this (untested).

import psycopg2
import jinja2
from collections import namedtuple

TEMPLATE="""
<html><head></head><body>
<ul>
{% for row in rows %}
<li>{{ row.column1 }}, {{ row.column2 }}</li>
{% endfor %}
</ul>
</body></html>
"""

env = jinja2.Environment()
template = env.from_string(TEMPLATE)

with psycopg2.connect("dbname=test user=postgres") as conn:
    with conn.cursor() as curs:
        curs.execute("SELECT column1, column2 FROM test;")
        row_tuple = namedtuple("Row", [col[0] for col in curs.description])
        print(template.render(rows=[row_tuple(*row) for row in curs.fetchall()]))
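One caveat with this approach: if the database values can contain markup, it is worth enabling jinja2's autoescaping so they cannot inject HTML into the generated page. A small illustration (assuming jinja2 is installed):

```python
import jinja2

# With autoescape=True, values interpolated into the template are HTML-escaped.
env = jinja2.Environment(autoescape=True)
template = env.from_string("<li>{{ value }}</li>")

print(template.render(value="<script>alert(1)</script>"))
# -> <li>&lt;script&gt;alert(1)&lt;/script&gt;</li>
```

Without autoescaping, a malicious column value would be emitted verbatim into the HTML.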

getting html dynamic content python3

Here is a solution using Selenium and Firefox:

  1. Open a browser window and navigate to the URL
  2. Wait until the link for practice mode appears
  3. Extract all span elements that hold part of the text
  4. Create the output string. If the first word has only one letter there will be only 2 span elements; if it has more than one letter there will be 3 span elements.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'http://play.typeracer.com/'
browser = webdriver.Firefox()
browser.get(url)

try:  # waiting till link is loaded
    element = WebDriverWait(browser, 30).until(
        EC.presence_of_element_located((By.LINK_TEXT, 'Practice')))
finally:  # link loaded -> click it
    element.click()

try:  # wait till text is loaded
    WebDriverWait(browser, 30).until(
        EC.presence_of_element_located((By.XPATH, '//span[@unselectable="on"]')))
finally:  # extract text
    spans = browser.find_elements_by_xpath('//span[@unselectable="on"]')
    if len(spans) == 2:  # first word has only one letter
        text = f'{spans[0].text} {spans[1].text}'
    elif len(spans) == 3:  # first word has more than one letter
        text = f'{spans[0].text}{spans[1].text} {spans[2].text}'
    else:
        text = ' '.join(span.text for span in spans)
        print(f'special case that is not handled yet: {text}')

print(text)
>>> 'Scissors cuts paper. Paper covers rock. Rock crushes lizard. Lizard poisons Spock. Spock smashes scissors. Scissors decapitates lizard. Lizard eats paper. Paper disproves Spock. Spock vaporizes rock. And as it always has, rock crushes scissors.'

Update

Just in case you also want to automate the typing afterwards ;)

try:
    txt_input = WebDriverWait(browser, 30).until(
        EC.presence_of_element_located((By.XPATH,
            '//input[@class="txtInput" and @autocorrect="off"]')))
finally:
    for letter in text:
        txt_input.send_keys(letter)

The reason for the try: ... finally: ... blocks is that we have to wait until the content is loaded, which can sometimes take quite a while.

How to send a value dynamically from HTML on Flask?

<td><input type="text" name="id_country" value="{{row[0]}}"></td>

You set the same name on every input in the loop, so you end up with many inputs sharing the same name, and that is why you always get the first value.

I prefer links for such things rather than a form, and handle the value as a URL parameter on the Flask side.

<a href="http://yourip/delete_country/{{row[0]}}"></a>

@app.route("/delete_country/<id>")
def delete_country(id):
    id_country = id
    return id_country
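Note that a plain <a> link issues a GET request, so the route must accept GET. As a quick sanity check, the route can be exercised without a running server using Flask's built-in test client (a sketch, assuming Flask is installed):

```python
from flask import Flask

app = Flask(__name__)

# Same route shape as in the answer above, receiving the id as a URL parameter.
@app.route("/delete_country/<id>")
def delete_country(id):
    return id

# Flask's test client simulates requests against the app in-process.
client = app.test_client()
resp = client.get("/delete_country/42")
print(resp.get_data(as_text=True))  # -> 42
```

The same pattern works for any per-row action link generated inside the template loop.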

