How to retrieve the values of dynamic html content using Python
Assuming you are trying to get values from a page that is rendered using javascript templates (for instance something like handlebars), then standard solutions such as beautifulsoup or requests will only give you the raw, unrendered HTML. This is because the browser uses javascript to alter what it received and create new DOM elements. urllib will do the requesting part like a browser, but not the template rendering part. A good description of the issues can be found here. This article discusses three main solutions:
- parse the ajax JSON directly
- use an offline Javascript interpreter to process the request SpiderMonkey, crowbar
- use a browser automation tool splinter
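Option 1 is often the lightest-weight approach: find the JSON endpoint the page calls (via the browser's network tab) and request it directly. A minimal sketch, with a hypothetical payload standing in for the real response:

```python
import json

# Hypothetical payload, shaped like what such an endpoint might return.
# In practice you would locate the real URL in the browser's network tab
# and fetch it with urllib.request.urlopen or requests before parsing.
payload = '{"orders": [{"price": 5.21, "volume": 1000}, {"price": 5.18, "volume": 250}]}'

data = json.loads(payload)
for order in data["orders"]:
    print(order["price"], order["volume"])
```

Once you have the JSON, there is no HTML parsing at all; the rendered table is just a view of this data.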
This answer provides a few more suggestions for option 3, such as selenium or watir. I've used selenium for automated web testing and it's pretty handy.
EDIT
From your comments it looks like it is a handlebars driven site. I'd recommend selenium and beautiful soup. This answer gives a good code example which may be useful:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
    print(tag.text)
Basically, selenium gets the rendered HTML from your browser via the page_source property, and you can then parse it with BeautifulSoup. Good luck :)
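Once page_source is in hand, it is ordinary HTML text; BeautifulSoup's find_all is the convenient way to search it. Purely for illustration, the same "collect the text of matching tags" idea can be tried without any third-party dependency using the stdlib html.parser (the class name and sample markup here are made up):

```python
from html.parser import HTMLParser

class LinkTextCollector(HTMLParser):
    """Collect the text of <a> tags that carry a given class attribute."""
    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.in_match = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a" and ("class", self.wanted_class) in attrs:
            self.in_match = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_match = False

    def handle_data(self, data):
        if self.in_match:
            self.texts.append(data)

html = '<a class="my_class">first</a><a class="other">skip</a><a class="my_class">second</a>'
collector = LinkTextCollector("my_class")
collector.feed(html)
print(collector.texts)  # ['first', 'second']
```

In real code you would feed it driver.page_source instead of the inline sample.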
How to extract total values in a dynamic html with python?
The data is loaded dynamically by javascript from a JSON API response, which is why BeautifulSoup can't grab it. A minimal working solution that calls the API directly, using only requests, is as follows:
import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
params = {
    'items_per_page': '20',
    'null': '',
    'page': '1',
    'sortBy': 'default',
    'sortDir': 'desc',
    'store_id': '0670386'
}

output = []
# url = "https://m.pcone.com.tw/store/0670386?ref=d_item_store"
api_url = 'https://www.pcone.com.tw/api/filterSearchTP'

for i in range(1, 14):
    params['page'] = i  # advance through the result pages
    resp = requests.get(api_url, headers=headers, params=params).json()
    for item in resp['products']:
        productname = item['name']
        productprice = item['msrp']
        ordercount = item['order_count']
        output.append([productname, productprice, ordercount])

df = pd.DataFrame(output, columns=['商品名稱', '價格', '購買人數'])
df.to_excel('松果-瑞昌.xlsx', index=False)
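If pandas is not available, the same output rows can be written with the stdlib csv module (the sample rows and filename below are placeholders mirroring the DataFrame columns above):

```python
import csv

# Example rows shaped like the scraper's output list:
# [product name, price, order count]
output = [["商品A", 199, 12], ["商品B", 299, 7]]

# utf-8-sig adds a BOM so Excel opens the Chinese headers correctly
with open("products.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["商品名稱", "價格", "購買人數"])
    writer.writerows(output)
```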
Extract data from dynamic HTML Table with Python 3
Like you mentioned in your question, this table is changed dynamically by javascript. To get around this you actually have to render the javascript, using either:
- A web driver like selenium, which simulates a website the same way it would look to the user (by rendering the javascript)
- requests-html, a relatively new module that allows you to render the javascript on a webpage and has a lot of other amazing features for web scraping
This is one way to solve your problem using requests-html:
from requests_html import HTMLSession

ISSN = "0897-4756"
address = "https://www.journalguide.com/journals/search?type=journal-name&journal-name={}".format(ISSN)
hdr = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}

ses = HTMLSession()
response = ses.get(address, headers=hdr)
response.html.render()  # render the javascript to load the elements in the table
tree = response.html.lxml  # no need to import lxml.html because requests-html can do this for you

print(tree.xpath('//table[@id="journal-search-results-table"]/text()'))
# >> ['\n', '\n']
print(tree.xpath('//table[@id="journal-search-results-table"]//td/text()'))
# >> ['ACS Publications', '1.905', 'No', '\n', '\n', '\n']
How to extract dynamic html content using python
You could try something like this:
from bs4 import BeautifulSoup
html = """<section>
<div class="columns">
<div class="column">
<div class="message is-primary">
<header class="message-header">
<h4>Technical Characteristics</h4>
</header>
<div class="message-body">
<dl class="dl-horizontal">
<dt>END ITEM IDENTIFICATION</dt>
<dd>(AGAV) END ITEM 6675014301965</dd>
<dt>BODY STYLE</dt>
<dd>(AAQL) TUBE TYPE</dd>
<dt>CONTINUOUS CURRENT RATING IN AMPS</dt>
<dd>(AEBJ) 1.600</dd>
<dt>III END ITEM IDENTIFICATION</dt>
<dd>(AGAV) END ITEM 6675014301965</dd>
</dl>
</div>
</div>
</div>
</div>
</section>"""
soup = BeautifulSoup(html, 'html.parser')
dts = soup.find_all("dt")
outs = {i.string: i.find_next("dd").string for i in dts}
print(outs)
#> {'END ITEM IDENTIFICATION': '(AGAV) END ITEM 6675014301965', 'BODY STYLE': '(AAQL) TUBE TYPE', 'CONTINUOUS CURRENT RATING IN AMPS': '(AEBJ) 1.600', 'III END ITEM IDENTIFICATION': '(AGAV) END ITEM 6675014301965'}
Construct a dynamic HTML content using Database values
The best way to approach this would be to read the data using Psycopg, and then use a templating engine - jinja2, for instance - to generate the HTML from the database rows.
Update:
A sketch of a solution might look like this (untested).
import psycopg2
import jinja2
from collections import namedtuple

TEMPLATE = """
<html><head></head><body>
<ul>
{% for row in rows %}
  <li>{{ row.column1 }}, {{ row.column2 }}</li>
{% endfor %}
</ul>
</body></html>
"""

env = jinja2.Environment()
template = env.from_string(TEMPLATE)

with psycopg2.connect("dbname=test user=postgres") as conn:
    with conn.cursor() as curs:
        curs.execute("SELECT column1, column2 FROM test;")
        row_tuple = namedtuple("Row", [col[0] for col in curs.description])
        # _make builds a namedtuple from an iterable of values
        print(template.render(rows=[row_tuple._make(row) for row in curs.fetchall()]))
getting html dynamic content python3
Here is a solution using Selenium and Firefox:
- Open a browser window and navigate to the url
- Wait until the link for practice appears
- Extract all span elements that hold part of the text
- Create the output string. If the first word has only one letter there will be only 2 span elements; if the first word has more than one letter there will be 3 span elements.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'http://play.typeracer.com/'
browser = webdriver.Firefox()
browser.get(url)

try:  # waiting till link is loaded
    element = WebDriverWait(browser, 30).until(
        EC.presence_of_element_located((By.LINK_TEXT, 'Practice')))
finally:  # link loaded -> click it
    element.click()

try:  # wait till text is loaded
    WebDriverWait(browser, 30).until(
        EC.presence_of_element_located((By.XPATH, '//span[@unselectable="on"]')))
finally:  # extract text
    spans = browser.find_elements_by_xpath('//span[@unselectable="on"]')

if len(spans) == 2:  # first word has only one letter
    text = f'{spans[0].text} {spans[1].text}'
elif len(spans) == 3:  # first word has more than one letter
    text = f'{spans[0].text}{spans[1].text} {spans[2].text}'
else:
    text = ' '.join([span.text for span in spans])
    print(f'special case that is not handled yet: {text}')
print(text)
>>> 'Scissors cuts paper. Paper covers rock. Rock crushes lizard. Lizard poisons Spock. Spock smashes scissors. Scissors decapitates lizard. Lizard eats paper. Paper disproves Spock. Spock vaporizes rock. And as it always has, rock crushes scissors.'
Update
Just in case you also want to automate the typing afterwards ;)
try:
    txt_input = WebDriverWait(browser, 30).until(
        EC.presence_of_element_located((By.XPATH,
            '//input[@class="txtInput" and @autocorrect="off"]')))
finally:
    for letter in text:
        txt_input.send_keys(letter)
The reason for the try: ... finally: ... blocks is that we have to wait until the content is loaded, which can sometimes take quite a while.
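Under the hood, WebDriverWait is essentially a poll loop: check a condition, sleep, repeat until a timeout expires. A minimal stdlib sketch of that idea (wait_until is a hypothetical helper, not part of selenium):

```python
import time

def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within {} seconds".format(timeout))

# Usage: wait for a list to be populated (stands in for content arriving asynchronously).
items = []
items.append("loaded")
print(wait_until(lambda: items))  # ['loaded']
```

Selenium's expected_conditions are just predicates of this shape that query the DOM on each poll.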
How to send dynamically a value from HTML on Flask?
<td><input type="text" name="id_country" value="{{row[0]}}"></td>
You set the same name in each loop iteration, so you end up with many inputs sharing that name, and that's why you always get the first value.
I prefer links rather than a form for such things, and use a url parameter on the flask side.
<a href="http://yourip/delete_country/{{row[0]}}"></a>
@app.route("/delete_country/<id>", methods=['GET'])
def delete_country(id):
    id_country = id
    return id_country
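The per-row links can be generated by any templating system; a dependency-free sketch with the stdlib string.Template (the URLs and sample rows here are placeholders):

```python
from string import Template

link = Template('<a href="/delete_country/$country_id">delete $name</a>')

rows = [(1, "France"), (2, "Japan")]  # (id, name) pairs as they might come from the DB
for country_id, name in rows:
    print(link.substitute(country_id=country_id, name=name))
# <a href="/delete_country/1">delete France</a>
# <a href="/delete_country/2">delete Japan</a>
```

Because each link carries its own id in the path, the /delete_country/<id> route receives the right value no matter how many rows the table has.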