How to scrape only visible webpage text with BeautifulSoup?
Try this:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)  # text= is deprecated; string= is the modern spelling
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)
html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))
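To see the filter in action without hitting the network, you can run the same `tag_visible` logic against a small inline document (the HTML snippet below is a made-up example):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = """
<html><head><title>Page title</title><style>body {color: red}</style></head>
<body><p>Visible paragraph.</p><script>var x = 1;</script>
<!-- hidden comment --></body></html>
"""

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

soup = BeautifulSoup(html, 'html.parser')
texts = soup.find_all(string=True)
# keep only visible, non-empty strings
visible = [t.strip() for t in filter(tag_visible, texts) if t.strip()]
print(visible)  # ['Visible paragraph.']
```

The title, stylesheet, script body, and HTML comment are all dropped; only the paragraph text survives.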
Extract main text from any webpage using BeautifulSoup and Python
There is no universal way to tackle this problem: every website has its own structure and conventions. You might try blacklist approaches with some regexes, but none of them will be satisfying. I know the question asks how to do this with bs4, but I will suggest another way: trafilatura, as shown here:
pip install trafilatura
import trafilatura
downloaded = trafilatura.fetch_url('your url here')
trafilatura.extract(downloaded)
which quickly returns the clean content of the page as a string.
Reference: https://trafilatura.readthedocs.io/en/latest/
How to get all visible text in a web page (not html source)?
There are several ways of doing this, but the one I usually use is:
from bs4 import BeautifulSoup as bs
import requests_html
s = requests_html.HTMLSession()
page = s.get('https://www.google.com')
soup = bs(page.text, 'lxml')
print(soup.get_text())
Output:
About Store GmailImagesSign in Remove Report inappropriate predictions PrivacyTermsSettingsSearch settingsAdvanced searchYour data in SearchHistorySearch HelpSend feedbackAdvertisingBusiness How Search works
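As the run-together output above suggests, `get_text()` with default arguments joins adjacent strings with no delimiter. The `separator` and `strip` parameters control how the text nodes are joined; a small inline example (the HTML is made up):

```python
from bs4 import BeautifulSoup

html = "<div><h1>Title</h1><p>First para.</p><p>Second para.</p></div>"
soup = BeautifulSoup(html, 'html.parser')

# strip=True trims each text node and skips empty ones;
# separator=' ' keeps adjacent strings from running together
text = soup.get_text(separator=' ', strip=True)
print(text)  # Title First para. Second para.
```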
Grab full text with beautifulsoup
That port number is loaded by JavaScript, so to get it you have to use Selenium.
Here is how Selenium can be used to get the proxy list (port numbers included):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome("chromedriver.exe", options=options)  # Selenium 4+ prefers Service("chromedriver.exe") over a bare path
url_req = "https://spys.one/en/https-ssl-proxy/"
driver.get(url_req)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, "lxml")
driver.close()
trs = soup.find_all('tr', {'class': ['spy1x', 'spy1xx']})
for i in trs[1:]:  # skip the header row
    print(i.select_one('td').text.strip())
Sample Output:
65.121.180.14:21988
139.162.20.252:8080
103.250.153.203:8080
190.145.200.126:53281
5.63.162.174:8080
45.190.13.50:999
34.122.246.161:3128
...
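Once the page source is in hand, the BeautifulSoup part is independent of Selenium; the multi-class row lookup can be tried on an inline snippet (the table below is a made-up stand-in for the real page):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr class="spy1x"><td>Header</td></tr>
  <tr class="spy1xx"><td>1.2.3.4:8080</td></tr>
  <tr class="spy1x"><td>5.6.7.8:3128</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# a list value matches rows whose class is ANY of the given names
trs = soup.find_all('tr', {'class': ['spy1x', 'spy1xx']})
ports = [row.select_one('td').text.strip() for row in trs[1:]]  # skip the header row
print(ports)  # ['1.2.3.4:8080', '5.6.7.8:3128']
```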
BeautifulSoup (Python): how grab text-string next to a tag (that may or may not exist)?
As you loop over the listings, you can test whether the calendar-icon class is present; if it is, grab the next_sibling:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.mascus.se/entreprenadmaskiner/begagnade-pneumatiska-hjulvaltar')
soup = bs(r.content, 'lxml')
listings = soup.select('.single-result')
for listing in listings:
    calendar = listing.select_one('.fa-calendar')
    if calendar is not None:
        print(calendar.next_sibling)
    else:
        print('Not present')
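The key detail is that `next_sibling` returns the node immediately after the tag, which here is a bare text node. A minimal inline example (the HTML fragment is made up):

```python
from bs4 import BeautifulSoup

html = '<div class="single-result"><i class="fa-calendar"></i> 2018</div>'
soup = BeautifulSoup(html, 'html.parser')

icon = soup.select_one('.fa-calendar')
# next_sibling is the text node right after the <i> tag, or the tag may be absent
year = icon.next_sibling.strip() if icon is not None else 'Not present'
print(year)  # 2018
```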
Unable to extract some links in a webpage using BeautifulSoup
The links to the images are embedded in the HTML source, but you need to get them out first. Then, once you have the image source urls, you can download them, if you feel like it.
Here's how:
import json
import re
import requests
from bs4 import BeautifulSoup
url = "https://www.wikiart.org/en/paintings-by-style/magic-realism?select=featured#!#filterName:featured,viewType:masonry"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
}

soup = (
    BeautifulSoup(requests.get(url, headers=headers).text, "lxml")
    .find("div", class_="artworks-by-dictionary")["ng-init"]
)

images = [
    i["image"] for i in json.loads(re.search(r":\s(\[.*\])", soup).group(1))
]
print("\n".join(images))
Output:
https://uploads1.wikiart.org/images/felix-vallotton/portrait-of-thadee-nathanson-1897.jpg
https://uploads1.wikiart.org/images/felix-vallotton/the-source-1897.jpg
https://uploads3.wikiart.org/00236/images/telemaco-signorini/pag026.jpg
https://uploads6.wikiart.org/images/felix-vallotton/laid-down-woman-sleeping-1899.jpg
https://uploads6.wikiart.org/images/felix-vallotton/sunset-1910.jpg
https://uploads8.wikiart.org/images/felix-vallotton/red-sand-and-snow-1901.jpg
https://uploads2.wikiart.org/images/felix-vallotton/the-pier-of-honfleur-1901.jpg
https://uploads4.wikiart.org/images/felix-vallotton/the-pont-neuf-1901.jpg
https://uploads7.wikiart.org/images/pierre-roy/les-mauvaises-graines-1901.jpg
https://uploads2.wikiart.org/images/felix-vallotton/the-way-to-locquirec-1902.jpg
https://uploads8.wikiart.org/images/felix-vallotton/the-five-painters-1902.jpg
https://uploads1.wikiart.org/images/felix-vallotton/the-toilet-1905.jpg
and more...
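The regex step is the non-obvious part: it pulls the JSON array out of the `ng-init` attribute string. It can be exercised offline on a simplified stand-in for the real attribute (the HTML and URLs below are made up):

```python
import json
import re
from bs4 import BeautifulSoup

# hypothetical stand-in for the real ng-init attribute contents
html = ('<div class="artworks-by-dictionary" '
        'ng-init=\'items: [{"image": "https://example.com/a.jpg"}, '
        '{"image": "https://example.com/b.jpg"}]\'></div>')

attr = BeautifulSoup(html, 'html.parser').find('div', class_='artworks-by-dictionary')['ng-init']
# same regex as above: capture the JSON array that follows a colon and whitespace
images = [i['image'] for i in json.loads(re.search(r':\s(\[.*\])', attr).group(1))]
print(images)  # ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```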
EDIT:
Actually, there's an even easier way to get that data:
import json
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "https://www.wikiart.org/en/paintings-by-style/magic-realism?select=featured",
}

url = "https://www.wikiart.org/en/paintings-by-style/magic-realism?select=featured&json=2&layout=new&page=1&resultType=masonry"
paintings = requests.get(url, headers=headers).json()["Paintings"]

for painting in paintings:
    print(f"{painting['artistName']} - {painting['title']}\n{painting['image']}")
Output:
Felix Vallotton - Portrait of Thadee Nathanson
https://uploads1.wikiart.org/images/felix-vallotton/portrait-of-thadee-nathanson-1897.jpg
Felix Vallotton - The Source
https://uploads1.wikiart.org/images/felix-vallotton/the-source-1897.jpg
Telemaco Signorini - The morning toilet
https://uploads3.wikiart.org/00236/images/telemaco-signorini/pag026.jpg
Felix Vallotton - Laid down woman, sleeping
https://uploads6.wikiart.org/images/felix-vallotton/laid-down-woman-sleeping-1899.jpg
Felix Vallotton - Sunset
https://uploads6.wikiart.org/images/felix-vallotton/sunset-1910.jpg
Felix Vallotton - Red Sand and Snow
https://uploads8.wikiart.org/images/felix-vallotton/red-sand-and-snow-1901.jpg
Felix Vallotton - The pier of Honfleur
https://uploads2.wikiart.org/images/felix-vallotton/the-pier-of-honfleur-1901.jpg
and more ...
BONUS: By incrementing the page value in the URL you can paginate the search.
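A minimal sketch of that pagination, assuming the page query parameter keeps behaving the same way on later pages (the URLs are only built here, not fetched):

```python
# build the paged request URLs without fetching anything;
# the query shape is taken from the answer above
base = ("https://www.wikiart.org/en/paintings-by-style/magic-realism"
        "?select=featured&json=2&layout=new&resultType=masonry")
urls = [f"{base}&page={n}" for n in range(1, 4)]
for u in urls:
    print(u)
```

In a real run you would stop when a request returns an empty `"Paintings"` list.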