How to scrape only visible webpage text with BeautifulSoup?
Try this:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)  # text= is deprecated; string= is the modern spelling
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)
html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))
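To see the filter in action without hitting the network, you can run the same `tag_visible` logic against a small inline document (the HTML snippet below is a made-up example):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = """
<html><head><title>Page title</title><style>body {color: red}</style></head>
<body><p>Visible paragraph.</p><script>var x = 1;</script>
<!-- hidden comment --></body></html>
"""

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

soup = BeautifulSoup(html, 'html.parser')
texts = soup.find_all(string=True)
# keep only visible, non-empty strings
visible = [t.strip() for t in filter(tag_visible, texts) if t.strip()]
print(visible)  # ['Visible paragraph.']
```

The title, stylesheet, script body, and HTML comment are all dropped; only the paragraph text survives.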
Extract main text from any webpage using BeautifulSoup and Python
There is no universal way to tackle this problem: every website has its own structure and conventions. You might try blacklist approaches with some regexes, but none of them will be satisfying. I know the question asks how to do this with bs4, but I will suggest another way: trafilatura, as shown here:
pip install trafilatura
import trafilatura
downloaded = trafilatura.fetch_url('your url here')
trafilatura.extract(downloaded)
which quickly returns the clean content of the page as a string.
Reference: https://trafilatura.readthedocs.io/en/latest/
How to get all visible text in a web page (not html source)?
There are several ways of doing this, but the one I usually use is:
from bs4 import BeautifulSoup as bs
import requests_html
s = requests_html.HTMLSession()
page = s.get('https://www.google.com')
soup = bs(page.text, 'lxml')
print(soup.get_text())
Output:
About Store GmailImagesSign in Remove Report inappropriate predictions PrivacyTermsSettingsSearch settingsAdvanced searchYour data in SearchHistorySearch HelpSend feedbackAdvertisingBusiness How Search works
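As the run-together output above suggests, `get_text()` with default arguments joins adjacent strings with no delimiter. The `separator` and `strip` parameters control how the text nodes are joined; a small inline example (the HTML is made up):

```python
from bs4 import BeautifulSoup

html = "<div><h1>Title</h1><p>First para.</p><p>Second para.</p></div>"
soup = BeautifulSoup(html, 'html.parser')

# strip=True trims each text node and skips empty ones;
# separator=' ' keeps adjacent strings from running together
text = soup.get_text(separator=' ', strip=True)
print(text)  # Title First para. Second para.
```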
Grab full text with beautifulsoup
That port number is loaded by JavaScript, so to get it you have to use Selenium.
Here is how Selenium can be used to get the proxy list (port numbers included):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome("chromedriver.exe", options=options)  # Selenium 4+ prefers Service("chromedriver.exe") over a bare path
url_req = "https://spys.one/en/https-ssl-proxy/"
driver.get(url_req)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, "lxml")
driver.close()
trs = soup.find_all('tr', {'class': ['spy1x', 'spy1xx']})
for i in trs[1:]:  # skip the header row
    print(i.select_one('td').text.strip())
Sample Output:
65.121.180.14:21988
139.162.20.252:8080
103.250.153.203:8080
190.145.200.126:53281
5.63.162.174:8080
45.190.13.50:999
34.122.246.161:3128
...
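Once the page source is in hand, the BeautifulSoup part is independent of Selenium; the multi-class row lookup can be tried on an inline snippet (the table below is a made-up stand-in for the real page):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr class="spy1x"><td>Header</td></tr>
  <tr class="spy1xx"><td>1.2.3.4:8080</td></tr>
  <tr class="spy1x"><td>5.6.7.8:3128</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# a list value matches rows whose class is ANY of the given names
trs = soup.find_all('tr', {'class': ['spy1x', 'spy1xx']})
ports = [row.select_one('td').text.strip() for row in trs[1:]]  # skip the header row
print(ports)  # ['1.2.3.4:8080', '5.6.7.8:3128']
```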
BeautifulSoup (Python): how grab text-string next to a tag (that may or may not exist)?
As you loop over the listings, you can test whether the calendar-icon class is present; if it is, grab the next_sibling:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.mascus.se/entreprenadmaskiner/begagnade-pneumatiska-hjulvaltar')
soup = bs(r.content, 'lxml')
listings = soup.select('.single-result')
for listing in listings:
    calendar = listing.select_one('.fa-calendar')
    if calendar is not None:
        print(calendar.next_sibling)
    else:
        print('Not present')
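The key detail is that `next_sibling` returns the node immediately after the tag, which here is a bare text node. A minimal inline example (the HTML fragment is made up):

```python
from bs4 import BeautifulSoup

html = '<div class="single-result"><i class="fa-calendar"></i> 2018</div>'
soup = BeautifulSoup(html, 'html.parser')

icon = soup.select_one('.fa-calendar')
# next_sibling is the text node right after the <i> tag, or the tag may be absent
year = icon.next_sibling.strip() if icon is not None else 'Not present'
print(year)  # 2018
```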
Unable to extract some links in a webpage using BeautifulSoup
The links to the images are embedded in the HTML source, but you need to get them out first. Then, once you have the image source urls, you can download them, if you feel like it.
Here's how:
import json
import re
import requests
from bs4 import BeautifulSoup
url = "https://www.wikiart.org/en/paintings-by-style/magic-realism?select=featured#!#filterName:featured,viewType:masonry"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
}

soup = (
    BeautifulSoup(requests.get(url, headers=headers).text, "lxml")
    .find("div", class_="artworks-by-dictionary")["ng-init"]
)

images = [
    i["image"] for i in json.loads(re.search(r":\s(\[.*\])", soup).group(1))
]
print("\n".join(images))
Output:
https://uploads1.wikiart.org/images/felix-vallotton/portrait-of-thadee-nathanson-1897.jpg
https://uploads1.wikiart.org/images/felix-vallotton/the-source-1897.jpg
https://uploads3.wikiart.org/00236/images/telemaco-signorini/pag026.jpg
https://uploads6.wikiart.org/images/felix-vallotton/laid-down-woman-sleeping-1899.jpg
https://uploads6.wikiart.org/images/felix-vallotton/sunset-1910.jpg
https://uploads8.wikiart.org/images/felix-vallotton/red-sand-and-snow-1901.jpg
https://uploads2.wikiart.org/images/felix-vallotton/the-pier-of-honfleur-1901.jpg
https://uploads4.wikiart.org/images/felix-vallotton/the-pont-neuf-1901.jpg
https://uploads7.wikiart.org/images/pierre-roy/les-mauvaises-graines-1901.jpg
https://uploads2.wikiart.org/images/felix-vallotton/the-way-to-locquirec-1902.jpg
https://uploads8.wikiart.org/images/felix-vallotton/the-five-painters-1902.jpg
https://uploads1.wikiart.org/images/felix-vallotton/the-toilet-1905.jpg
and more...
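The regex step is the non-obvious part: it pulls the JSON array out of the `ng-init` attribute string. It can be exercised offline on a simplified stand-in for the real attribute (the HTML and URLs below are made up):

```python
import json
import re
from bs4 import BeautifulSoup

# hypothetical stand-in for the real ng-init attribute contents
html = ('<div class="artworks-by-dictionary" '
        'ng-init=\'items: [{"image": "https://example.com/a.jpg"}, '
        '{"image": "https://example.com/b.jpg"}]\'></div>')

attr = BeautifulSoup(html, 'html.parser').find('div', class_='artworks-by-dictionary')['ng-init']
# same regex as above: capture the JSON array that follows a colon and whitespace
images = [i['image'] for i in json.loads(re.search(r':\s(\[.*\])', attr).group(1))]
print(images)  # ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```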
EDIT:
Actually, there's an even easier way to get that data:
import json
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "https://www.wikiart.org/en/paintings-by-style/magic-realism?select=featured",
}

url = "https://www.wikiart.org/en/paintings-by-style/magic-realism?select=featured&json=2&layout=new&page=1&resultType=masonry"
paintings = requests.get(url, headers=headers).json()["Paintings"]

for painting in paintings:
    print(f"{painting['artistName']} - {painting['title']}\n{painting['image']}")
Output:
Felix Vallotton - Portrait of Thadee Nathanson
https://uploads1.wikiart.org/images/felix-vallotton/portrait-of-thadee-nathanson-1897.jpg
Felix Vallotton - The Source
https://uploads1.wikiart.org/images/felix-vallotton/the-source-1897.jpg
Telemaco Signorini - The morning toilet
https://uploads3.wikiart.org/00236/images/telemaco-signorini/pag026.jpg
Felix Vallotton - Laid down woman, sleeping
https://uploads6.wikiart.org/images/felix-vallotton/laid-down-woman-sleeping-1899.jpg
Felix Vallotton - Sunset
https://uploads6.wikiart.org/images/felix-vallotton/sunset-1910.jpg
Felix Vallotton - Red Sand and Snow
https://uploads8.wikiart.org/images/felix-vallotton/red-sand-and-snow-1901.jpg
Felix Vallotton - The pier of Honfleur
https://uploads2.wikiart.org/images/felix-vallotton/the-pier-of-honfleur-1901.jpg
and more ...
BONUS: By incrementing the page value in the URL you can paginate the search.
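A minimal sketch of that pagination, assuming the page query parameter keeps behaving the same way on later pages (the URLs are only built here, not fetched):

```python
# build the paged request URLs without fetching anything;
# the query shape is taken from the answer above
base = ("https://www.wikiart.org/en/paintings-by-style/magic-realism"
        "?select=featured&json=2&layout=new&resultType=masonry")
urls = [f"{base}&page={n}" for n in range(1, 4)]
for u in urls:
    print(u)
```

In a real run you would stop when a request returns an empty `"Paintings"` list.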