Extracting a URL in Python
In response to the OP's edit, I adapted the answer from "Find Hyperlinks in Text using Python (twitter related)" and came up with this:
import re
myString = "This is my tweet check it out http://example.com/blah"
# raw string so that \s reaches the regex engine without an invalid-escape warning
print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
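One thing to watch for with the one-liner: re.search returns None when the text contains no URL, so calling .group on the result raises AttributeError. A minimal sketch of a guarded version follows; the helper name first_url and the second sample string are made up for illustration.

```python
import re

# Same pattern as in the answer, compiled once as a raw string.
URL_PATTERN = re.compile(r"(?P<url>https?://[^\s]+)")


def first_url(text):
    """Return the first http(s) URL found in text, or None if there is none."""
    match = URL_PATTERN.search(text)
    return match.group("url") if match else None


print(first_url("This is my tweet check it out http://example.com/blah"))
print(first_url("No links in this tweet"))  # hypothetical input without a URL
```

Returning None leaves it to the caller to decide what a missing link means, rather than failing in the middle of a loop over many tweets.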
How do you extract a URL from a string using Python?
There may be a few ways to do this, but the cleanest is to use a regular expression:
>>> import re
>>> myString = "This is a link http://www.google.com"
>>> print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
http://www.google.com
If the string may contain multiple links, use re.findall, which returns every match as a list:
>>> myString = "These are the links http://www.google.com and http://stackoverflow.com/questions/839994/extracting-a-url-in-python"
>>> print(re.findall(r'(https?://[^\s]+)', myString))
['http://www.google.com', 'http://stackoverflow.com/questions/839994/extracting-a-url-in-python']
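Because [^\s]+ matches everything up to the next whitespace, a URL followed directly by a comma, full stop, or closing bracket keeps that character. The sketch below strips such trailing punctuation after matching. The helper name find_urls, the set of stripped characters, and the sample sentence are assumptions for illustration, and a URL that legitimately ends in ")" would be shortened by it.

```python
import re

# \S is equivalent to [^\s]: any non-whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")

# Characters assumed to be sentence punctuation rather than part of the URL.
TRAILING_PUNCTUATION = ".,;:!?)]}'\""


def find_urls(text):
    """Return every http(s) URL in text with trailing punctuation stripped."""
    return [url.rstrip(TRAILING_PUNCTUATION) for url in URL_PATTERN.findall(text)]


text = "See http://www.google.com, then (http://stackoverflow.com/questions/839994)."
print(find_urls(text))
```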
How to extract links from a website and extract its content in web scraping using Python
Add an additional request inside your loop that fetches each article page, and grab the description there:
page = requests.get(link)
soup = BeautifulSoup(page.content, features = "lxml")
description = soup.select_one('div.articleMainText').get_text()
print(f" description: {description}")
Example
import requests
from bs4 import BeautifulSoup
url = "https://www.annahar.com/english/section/186-mena"
html_text = requests.get(url)
soup = BeautifulSoup(html_text.content, features = "lxml")
# print(soup.prettify())
# scraping HTML elements such as title, link, and publication date
for index, new in enumerate(soup.select('div#listingDiv44083 div.article')):
    published_date = new.find('span', class_="article__time-stamp").get_text(strip=True)
    title = new.find('h3', class_="article__title").get_text(strip=True)
    link = new.find('a', class_="article__link").attrs['href']
    # fetch the article page itself and read the description from it
    page = requests.get(link)
    article_soup = BeautifulSoup(page.content, features="lxml")
    description = article_soup.select_one('div.articleMainText').get_text()
    print(f" publish_date: {published_date}")
    print(f" title: {title}")
    print(f" link: {link}")
    print(f" description: {description}", '\n')
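The loop passes each href straight to requests.get, which only works if the listing page uses absolute URLs. I have not checked whether that is always the case on this site, so here is a small sketch, under that caveat, of resolving a link against the listing URL first with urllib.parse.urljoin. The helper name absolute_link and both article paths are hypothetical.

```python
from urllib.parse import urljoin

# Listing page from the example above, used as the base for relative links.
BASE_URL = "https://www.annahar.com/english/section/186-mena"


def absolute_link(href, base=BASE_URL):
    """Resolve href against base; absolute URLs are returned unchanged."""
    return urljoin(base, href)


# Hypothetical hrefs, one relative and one already absolute.
print(absolute_link("/english/article/123-example"))
print(absolute_link("https://www.annahar.com/english/article/456-example"))
```

In the loop this would amount to calling requests.get(absolute_link(link)) instead of requests.get(link).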
Extracting a URL from HTML in Python
I would instead use a CSS attribute = value selector to target the single element housing that data, as it is more intuitive to read. Then you simply need to extract the content attribute and parse it with the json library, filtering for the url key-value pairs.
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
data = json.loads(soup.select_one('[title-text="Seleccione una línea:"]')['content'])
links = [i['url'] for i in data]
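To see the idea without hitting the live site, here is an offline sketch of the same two steps: find the element by its title-text attribute, then parse the JSON held in its content attribute. It swaps BeautifulSoup for the standard library's html.parser so it runs with no extra installs. The sample markup, the div tag, the JSON payload, and the class name ContentAttributeParser are all made up for illustration and are not taken from the real page.

```python
import json
from html.parser import HTMLParser

# Hypothetical markup: only the title-text/content attribute pairing mirrors
# the answer; the tag and the JSON entries are invented.
SAMPLE_HTML = """
<div title-text="Seleccione una línea:"
     content='[{"text": "C1", "url": "/lineas/c1"}, {"text": "C2", "url": "/lineas/c2"}]'>
</div>
"""


class ContentAttributeParser(HTMLParser):
    """Store the content attribute of the element with a given title-text."""

    def __init__(self, title_text):
        super().__init__()
        self.title_text = title_text
        self.content = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("title-text") == self.title_text:
            self.content = attributes.get("content")


parser = ContentAttributeParser("Seleccione una línea:")
parser.feed(SAMPLE_HTML)

# The attribute value is a JSON array of objects; keep only the url values.
data = json.loads(parser.content)
links = [item["url"] for item in data if "url" in item]
print(links)
```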