How can I read the contents of a URL with Python?
To answer your question:
import urllib
link = "http://www.somesite.com/details.pl?urn=2344"
f = urllib.urlopen(link)
myfile = f.read()
print(myfile)
You need to read(), not readline().
EDIT (2018-06-25): In Python 3, the legacy urllib.urlopen() was replaced by urllib.request.urlopen() (see the notes at https://docs.python.org/3/library/urllib.request.html#urllib.request.urlopen for details).
If you're using Python 3, see the answers by Martin Thoma or i.n.n.m on this question:
https://stackoverflow.com/a/28040508/158111 (Python 2/3 compat)
https://stackoverflow.com/a/45886824/158111 (Python 3)
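For a quick idea, the Python 3 equivalent of the snippet above can be sketched as follows (the helper name and the decoding choice are my own, not from the linked answers):

```python
import urllib.request

def read_url(link):
    # urlopen() returns a file-like object; read() yields bytes,
    # so decode them with the page's encoding (utf-8 assumed here)
    with urllib.request.urlopen(link) as f:
        return f.read().decode("utf-8")
```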
Or, just get this library here: http://docs.python-requests.org/en/latest/ and seriously use it :)
import requests
link = "http://www.somesite.com/details.pl?urn=2344"
f = requests.get(link)
print(f.text)
Read and process data from URL in python
When it comes to reading data from URLs, the requests
library is much simpler:
import requests
url = "https://www.example.com/your/target.html"
text = requests.get(url).text
If you haven't got it installed you could use the following to do so:
pip3 install requests
Next, why go through the hassle of shoving all of your words into a single regular expression when you could use a word array and a for loop instead?
For example:
import re

search_words = "hello word world".split(" ")
matching_lines = []

for (i, line) in enumerate(text.split("\n")):
    line = line.strip()
    if len(line) < 1:
        continue
    for word in search_words:
        if re.search(r"\b" + word + r"\b", line):
            matching_lines.append(line)
            break
Then you'd output the result, like this:
print(matching_lines)
Running this where the text variable equals:
"""
this word will save the line
ignore me!
hello my friend!
what about me?
"""
Should output:
[
"this word will save the line",
"hello my friend!"
]
You could make the search case-insensitive by using the lower method, like this:
search_words = "hello word world".lower().split(" ")
matching_lines = []

for (i, line) in enumerate(text.split("\n")):
    line = line.strip()
    if len(line) < 1:
        continue
    line = line.lower()
    for word in search_words:
        if re.search(r"\b" + word + r"\b", line):
            matching_lines.append(line)
            break
Notes and information:
- the break statement prevents a line from being matched (and appended) more than once when several of the search words appear in it
- the enumerate function lets us iterate over the index and the current line at the same time
- I didn't put the lower call for the search words inside of the for loop, to avoid re-lowercasing them for every line
- I didn't call lower on the line until after the empty-line check, because there's no point in lowercasing an empty line
Good luck.
Given a URL to a text file, what is the simplest way to read the contents of the text file?
Edit 09/2016: In Python 3 and up, use urllib.request instead of urllib2
Actually the simplest way is:
import urllib2 # the lib that handles the url stuff
data = urllib2.urlopen(target_url) # it's a file like object and works just like a file
for line in data: # files are iterable
    print line
You don't even need "readlines", as Will suggested. You could even shorten it to: *
import urllib2
for line in urllib2.urlopen(target_url):
    print line
But remember in Python, readability matters.
However, while this is the simplest way, it is not the safest: in network programming, you often don't know whether the amount of data you expect will be respected. So you'd generally better read a fixed, reasonable amount of data, something you know is enough for what you expect but that prevents your script from being flooded:
import urllib2
data = urllib2.urlopen("http://www.google.com").read(20000) # read only 20 000 chars
data = data.split("\n") # then split it into lines
for line in data:
    print line
* Second example in Python 3:
import urllib.request # the lib that handles the url stuff
for line in urllib.request.urlopen(target_url):
    print(line.decode('utf-8')) # utf-8 or iso8859-1 or whatever the page encoding scheme is
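The bounded-read advice translates to Python 3 the same way (a sketch, assuming UTF-8 content; note that in Python 3 read(n) counts bytes, not characters):

```python
import urllib.request

def read_capped(url, limit=20000):
    # read at most `limit` bytes so an oversized response
    # cannot flood the script
    with urllib.request.urlopen(url) as resp:
        return resp.read(limit).decode("utf-8")
```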
How can I read the contents of a URL with Transcrypt? Where is urlopen() located?
I don't believe Transcrypt has the Python urllib library available. You will need to use a corresponding JavaScript library instead. I prefer axios, but you can also just use the built-in XMLHttpRequest() or window.fetch().
Here is a Python function you can incorporate that uses window.fetch():
def fetch(url, callback):
    def check_response(response):
        if response.status != 200:
            console.error('Fetch error - Status Code: ' + response.status)
            return None
        return response.json()
    prom = window.fetch(url)
    resp = prom.then(check_response)
    resp.then(callback)
    prom.catch(console.error)
Just call this fetch function from your Python code and pass in the URL and a callback to utilize the response after it is received.
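A call might look like this (a sketch; the URL and the callback body are placeholders, and it only runs in the browser once Transcrypt has compiled it):

```python
def handle(data):
    # do something with the parsed JSON
    console.log(data)

fetch('https://api.example.com/items.json', handle)
```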
read urls from a text file
If you have one URL per line, then simply open the file, read all the text, and split on \n to get a list of lines (without the \n):
with open('input.txt') as fh:
    text = fh.read()
    all_links = text.split('\n')
or shorter
with open('input.txt') as fh:
    all_links = fh.read().split('\n')
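One caveat: if the file ends with a newline, split('\n') leaves an empty string at the end of the list. A small sketch that avoids that with splitlines() and a filter:

```python
def load_links(text):
    # splitlines() drops the newline characters, and the
    # condition skips blank lines (including a trailing one)
    return [line.strip() for line in text.splitlines() if line.strip()]
```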
And later you have to use a for loop to run the code for every URL:
# - before loop -

final_data = []

# - loop -

for url in all_links:
    # ... code ...

# - after loop -

print(final_data)

# ... write in csv ...
EDIT:
import requests
from bs4 import BeautifulSoup
import csv
# - before loop -
#all_links = [
# "https://denver.craigslist.org/search/cto?purveyor-input=owner&postedToday=1",
#]
with open('input.txt') as fh:
    all_links = fh.read().split('\n')

final_data = []

# - loop -

for url in all_links:
    print('url:', url)

    response = requests.get(url)
    #print('[DEBUG] code:', response.status_code)

    soup = BeautifulSoup(response.text, "html.parser")

    all_rows = soup.find_all(class_="result-row")

    for row in all_rows:
        row_links = row.find_all(class_="hdrlnk")  # renamed so it doesn't overwrite the outer all_links
        for link in row_links:
            href = link.get("href")
            final_data.append( [href] )
            print(' >', href)

    print('----------')

# - after loop -

#print(final_data)

filename = "output.csv"  # no need to add `./`

with open(filename, "w", newline="") as csv_file:  # newline="" avoids blank rows on Windows
    csv_writer = csv.writer(csv_file, delimiter=",")
    csv_writer.writerow( ["links"] )
    csv_writer.writerows( final_data )  # with `s` at the end
How to read urls from a text file one by one to perform
Here's the basic way:
with open("file.txt", "r") as f:
    un_parsed_urls = f.readlines()

parsed_urls = [url.replace("\n", "") for url in un_parsed_urls]

for url in parsed_urls:
    driver.get(url)
    print(driver.title)
    print(driver.current_url)

    userName = driver.find_element_by_name("name")
    userName.send_keys("someguy")
You can use different threads to speed this up:
from threading import Thread

def web_automation(url):
    # note: a single shared driver is not thread-safe;
    # in practice each thread should create its own driver instance
    driver.get(url)
    print(driver.title)
    print(driver.current_url)

    userName = driver.find_element_by_name("name")
    userName.send_keys("someguy")

with open("file.txt", "r") as f:
    un_parsed_urls = f.readlines()

parsed_urls = [url.replace("\n", "") for url in un_parsed_urls]

thread_list = []

for url in parsed_urls:
    t = Thread(target=web_automation, args=[url])
    thread_list.append(t)

for t in thread_list:
    t.start()

for t in thread_list:
    t.join()
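If you'd rather not manage Thread objects by hand, concurrent.futures gives the same fan-out with less bookkeeping (a sketch with a placeholder task standing in for the Selenium calls; each thread would still need its own driver):

```python
from concurrent.futures import ThreadPoolExecutor

def process_url(url):
    # placeholder for the real per-URL work (driver.get(url), etc.)
    return url.strip().lower()

def process_all(urls, max_workers=4):
    # map() fans the work out over a thread pool and
    # returns the results in the original order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_url, urls))
```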