How to Read the Contents of a URL with Python

How can I read the contents of a URL with Python?

To answer your question:

import urllib

link = "http://www.somesite.com/details.pl?urn=2344"
f = urllib.urlopen(link)
myfile = f.read()
print(myfile)

You need to read(), not readline()

EDIT (2018-06-25): Since Python 3, the legacy urllib.urlopen() was replaced by urllib.request.urlopen() (see notes from https://docs.python.org/3/library/urllib.request.html#urllib.request.urlopen for details).
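For reference, the same snippet in Python 3 looks like this (a minimal sketch; the URL is the same placeholder as above, and read() now returns bytes that need decoding):

import urllib.request

link = "http://www.somesite.com/details.pl?urn=2344"
f = urllib.request.urlopen(link)
myfile = f.read()              # bytes in Python 3
print(myfile.decode('utf-8'))  # decode assuming the page is UTF-8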

If you're using Python 3, see answers by Martin Thoma or i.n.n.m within this question:
https://stackoverflow.com/a/28040508/158111 (Python 2/3 compat)
https://stackoverflow.com/a/45886824/158111 (Python 3)

Or, just get this library here: http://docs.python-requests.org/en/latest/ and seriously use it :)

import requests

link = "http://www.somesite.com/details.pl?urn=2344"
f = requests.get(link)
print(f.text)
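If you want the script to fail loudly on HTTP errors, requests can also check the status for you; a small follow-up to the same snippet:

import requests

link = "http://www.somesite.com/details.pl?urn=2344"
f = requests.get(link)
f.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
print(f.text)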

Read and process data from a URL in Python

When it comes to reading data from URLs, the requests library is much simpler:

import requests

url = "https://www.example.com/your/target.html"
text = requests.get(url).text

If you don't have it installed, you can use the following to do so:

pip3 install requests

Next, why go through the hassle of shoving all of your words into a single regular expression when you could use a word array and then use a for loop instead?

For example:

search_words = "hello word world".split(" ")
matching_lines = []

for (i, line) in enumerate(text.split()):
line = line.strip()
if len(line) < 1:
continue
for word i search_words:
if re.search("\b" + word + "\b", line):
matching_lines.append(line)
continue

Then you'd output the result, like this:

print(matching_lines)

Running this where the text variable equals:

"""
this word will save the line
ignore me!
hello my friend!
what about me?
"""

Should output:

[
"this word will save the line",
"hello my friend!"
]

You could make the search case insensitive by using the lower method, like this:

search_words = [word for word in "hello word world".lower().split(" ")]
matching_lines = []

for (i, line) in enumerate(text.split("\n")):
    line = line.strip()
    if len(line) < 1:
        continue
    line = line.lower()
    for word in search_words:
        if re.search(r"\b" + word + r"\b", line):
            matching_lines.append(line)
            break

Notes and information:

  1. the break keyword stops the search as soon as one of the words matches the current line, so a line is only added once
  2. the enumerate function lets us iterate over the index and the current line at the same time
  3. the search words are lowercased outside the loop, so lower doesn't have to be called for every word on every line
  4. lower isn't called on the line until after the length check because there's no point in lowercasing an empty line

Good luck.

Given a URL to a text file, what is the simplest way to read the contents of the text file?

Edit 09/2016: In Python 3 and up use urllib.request instead of urllib2

Actually the simplest way is:

import urllib2  # the lib that handles the url stuff

data = urllib2.urlopen(target_url)  # it's a file-like object and works just like a file
for line in data:  # files are iterable
    print line

You don't even need "readlines", as Will suggested. You could even shorten it to: *

import urllib2

for line in urllib2.urlopen(target_url):
    print line

But remember in Python, readability matters.

However, this is the simplest way, not the safest: with network programming you usually don't know whether the amount of data you expect will be respected. So you'd generally be better off reading a fixed and reasonable amount of data, something you know to be enough for the data you expect but that will prevent your script from being flooded:

import urllib2

data = urllib2.urlopen("http://www.google.com").read(20000)  # read only 20 000 chars
data = data.split("\n")  # then split it into lines

for line in data:
    print line

* Second example in Python 3:

import urllib.request  # the lib that handles the url stuff

for line in urllib.request.urlopen(target_url):
    print(line.decode('utf-8'))  # utf-8 or iso8859-1 or whatever the page encoding scheme is
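The limited-read example translates the same way; a small sketch for Python 3 (note that read() counts bytes there, and the result has to be decoded before splitting):

import urllib.request

data = urllib.request.urlopen("http://www.google.com").read(20000)  # read only the first 20 000 bytes
data = data.decode('utf-8', errors='ignore')  # decode, ignoring a possibly cut-off final character

for line in data.split("\n"):
    print(line)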

How can I read the contents of a URL with Transcrypt? Where is urlopen() located?

I don't believe Transcrypt has the Python urllib library available. You will need to use a corresponding JavaScript library instead. I prefer axios, but you can also just use the built-in XMLHttpRequest() or window.fetch().

Here is a Python function you can incorporate that uses window.fetch():

def fetch(url, callback):
    def check_response(response):
        if response.status != 200:
            console.error('Fetch error - Status Code: ' + response.status)
            return None
        return response.json()

    prom = window.fetch(url)
    resp = prom.then(check_response)
    resp.then(callback)
    prom.catch(console.error)

Just call this fetch function from your Python code and pass in the URL and a callback to utilize the response after it is received.
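For example, a hypothetical callback that just logs the parsed JSON (the URL and function names here are placeholders):

def show_data(data):
    if data is not None:
        console.log(data)

fetch('https://api.example.com/details.json', show_data)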

Read URLs from a text file

If you have every URL on a new line, then simply open the file, read all the text and split it on \n to get a list of lines (without the \n):

with open('input.txt') as fh:
    text = fh.read()
all_links = text.split('\n')

or shorter

with open('input.txt') as fh:
    all_links = fh.read().split('\n')
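Splitting on \n can leave empty strings in the list (for example from a trailing newline), so you may want to strip and filter the lines as well; a small sketch:

with open('input.txt') as fh:
    all_links = [line.strip() for line in fh if line.strip()]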

And later you have to use a for-loop to run the code for all URLs:

# - before loop -

final_data = []

# - loop -

for url in all_links:

    # ... code ...

# - after loop -

print(final_data)

# ... write in csv ...

EDIT:

import requests
from bs4 import BeautifulSoup
import csv

# - before loop -

#all_links = [
#    "https://denver.craigslist.org/search/cto?purveyor-input=owner&postedToday=1",
#]

with open('input.txt') as fh:
    all_links = fh.read().split('\n')

final_data = []

# - loop -

for url in all_links:
    print('url:', url)

    response = requests.get(url)
    #print('[DEBUG] code:', response.status_code)

    soup = BeautifulSoup(response.text, "html.parser")
    all_rows = soup.find_all(class_="result-row")

    for row in all_rows:
        row_links = row.find_all(class_="hdrlnk")  # renamed so it doesn't shadow the outer all_links
        for link in row_links:
            href = link.get("href")
            final_data.append([href])
            print(' >', href)

    print('----------')

# - after loop -

#print(final_data)

filename = "output.csv"  # no need to add `./`

with open(filename, "w") as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=",")
    csv_writer.writerow(["links"])
    csv_writer.writerows(final_data)  # with `s` at the end
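If the output file ends up with blank rows on Windows, the usual fix is to open it with newline='' (standard csv module behaviour, not specific to this script):

with open(filename, "w", newline="") as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=",")
    csv_writer.writerow(["links"])
    csv_writer.writerows(final_data)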

How to read URLs from a text file one by one to perform

Here's the basic way:


with open("file.txt", "r") as f:
un_parsed_urls = f.readlines()

parsed_urls = [url.replace("\n", "") for url in un_parsed_urls]

for url in parsed_urls:
driver.get(url)
print(driver.title)
print(driver.current_url)
userName = driver.find_element_by_name("name")
userName.send_keys("someguy")

You can use different threads to speed this up:

from threading import Thread

def web_automation(url):
    driver.get(url)
    print(driver.title)
    print(driver.current_url)
    userName = driver.find_element_by_name("name")
    userName.send_keys("someguy")

with open("file.txt", "r") as f:
    un_parsed_urls = f.readlines()

parsed_urls = [url.replace("\n", "") for url in un_parsed_urls]

thread_list = []

for url in parsed_urls:
    t = Thread(target=web_automation, args=[url])
    thread_list.append(t)

for t in thread_list:
    t.start()

for t in thread_list:
    t.join()
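Note that a single WebDriver instance generally isn't safe to share between threads, so for this to work reliably each thread would need its own browser. A rough sketch of that variation (webdriver.Chrome() is just an example; use whichever driver you already have set up):

from threading import Thread
from selenium import webdriver

def web_automation(url):
    driver = webdriver.Chrome()  # each thread gets its own browser instance
    try:
        driver.get(url)
        print(driver.title)
        print(driver.current_url)
        userName = driver.find_element_by_name("name")
        userName.send_keys("someguy")
    finally:
        driver.quit()  # always close the browser, even if the page fails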

