Given a URL to a text file, what is the simplest way to read the contents of the text file?
Edit 09/2016: In Python 3 and up use urllib.request instead of urllib2
Actually the simplest way is:
import urllib2 # the lib that handles the url stuff

data = urllib2.urlopen(target_url) # it's a file-like object and works just like a file
for line in data: # files are iterable
    print line
You don't even need "readlines", as Will suggested. You could even shorten it to: *
import urllib2

for line in urllib2.urlopen(target_url):
    print line
But remember in Python, readability matters.
However, while this is the simplest way, it is not the safe way, because with network programming you often don't know whether the amount of data to expect will be respected. So you'd generally better read a fixed and reasonable amount of data: something you know is enough for the data you expect, but that will prevent your script from being flooded:
import urllib2

data = urllib2.urlopen("http://www.google.com").read(20000) # read only 20 000 chars
data = data.split("\n") # then split it into lines
for line in data:
    print line
* Second example in Python 3:
import urllib.request # the lib that handles the url stuff

for line in urllib.request.urlopen(target_url):
    print(line.decode('utf-8')) # utf-8 or iso8859-1 or whatever the page encoding scheme is
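For symmetry, the bounded-read example above can also be written for Python 3. A sketch (in Python 3, urlopen returns bytes, so you have to decode before splitting into lines):

```python
import urllib.request

# Read at most 20,000 bytes so a misbehaving server can't flood the script.
raw = urllib.request.urlopen("http://www.google.com").read(20000)

# urlopen returns bytes in Python 3; decode before splitting into lines.
for line in raw.decode("utf-8", errors="replace").split("\n"):
    print(line)
```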
In Python, given a text file with some data and URLs, what is the simplest way to read only the URLs from the text file?
Ok, I got it, thank you for your help anyway. I put the code here, maybe it can be helpful for someone. It creates a txt with only the URLs:
#!/usr/bin/python
# -*- coding: utf-8 -*-

# read the original text
f = open("yourtextfile.txt", "r")
content = f.read().splitlines()
f.close()

# create the new file to save the urls
f = open("newfile.txt", "w")

# for every line in the text
for line in content:
    contador = 0
    for charac in line:
        # for every character in the line
        if charac == "\t":
            # if the characters right after the \t are "http", copy the url
            # until the next \t (or the end of the line)
            if line[contador + 1:contador + 5] == 'http':
                url = ""
                pos = contador + 1
                while pos < len(line) and line[pos] != "\t":
                    url = url + line[pos]
                    pos = pos + 1
                f.write(url + '\n')
        contador = contador + 1
f.close()
Text file import in Python
The returned error message says:
Your request has been identified as part of a network of automated
tools outside of the acceptable policy and will be managed until
action is taken to declare your traffic. Please declare your
traffic by updating your user agent to include company specific
information.
You can resolve this as follows:
import urllib.request

url = "https://www.sec.gov/Archives/edgar/cik-lookup-data.txt"
hdr = {'User-Agent': 'Your Company Name admin@domain.com'} # change as needed
req = urllib.request.Request(url, headers=hdr)
data = urllib.request.urlopen(req, timeout=60).read().splitlines()
>>> data[:10]
[b'!J INC:0001438823:',
b'#1 A LIFESAFER HOLDINGS, INC.:0001509607:',
b'#1 ARIZONA DISCOUNT PROPERTIES LLC:0001457512:',
b'#1 PAINTBALL CORP:0001433777:',
b'$ LLC:0001427189:',
b'$AVY, INC.:0001655250:',
b'& S MEDIA GROUP LLC:0001447162:',
b'&TV COMMUNICATIONS INC.:0001479357:',
b'&VEST DOMESTIC FUND II KPIV, L.P.:0001802417:',
b'&VEST DOMESTIC FUND II LP:0001800903:']
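Each returned line is a colon-delimited record: company name, then the CIK number, then a trailing colon. A sketch of splitting one record into a (name, cik) pair; the latin-1 decoding and the assumption that the CIK is the last colon-separated field (so names containing colons still work) are mine:

```python
def parse_cik_line(raw_line):
    """Split a b'NAME:CIK:' record into (name, cik).

    Assumes the CIK is the last colon-separated field before the
    trailing colon, so company names containing colons are handled.
    """
    text = raw_line.decode("latin-1").rstrip(":")
    name, _, cik = text.rpartition(":")
    return name, cik

print(parse_cik_line(b'!J INC:0001438823:'))  # ('!J INC', '0001438823')
```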
How to read urls from a text file one by one to perform
Here's the basic way:
with open("file.txt", "r") as f:
    un_parsed_urls = f.readlines()

parsed_urls = [url.replace("\n", "") for url in un_parsed_urls]

for url in parsed_urls:
    driver.get(url)
    print(driver.title)
    print(driver.current_url)
    userName = driver.find_element_by_name("name")
    userName.send_keys("someguy")
You can use different threads to speed this up:

from threading import Thread

def web_automation(url):
    # note: a single WebDriver instance is not thread-safe;
    # for real parallelism each thread should create its own driver
    driver.get(url)
    print(driver.title)
    print(driver.current_url)
    userName = driver.find_element_by_name("name")
    userName.send_keys("someguy")

with open("file.txt", "r") as f:
    un_parsed_urls = f.readlines()

parsed_urls = [url.replace("\n", "") for url in un_parsed_urls]
thread_list = []

for url in parsed_urls:
    t = Thread(target=web_automation, args=[url])
    thread_list.append(t)

for t in thread_list:
    t.start()

for t in thread_list:
    t.join()
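Since a single WebDriver instance is not thread-safe, each worker needs its own driver in practice; independent of that, a bounded pool from concurrent.futures is often easier to manage than raw threads. A sketch with a placeholder worker function (the selenium calls are omitted so the example stays self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

def process_url(url):
    # Placeholder for the per-URL work; in the real script each worker
    # would create (and quit) its own WebDriver instance here.
    return url.upper()

urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

# max_workers bounds how many URLs are processed at once.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(process_url, urls))

print(results)
```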
read urls from a text file
If you have every URL on a new line, then simply open the file, read all the text, and split on "\n" to get a list of lines (without the "\n"):
with open('input.txt') as fh:
    text = fh.read()

all_links = text.split('\n')
or shorter
with open('input.txt') as fh:
    all_links = fh.read().split('\n')
And later you have to use a for-loop to run the code for all the URLs:
# - before loop -
final_data = []

# - loop -
for url in all_links:
    # ... code ...

# - after loop -
print(final_data)
# ... write in csv ...
EDIT:
import requests
from bs4 import BeautifulSoup
import csv

# - before loop -

#all_links = [
#    "https://denver.craigslist.org/search/cto?purveyor-input=owner&postedToday=1",
#]

with open('input.txt') as fh:
    all_links = fh.read().split('\n')

final_data = []

# - loop -
for url in all_links:
    print('url:', url)

    response = requests.get(url)
    #print('[DEBUG] code:', response.status_code)

    soup = BeautifulSoup(response.text, "html.parser")
    all_rows = soup.find_all(class_="result-row")

    for row in all_rows:
        row_links = row.find_all(class_="hdrlnk")  # renamed so it doesn't shadow all_links
        for link in row_links:
            href = link.get("href")
            final_data.append( [href] )
            print(' >', href)

    print('----------')

# - after loop -

#print(final_data)

filename = "output.csv"  # no need to add `./`

with open(filename, "w") as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=",")
    csv_writer.writerow( ["links"] )
    csv_writer.writerows( final_data )  # with `s` at the end
How to get a specific chunk of text from a URL of a text file?
You can use Pandas to read your .xlsx or .csv file and use the apply function over the SECFNAME column. Use the requests library to get the text and avoid saving a local copy of the text to a file. Apply a regex similar to the text you already use in the find function; the caveat here is that an ITEM 8 has to exist. From here you can print to screen or save to a file. From what I've examined, not all text links have an ITEM 7, which is why some items in the list return None.
import pandas as pd
import requests
import re

URL_PREFIX = "https://www.sec.gov/Archives/"
REGEX = r"\nITEM 7\.\s*MANAGEMENT'S DISCUSSION AND ANALYSIS.*?(?=\nITEM 8\.\s)"

def get_section(url):
    source = requests.get(f'{URL_PREFIX}{url}').text  # URL_PREFIX already ends with "/"
    r = re.findall(REGEX, source, re.M | re.DOTALL)
    if r:
        return ''.join(r)

df['has_ITEM7'] = df.SECFNAME.apply(get_section)
hasITEM7_list = df['has_ITEM7'].to_list()
Output from hasITEM7_list
['\nITEM 7. MANAGEMENT\'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS\n OF OPERATION\n\n\nYEAR ENDED DECEMBER 28, 1997 COMPARED TO THE YEAR ENDED DECEMBER 29, 1996\n\n\n In November 1996, the Company initiated a major restructuring and growth\nplan designed to substantially reduce its cost structure and grow the business\nin order to restore higher levels of profitability for the Company. By July\n1997, the Company completed the major phases of the restructuring plan. The\n$225.0 million of annualized cost savings anticipated from the restructuring\nresults primarily from the consolidation of administrative functions within the\nCompany, the rationalization
...
...
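To see what the lookahead regex does in isolation, here is a minimal sketch on a synthetic filing snippet (the filing text is invented for illustration): the match starts at ITEM 7 and stops, without consuming it, at ITEM 8.

```python
import re

REGEX = r"\nITEM 7\.\s*MANAGEMENT'S DISCUSSION AND ANALYSIS.*?(?=\nITEM 8\.\s)"

# Synthetic stand-in for a filing; real SEC filings are much longer.
source = (
    "\nITEM 6. SELECTED DATA\n...\n"
    "\nITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS\nSome discussion text.\n"
    "\nITEM 8. FINANCIAL STATEMENTS\n..."
)

# re.DOTALL lets .*? span newlines; the (?=...) lookahead marks the
# end of the match without including ITEM 8 in it.
sections = re.findall(REGEX, source, re.M | re.DOTALL)
print(sections[0])
```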