Given a URL to a text file, what is the simplest way to read the contents of the text file?
Edit 09/2016: In Python 3 and up use urllib.request instead of urllib2
Actually the simplest way is:
import urllib2 # the lib that handles the url stuff

data = urllib2.urlopen(target_url) # it's a file-like object and works just like a file
for line in data: # files are iterable
    print line
You don't even need "readlines", as Will suggested. You could even shorten it to: *
import urllib2

for line in urllib2.urlopen(target_url):
    print line
But remember in Python, readability matters.
However, while this is the simplest way, it is not the safe way, because with network programming you often don't know whether the amount of data to expect will be respected. So you'd generally better read a fixed and reasonable amount of data: something you know is enough for the data you expect, but that will prevent your script from being flooded:
import urllib2

data = urllib2.urlopen("http://www.google.com").read(20000) # read only 20 000 chars
data = data.split("\n") # then split it into lines
for line in data:
    print line
* Second example in Python 3:
import urllib.request # the lib that handles the url stuff

for line in urllib.request.urlopen(target_url):
    print(line.decode('utf-8')) # utf-8 or iso8859-1 or whatever the page encoding scheme is
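For symmetry, the bounded-read example above can also be written for Python 3. A sketch (in Python 3, urlopen returns bytes, so you have to decode before splitting into lines):

```python
import urllib.request

# Read at most 20,000 bytes so a misbehaving server can't flood the script.
raw = urllib.request.urlopen("http://www.google.com").read(20000)

# urlopen returns bytes in Python 3; decode before splitting into lines.
for line in raw.decode("utf-8", errors="replace").split("\n"):
    print(line)
```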
In Python, given a text file with some data and URLs, what is the simplest way to read only the URLs from the text file?
Ok, I got it, thank you for your help anyway. I put the code here, maybe it can be helpful for someone. It creates a txt with only the URLs:
#!/usr/bin/python
# -*- coding: utf-8 -*-

# read the original text
f = open("yourtextfile.txt", "r")
content = f.read().splitlines()
f.close()

# create the new file to save the urls
f = open("newfile.txt", "w")

# for every line in the text
for line in content:
    contador = 0
    for charac in line:
        # for every character in the line
        if charac == "\t":
            # if the characters right after the \t are "http", copy the url
            # until the next \t (or the end of the line)
            if line[contador + 1:contador + 5] == 'http':
                url = ""
                pos = contador + 1
                while pos < len(line) and line[pos] != "\t":
                    url = url + line[pos]
                    pos = pos + 1
                f.write(url + '\n')
        contador = contador + 1
f.close()
Text file import in Python
The returned error message says:
Your request has been identified as part of a network of automated
tools outside of the acceptable policy and will be managed until
action is taken to declare your traffic. Please declare your
traffic by updating your user agent to include company specific
information.
You can resolve this as follows:
import urllib.request

url = "https://www.sec.gov/Archives/edgar/cik-lookup-data.txt"
hdr = {'User-Agent': 'Your Company Name admin@domain.com'} # change as needed
req = urllib.request.Request(url, headers=hdr)
data = urllib.request.urlopen(req, timeout=60).read().splitlines()
>>> data[:10]
[b'!J INC:0001438823:',
b'#1 A LIFESAFER HOLDINGS, INC.:0001509607:',
b'#1 ARIZONA DISCOUNT PROPERTIES LLC:0001457512:',
b'#1 PAINTBALL CORP:0001433777:',
b'$ LLC:0001427189:',
b'$AVY, INC.:0001655250:',
b'& S MEDIA GROUP LLC:0001447162:',
b'&TV COMMUNICATIONS INC.:0001479357:',
b'&VEST DOMESTIC FUND II KPIV, L.P.:0001802417:',
b'&VEST DOMESTIC FUND II LP:0001800903:']
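Each returned line is a colon-delimited record: company name, then the CIK number, then a trailing colon. A sketch of splitting one record into a (name, cik) pair; the latin-1 decoding and the assumption that the CIK is the last colon-separated field (so names containing colons still work) are mine:

```python
def parse_cik_line(raw_line):
    """Split a b'NAME:CIK:' record into (name, cik).

    Assumes the CIK is the last colon-separated field before the
    trailing colon, so company names containing colons are handled.
    """
    text = raw_line.decode("latin-1").rstrip(":")
    name, _, cik = text.rpartition(":")
    return name, cik

print(parse_cik_line(b'!J INC:0001438823:'))  # ('!J INC', '0001438823')
```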
How to read urls from a text file one by one to perform
Here's the basic way:
with open("file.txt", "r") as f:
    un_parsed_urls = f.readlines()

parsed_urls = [url.replace("\n", "") for url in un_parsed_urls]

for url in parsed_urls:
    driver.get(url)
    print(driver.title)
    print(driver.current_url)
    userName = driver.find_element_by_name("name")
    userName.send_keys("someguy")
You can use different threads to speed this up:

from threading import Thread

def web_automation(url):
    # note: a single WebDriver instance is not thread-safe;
    # for real parallelism each thread should create its own driver
    driver.get(url)
    print(driver.title)
    print(driver.current_url)
    userName = driver.find_element_by_name("name")
    userName.send_keys("someguy")

with open("file.txt", "r") as f:
    un_parsed_urls = f.readlines()

parsed_urls = [url.replace("\n", "") for url in un_parsed_urls]
thread_list = []

for url in parsed_urls:
    t = Thread(target=web_automation, args=[url])
    thread_list.append(t)

for t in thread_list:
    t.start()

for t in thread_list:
    t.join()
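Since a single WebDriver instance is not thread-safe, each worker needs its own driver in practice; independent of that, a bounded pool from concurrent.futures is often easier to manage than raw threads. A sketch with a placeholder worker function (the selenium calls are omitted so the example stays self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

def process_url(url):
    # Placeholder for the per-URL work; in the real script each worker
    # would create (and quit) its own WebDriver instance here.
    return url.upper()

urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

# max_workers bounds how many URLs are processed at once.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(process_url, urls))

print(results)
```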
read urls from a text file
If you have every URL on a new line, then simply open the file, read all the text, and split on "\n" to get a list of lines (without the "\n"):
with open('input.txt') as fh:
    text = fh.read()

all_links = text.split('\n')
or shorter
with open('input.txt') as fh:
    all_links = fh.read().split('\n')
And later you have to use a for-loop to run the code for all the URLs:
# - before loop -
final_data = []

# - loop -
for url in all_links:
    # ... code ...

# - after loop -
print(final_data)
# ... write in csv ...
EDIT:
import requests
from bs4 import BeautifulSoup
import csv

# - before loop -

#all_links = [
#    "https://denver.craigslist.org/search/cto?purveyor-input=owner&postedToday=1",
#]

with open('input.txt') as fh:
    all_links = fh.read().split('\n')

final_data = []

# - loop -
for url in all_links:
    print('url:', url)

    response = requests.get(url)
    #print('[DEBUG] code:', response.status_code)

    soup = BeautifulSoup(response.text, "html.parser")
    all_rows = soup.find_all(class_="result-row")

    for row in all_rows:
        row_links = row.find_all(class_="hdrlnk")  # renamed so it doesn't shadow all_links
        for link in row_links:
            href = link.get("href")
            final_data.append( [href] )
            print(' >', href)

    print('----------')

# - after loop -

#print(final_data)

filename = "output.csv"  # no need to add `./`

with open(filename, "w") as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=",")
    csv_writer.writerow( ["links"] )
    csv_writer.writerows( final_data )  # with `s` at the end
How to get a specific chunk of text from a URL of a text file?
You can use Pandas to read your .xlsx or .csv file and use the apply function over the SECFNAME column. Use the requests library to get the text and avoid saving a local copy of the text to a file. Apply a regex similar to the text you already use in the find function; the caveat here is that an ITEM 8 has to exist. From here you can print to screen or save to a file. From what I've examined, not all text links have an ITEM 7, which is why some items in the list return None.
import pandas as pd
import requests
import re

URL_PREFIX = "https://www.sec.gov/Archives/"
REGEX = r"\nITEM 7\.\s*MANAGEMENT'S DISCUSSION AND ANALYSIS.*?(?=\nITEM 8\.\s)"

def get_section(url):
    source = requests.get(f'{URL_PREFIX}{url}').text  # URL_PREFIX already ends with "/"
    r = re.findall(REGEX, source, re.M | re.DOTALL)
    if r:
        return ''.join(r)

df['has_ITEM7'] = df.SECFNAME.apply(get_section)
hasITEM7_list = df['has_ITEM7'].to_list()
Output from hasITEM7_list
['\nITEM 7. MANAGEMENT\'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS\n OF OPERATION\n\n\nYEAR ENDED DECEMBER 28, 1997 COMPARED TO THE YEAR ENDED DECEMBER 29, 1996\n\n\n In November 1996, the Company initiated a major restructuring and growth\nplan designed to substantially reduce its cost structure and grow the business\nin order to restore higher levels of profitability for the Company. By July\n1997, the Company completed the major phases of the restructuring plan. The\n$225.0 million of annualized cost savings anticipated from the restructuring\nresults primarily from the consolidation of administrative functions within the\nCompany, the rationalization
...
...
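To see what the lookahead regex does in isolation, here is a minimal sketch on a synthetic filing snippet (the filing text is invented for illustration): the match starts at ITEM 7 and stops, without consuming it, at ITEM 8.

```python
import re

REGEX = r"\nITEM 7\.\s*MANAGEMENT'S DISCUSSION AND ANALYSIS.*?(?=\nITEM 8\.\s)"

# Synthetic stand-in for a filing; real SEC filings are much longer.
source = (
    "\nITEM 6. SELECTED DATA\n...\n"
    "\nITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS\nSome discussion text.\n"
    "\nITEM 8. FINANCIAL STATEMENTS\n..."
)

# re.DOTALL lets .*? span newlines; the (?=...) lookahead marks the
# end of the match without including ITEM 8 in it.
sections = re.findall(REGEX, source, re.M | re.DOTALL)
print(sections[0])
```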