How to Parse This Table and Extract Data from It

How to parse a table and extract data for the last 6 months with Nokogiri

Generally, it's helpful if you can post some example HTML rather than a screenshot of the page, particularly as this task is about parsing HTML.

Why do you need to check the date beforehand? Nokogiri is pretty fast, and I can't imagine the table is so big that checking as you parse will be useful. Having reviewed the Nokogiri docs, I can't see any way to do what you're describing. You'll need to grab the data from the table, and then reject any rows that have a date older than six months.
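To illustrate the parse-then-filter idea, here's a minimal sketch. It's written in Python with BeautifulSoup to match the other examples on this page (Nokogiri is Ruby, but the flow is identical: select the rows, parse each row's date, drop the stale ones); the date column position and format are assumptions for illustration.

# A sketch of "parse everything, then reject old rows" (Python/BeautifulSoup;
# the column position and date format are assumptions for illustration).
from datetime import datetime, timedelta
from bs4 import BeautifulSoup

today = datetime.now()
html = f"""
<table>
  <tr><td>{(today - timedelta(days=30)).date()}</td><td>recent entry</td></tr>
  <tr><td>{(today - timedelta(days=400)).date()}</td><td>old entry</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cutoff = today - timedelta(days=183)  # roughly six months

recent_rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if datetime.strptime(cells[0], "%Y-%m-%d") >= cutoff:
        recent_rows.append(cells)

print(recent_rows)  # only the row from the last six months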

How to extract all tables (including where they are referenced) within a SQL query?

Here's a more modern way:

SELECT DISTINCT p.name AS proc_name, t.name AS table_name
FROM sys.sql_dependencies d
INNER JOIN sys.procedures p ON p.object_id = d.object_id
INNER JOIN sys.tables t ON t.object_id = d.referenced_major_id
ORDER BY proc_name, table_name

A prettier approach:

SELECT DISTINCT 
[object_name] = SCHEMA_NAME(o.[schema_id]) + '.' + o.name
, o.type_desc
FROM sys.dm_sql_referenced_entities ('dbo.usp_test1', 'OBJECT') d
JOIN sys.objects o ON d.referenced_id = o.[object_id]
WHERE o.[type] IN ('U', 'V')

Technically sys.sql_dependencies (used in the first query) is deprecated in favor of sys.sql_expression_dependencies, so in the future YMMV.

Extract a data table from a specific page with multiple same-name tables using Python BeautifulSoup

First of all, yes, it would be better to reference the ID, as you would suspect the developer has made this ID unique to this table, whereas classes are just style descriptors.

Now, the problem runs deeper. A quick look at the page source actually shows that the HTML that defines the table is commented out a few tags above. I suspect a script 'enables' this code on the client side (in your browser). requests.get, which just pulls down the HTML without processing any JavaScript, doesn't catch it (you could check the content of batting_html to verify that).

A very quick and dirty fix would be to catch the commented-out code and reprocess it with BeautifulSoup:

from bs4 import BeautifulSoup, Comment
...

# parse input
soup = BeautifulSoup(input_html, "lxml")
# the wrapper div a few tags above the commented-out table
dynamic_content = soup.find("div", id="all_expanded_standings_overall")
# grab the comment node that holds the table's HTML
comments = dynamic_content.find(string=lambda text: isinstance(text, Comment))
# re-parse the comment's text as HTML
table = BeautifulSoup(comments, "lxml")

# get headers
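The snippet above breaks off at the # get headers step. A minimal continuation, assuming the re-parsed table uses plain th/td markup (an assumption, not verified against the page), could look like:

# Hypothetical continuation: the th/td layout is an assumption about the page.
headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find("td")  # skip header-only rows
]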

By the way, you want to specify UTF-8 encoding when writing your file:

# newline="" is also recommended by the csv docs, to avoid blank rows on Windows
with open(out_file_name, "w", encoding="utf8", newline="") as out_file:
    writer = csv.writer(out_file)
    ...

Now that's really 'quick and dirty' and I would try to check deeper into the html code and javascript what is really happening before scaling this out to other pages.

How to extract data from an HtmlTable in C# and arrange it in a row?

Since I don't know your specific website, I used the following code to parse the HTML table.

You need to install the NuGet package HtmlAgilityPack.

Code:

WebClient webClient = new WebClient();
string page = webClient.DownloadString("http://www.mufap.com.pk/payout-report.php?tab=01");

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);

// each data row becomes a list of cell strings; Skip(1) drops the header row
List<List<string>> table = doc.DocumentNode.SelectSingleNode("//table[@class='mydata']")
    .Descendants("tr")
    .Skip(1)
    .Where(tr => tr.Elements("td").Count() > 1)
    .Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
    .ToList();

// join the first row's cells into one space-separated string
string result = string.Empty;
foreach (var item in table[0])
{
    result = result + " " + item;
}
Console.WriteLine(result);

(The original answer showed screenshots here: the first row of the table on the site, and the concatenated string the code prints.)

Extract data from a table including files using Scrapy

As you are new to Scrapy, my advice would be:

  • You can either use the start_urls property or the start_requests() method, but avoid using both in the same spider. You can read more about this in the Scrapy documentation; a sketch of the start_urls form follows the code below.

  • No need to iterate through URLs, as you are making the request only one time.

  • Your code is not producing the output because your XPath is incorrect.

Code

import scrapy

class AlabamaSpider(scrapy.Spider):
    name = 'alabama'
    allowed_domains = ['purchasing.alabama.gov']

    def start_requests(self):
        url = 'https://purchasing.alabama.gov/active-statewide-contracts/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # one <tr> per contract in the bordered table's body
        for row in response.xpath('//*[@class="table table-bordered table-responsive-sm"]//tbody//tr'):
            yield {
                'Description': row.xpath('normalize-space(./td[@class="col-sm-5"])').extract_first(),
                'T-NBR': row.xpath('td[@class="col-sm-1"]/a//text()').extract_first(),
                'Begin Date': row.xpath('normalize-space(./td[@class="col-sm-1"][2])').extract_first(),
                'End Date': row.xpath('normalize-space(./td[@class="col-sm-1"][3])').extract_first(),
                'Buyer Name': row.xpath('td[@class="col-sm-3"]/a//text()').extract_first(),
                'Vendor Websites': row.xpath('td[@class="col-sm-1"]/label/text()').extract_first(),
            }
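For reference, the start_urls alternative from the first bullet would look like this; use one mechanism or the other, not both:

import scrapy

class AlabamaSpider(scrapy.Spider):
    name = 'alabama'
    allowed_domains = ['purchasing.alabama.gov']
    # Scrapy builds the initial requests from this list and passes
    # each response to parse() automatically.
    start_urls = ['https://purchasing.alabama.gov/active-statewide-contracts/']

    def parse(self, response):
        ...  # same row-by-row extraction as above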

How to extract a table from website without specifying the web browser in python

This page uses JavaScript to load the table from https://www.asxenergy.com.au/futures_nz/dataset

The server checks whether it is an AJAX/XHR request, so the request needs the header

 'X-Requested-With': 'XMLHttpRequest' 

But your findAll("div", href=True, ...) tries to find <div href="...">, which this page doesn't have - so I search for normal <div> elements with class="market-dataset" instead.


Minimal working code:

import requests
from bs4 import BeautifulSoup

headers = {
    # 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
    'X-Requested-With': 'XMLHttpRequest'
}

URL = "https://www.asxenergy.com.au/futures_nz/dataset"
response = requests.get(URL, headers=headers)

soup = BeautifulSoup(response.content, "html.parser")
market_dataset = soup.findAll("div", attrs={'class':'market-dataset'})
print('len(market_dataset):', len(market_dataset))

Result:

len(market_dataset): 10
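From here you would still need to turn those divs into rows. A minimal sketch, assuming each market-dataset div wraps an ordinary <table> (I haven't checked the inner markup):

# Assumption: each 'market-dataset' div contains a plain <table>.
first_table = market_dataset[0].find("table")
if first_table is not None:
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in first_table.find_all("tr")
    ]
    print(rows[:3])  # first few rows as lists of cell strings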

