How to parse a table and extract data for the last 6 months with Nokogiri
Generally, it's helpful if you can post some example HTML rather than a screenshot of the page, particularly as this task is about parsing HTML.
Why do you need to check the date beforehand? Nokogiri is pretty fast, and I can't imagine the table is so big that checking as you parse will be useful. Having reviewed the Nokogiri docs, I can't see any way to do what you're describing. You'll need to grab the data from the table, and then reject any rows that have a date older than six months.
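The parse-then-filter idea above can be sketched roughly as follows. The sketch is in Python rather than Ruby/Nokogiri, and it assumes the rows have already been read out of the table as tuples whose first cell is a date in YYYY-MM-DD format. The function names and sample rows are hypothetical.

```python
from datetime import date, datetime


def months_ago(today: date, months: int) -> date:
    """Return the date `months` calendar months before `today`, clamping the day."""
    # Count months from year 0 so divmod handles year boundaries.
    total = today.year * 12 + (today.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Clamp the day for shorter months (e.g. Aug 31 -> Feb 28/29).
    if month == 12:
        next_month_start = date(year + 1, 1, 1)
    else:
        next_month_start = date(year, month + 1, 1)
    last_day = (next_month_start - date(year, month, 1)).days
    return date(year, month, min(today.day, last_day))


def recent_rows(rows, today, months=6, date_format="%Y-%m-%d"):
    """Keep rows whose first cell is a date on or after the cutoff."""
    cutoff = months_ago(today, months)
    kept = []
    for row in rows:
        row_date = datetime.strptime(row[0], date_format).date()
        if row_date >= cutoff:
            kept.append(row)
    return kept


# Hypothetical rows, as they might look after being read from the table.
sample_rows = [
    ("2023-01-15", "old"),
    ("2023-05-31", "borderline"),
    ("2023-09-01", "recent"),
]
recent = recent_rows(sample_rows, today=date(2023, 11, 30))
```

Passing `today` explicitly keeps the filter easy to test. In real use it would be `date.today()`, and `date_format` would need to match whatever the page actually prints.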
How to extract all tables (including where references) within a SQL query?
Here's a more modern way:
SELECT DISTINCT p.name AS proc_name, t.name AS table_name
FROM sys.sql_dependencies d
INNER JOIN sys.procedures p ON p.object_id = d.object_id
INNER JOIN sys.tables t ON t.object_id = d.referenced_major_id
ORDER BY proc_name, table_name
A prettier approach:
SELECT DISTINCT
[object_name] = SCHEMA_NAME(o.[schema_id]) + '.' + o.name
, o.type_desc
FROM sys.dm_sql_referenced_entities ('dbo.usp_test1', 'OBJECT') d
JOIN sys.objects o ON d.referenced_id = o.[object_id]
WHERE o.[type] IN ('U', 'V')
Technically, sys.sql_dependencies (used in the first query) is deprecated in favor of sys.sql_expression_dependencies, so in the future YMMV.
Extract data table from a specific page with multiple same name tables using Python BeautifulSoup
First of all, yes, it would be better to reference the ID: you would expect the developer to have made the ID unique to this table, whereas classes are just style descriptors.
Now, the problem runs deeper. A quick look at the page source shows that the HTML defining the table is commented out a few tags above. I suspect a script 'enables' this code on the client side (in your browser). requests.get, which just pulls the HTML without processing any JavaScript, doesn't catch it (you could check the content of batting_html to verify that).
A very quick and dirty fix would be to catch the commented out code and reprocess it in BeautifulSoup:
from bs4 import BeautifulSoup, Comment
...
# parse input
soup = BeautifulSoup(input_html, "lxml")
dynamic_content = soup.find("div", id="all_expanded_standings_overall")
# the table markup sits inside an HTML comment within this div
comments = dynamic_content.find(string=lambda text: isinstance(text, Comment))
# re-parse the comment text as HTML to get at the table
table = BeautifulSoup(comments, "lxml")
# get headers
By the way, you want to specify utf8 encoding when writing your file ...
with open(out_file_name, "w", encoding="utf8") as out_file:
    writer = csv.writer(out_file)
...
Now that's really 'quick and dirty', and I would dig deeper into the HTML and JavaScript to see what is really happening before scaling this out to other pages.
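To check whether a table really is hidden inside an HTML comment before re-parsing it, a dependency-free sketch along these lines may help. It uses the standard library's html.parser instead of BeautifulSoup. The sample markup and the table id in it are made up for illustration, so the real page's structure may differ.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment in a document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # `data` is the comment body without the <!-- and --> delimiters.
        self.comments.append(data)


def comments_containing_tables(html):
    """Return the comment bodies that appear to contain a <table>."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    return [c for c in collector.comments if "<table" in c.lower()]


# Hypothetical markup imitating a table that is commented out in the raw HTML.
sample_html = """
<div id="all_expanded_standings_overall">
  <!-- just a note -->
  <!--
  <table id="expanded_standings_overall"><tr><th>Tm</th></tr></table>
  -->
</div>
"""
hidden = comments_containing_tables(sample_html)
```

Each string in `hidden` could then be handed to BeautifulSoup, as in the snippet above.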
How to extract data from HtmlTable in C# and arrange in a row?
Since I don't know your specific website, I used the following code to parse an HTML table.
You need to install the HtmlAgilityPack NuGet package.
Code:
// Requires: using System; using System.Collections.Generic; using System.Linq; using System.Net;
WebClient webClient = new WebClient();
string page = webClient.DownloadString("http://www.mufap.com.pk/payout-report.php?tab=01");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
// Skip the header row, keep rows with more than one cell, and collect the trimmed cell text.
List<List<string>> table = doc.DocumentNode.SelectSingleNode("//table[@class='mydata']")
    .Descendants("tr")
    .Skip(1)
    .Where(tr => tr.Elements("td").Count() > 1)
    .Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
    .ToList();
// Join the cells of the first data row into a single line.
string result = string.Empty;
foreach (var item in table[0])
{
    result = result + " " + item;
}
Console.WriteLine(result);
Extract data from a table including files using Scrapy
As you are new to Scrapy, my advice would be:
You can use either the start_urls property or the start_requests() method, but avoid using both in the same code. You can read more about it in the Scrapy documentation. There is no need to iterate through URLs, as you are making the request only once.
Your code is not producing output because your XPath is incorrect.
Code:
import scrapy


class AlabamaSpider(scrapy.Spider):
    name = 'alabama'
    allowed_domains = ['purchasing.alabama.gov']

    def start_requests(self):
        url = 'https://purchasing.alabama.gov/active-statewide-contracts/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for row in response.xpath('//*[@class="table table-bordered table-responsive-sm"]//tbody//tr'):
            yield {
                'Description': row.xpath('normalize-space(./td[@class="col-sm-5"])').extract_first(),
                'T-NBR': row.xpath('td[@class="col-sm-1"]/a//text()').extract_first(),
                'Begin Date': row.xpath('normalize-space(./td[@class="col-sm-1"][2])').extract_first(),
                'End Date': row.xpath('normalize-space(./td[@class="col-sm-1"][3])').extract_first(),
                'Buyer Name': row.xpath('td[@class="col-sm-3"]/a//text()').extract_first(),
                'Vendor Websites': row.xpath('td[@class="col-sm-1"]/label/text()').extract_first(),
            }
How to extract a table from website without specifying the web browser in python
This page uses JavaScript to load the table from https://www.asxenergy.com.au/futures_nz/dataset
The server checks whether the request is an AJAX/XHR request, so it needs the header 'X-Requested-With': 'XMLHttpRequest'
Also, your findAll("div", href=True, ... tries to find <div href="...">, but this page doesn't have one, so I search for a normal <div> with class="market-dataset" instead.
Minimal working code:
import requests
from bs4 import BeautifulSoup
headers = {
    # 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
    'X-Requested-With': 'XMLHttpRequest'
}
URL = "https://www.asxenergy.com.au/futures_nz/dataset"
response = requests.get(URL, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
market_dataset = soup.findAll("div", attrs={'class':'market-dataset'})
print('len(market_dataset):', len(market_dataset))
Result:
len(market_dataset): 10
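For anyone who wants to try the same two ideas without third-party packages, here is a rough standard-library sketch. It covers sending the XHR header and matching `<div>` elements by class rather than by href. The helper names, the UTF-8 decoding, and the sample HTML are assumptions for illustration, not part of the original answer.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

URL = "https://www.asxenergy.com.au/futures_nz/dataset"


def fetch_dataset(url=URL):
    """Fetch the page as an XHR-style request (this makes a network call)."""
    request = Request(url, headers={"X-Requested-With": "XMLHttpRequest"})
    with urlopen(request) as response:
        # Assumption: the response is UTF-8; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")


class DivCounter(HTMLParser):
    """Count <div> tags with a given class, and <div> tags that carry an href."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.with_class = 0
        self.with_href = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attributes = dict(attrs)
        # The class attribute is a space-separated list, so split before comparing.
        classes = (attributes.get("class") or "").split()
        if self.wanted_class in classes:
            self.with_class += 1
        if "href" in attributes:
            self.with_href += 1


def count_divs(html, wanted_class="market-dataset"):
    """Return (divs with the wanted class, divs with an href attribute)."""
    counter = DivCounter(wanted_class)
    counter.feed(html)
    counter.close()
    return counter.with_class, counter.with_href


# Hypothetical fragment standing in for the real response body.
sample_html = """
<div class="market-dataset"><table><tr><td>1</td></tr></table></div>
<div class="market-dataset wide"><table><tr><td>2</td></tr></table></div>
<div class="footer"><a href="/about">About</a></div>
"""
with_class, with_href = count_divs(sample_html)
```

On the sample fragment no `<div>` has an href, which is why filtering on `href=True` comes back empty while filtering on the class does not. Calling `count_divs(fetch_dataset())` would run the same check against the live page.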