How to get all article pages under a Wikipedia Category and its sub-categories?
The following resource will help you to download all pages from the category and all its subcategories:
http://en.wikipedia.org/wiki/Wikipedia:CatScan
There is also an API available here:
https://www.mediawiki.org/wiki/API:Categorymembers
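If you would rather script the traversal yourself, here is a minimal sketch of a depth-limited, breadth-first walk. It is not taken from either tool above. The function name collect_pages and the list_members callable are hypothetical; in practice list_members would wrap a categorymembers request, and here it is replaced by a small in-memory stand-in so the example runs offline.

```python
from collections import deque


def collect_pages(root, list_members, max_depth=2):
    """Return the sorted article titles found under `root` down to `max_depth`.

    `list_members(title, cmtype)` is a hypothetical callable that returns a list
    of titles; `cmtype` is "page" or "subcat", as in the categorymembers API.
    """
    seen = {root}                      # categories can reference each other in cycles
    pages = set()                      # an article may sit in several subcategories
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        pages.update(list_members(category, "page"))
        if depth >= max_depth:
            continue                   # do not descend any further
        for sub in list_members(category, "subcat"):
            if sub not in seen:
                seen.add(sub)
                queue.append((sub, depth + 1))
    return sorted(pages)


# In-memory stand-in for the real API (made-up titles, for illustration only).
FAKE_WIKI = {
    "Category:A": {"page": ["P1", "P2"], "subcat": ["Category:B"]},
    "Category:B": {"page": ["P2", "P3"], "subcat": ["Category:A", "Category:C"]},
    "Category:C": {"page": ["P4"], "subcat": []},
}


def fake_list_members(title, cmtype):
    return FAKE_WIKI.get(title, {}).get(cmtype, [])


print(collect_pages("Category:A", fake_list_members, max_depth=1))
# prints ['P1', 'P2', 'P3']
```

The seen set matters on real data, because Wikipedia categories are not a strict tree and a naive recursion can loop.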
How to scrape Subcategories and pages in categories of a Category wikipedia page using Python
After more research I was able to find an answer to my own question. Using the urllib.request and json libraries, I requested the category members from the Wikipedia API in JSON format and printed their titles. Here's the code I used to get the subcategories:
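import json
import urllib.request

# Note the "?" between api.php and the query string.
url = "https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500&cmtype=subcat"
pages = urllib.request.urlopen(url)
data = json.load(pages)
query = data['query']
category = query['categorymembers']
# Each member is a dict; 'title' holds the subcategory name.
for x in category:
    print(x['title'])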
You can do the same thing for the pages in a category by changing cmtype=subcat to cmtype=page. Thanks to Nemo for trying to help me out!
How to get the list of all wikipedia categories containing an article?
The Categories query module of the API does that. Use something like
…?action=query&prop=categories&titles=…
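As a rough sketch of how such a request could be built and its JSON answer read in Python: the helper names categories_url and extract_categories are made up for this example, and SAMPLE is a hand-written illustration of the default JSON layout, not real API output.

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def categories_url(title, limit=500):
    """Build a prop=categories query URL for one article title."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "cllimit": limit,   # categories returned per request
        "format": "json",
    }
    return API + "?" + urlencode(params)


def extract_categories(data):
    """Map each page title in the response to the list of its category titles."""
    result = {}
    for page in data.get("query", {}).get("pages", {}).values():
        result[page["title"]] = [c["title"] for c in page.get("categories", [])]
    return result


# Hand-written sample in the shape of the default JSON format (illustrative only).
SAMPLE = json.loads("""
{"query": {"pages": {"1": {"pageid": 1, "ns": 0, "title": "Example article",
  "categories": [{"ns": 14, "title": "Category:Example category"},
                 {"ns": 14, "title": "Category:Another category"}]}}}}
""")

print(categories_url("Example article"))
print(extract_categories(SAMPLE))
```

Fetching the URL and passing the decoded JSON to extract_categories would give the categories of a real article; long category lists may need the continuation values returned by the API.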
How to find subcategories and subpages on wikipedia using pywikibot?
How can I get the subcategories and (sub)-pages of a given category?
First you have to use a Category class instead of a Page class. You create it in much the same way:
>>> import pywikibot
>>> site = pywikibot.Site("en", "wikipedia")
>>> cat = pywikibot.Category(site, 'Masculine_given_names')
A Category class has additional methods; refer to the documentation for further information and the available parameters. The categoryinfo property, for example, gives a short overview of the category content:
>>> cat.categoryinfo
{'size': 1425, 'pages': 1336, 'files': 0, 'subcats': 89}
In this case the category has 1425 entries: 1336 pages and 89 subcategories.
To get all subcategories use the subcategories() method:
>>> gen = cat.subcategories()
Note, this is a generator. As shown below you will get all of them as found in categoryinfo above:
>>> len(list(gen))
89
To get all pages (articles) you have to use the articles() method, e.g.
>>> gen = cat.articles()
Guess how many entries the corresponding list will have.
Finally there is a method called members() to get all members of the category, which includes pages, files and subcategories:
>>> gen = cat.members()
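To see how members() relates to the categoryinfo numbers, here is a small sketch that sorts members by namespace. It assumes that each member has a namespace() method whose result compares equal to the MediaWiki namespace numbers (14 for categories, 6 for files); please check that against your pywikibot version. The helper name summarize_members is made up for this example.

```python
CATEGORY_NS = 14   # MediaWiki namespace number for categories
FILE_NS = 6        # MediaWiki namespace number for files


def summarize_members(members):
    """Count members by kind, using the same keys as categoryinfo.

    Assumption: member.namespace() compares equal to a plain int.
    """
    counts = {"size": 0, "pages": 0, "files": 0, "subcats": 0}
    for member in members:
        ns = member.namespace()
        counts["size"] += 1
        if ns == CATEGORY_NS:
            counts["subcats"] += 1
        elif ns == FILE_NS:
            counts["files"] += 1
        else:
            counts["pages"] += 1
    return counts
```

If the assumption holds, summarize_members(cat.members()) should give the same numbers as cat.categoryinfo, at the cost of iterating over every member.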
Scraping Wikipedia Subcategories (Pages) with Multiple Depths?
There is a tool called PetScan, hosted by Wikimedia Labs. Type in the category title, select the depth you want to reach, and run the query: https://petscan.wmflabs.org/
Also, see how it works: https://meta.m.wikimedia.org/wiki/PetScan/en
Mediawiki API: How do I list all subcategories of a category?
As the documentation mentions, you need to add cmtype=subcat to your query:
https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics&cmtype=subcat
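Without cmlimit the API returns only 10 members per request, and large categories need more than one request even with cmlimit=500. Below is a minimal sketch of following the continue values the API sends back. The names fetch_json and list_members, and the User-Agent string, are placeholders chosen for this example; the fetch parameter exists only so the loop can be tried without network access.

```python
import json
import urllib.request
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the API and decode the JSON reply (needs network)."""
    request = urllib.request.Request(
        API + "?" + urlencode(params),
        # Placeholder User-Agent; replace it with something that identifies your script.
        headers={"User-Agent": "category-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_members(category, cmtype="subcat", fetch=fetch_json):
    """Return all member titles of `category`, following API continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": cmtype,      # "subcat", "page" or "file"
        "cmlimit": 500,
        "format": "json",
    }
    titles = []
    while True:
        data = fetch(params)
        titles += [m["title"] for m in data["query"]["categorymembers"]]
        if "continue" not in data:
            return titles
        # Merge the continuation values (e.g. cmcontinue) into the next request.
        params = {**params, **data["continue"]}
```

For example, list_members("Category:Physics") would request the subcategories of Category:Physics, and list_members("Category:Physics", cmtype="page") its articles.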