How to Get All Article Pages Under a Wikipedia Category and Its Sub-Categories

How to get all article pages under a Wikipedia Category and its sub-categories?

The following resource will help you to download all pages from the category and all its subcategories:

http://en.wikipedia.org/wiki/Wikipedia:CatScan

There is also an API available here:

https://www.mediawiki.org/wiki/API:Categorymembers

How to scrape Subcategories and pages in categories of a Category wikipedia page using Python

After more research I found an answer to my own question. Using the urllib.request and json libraries, I requested the category listing from the Wikipedia API in JSON format and printed the titles it contains. Here's the code I used to get the subcategories:

import json
import urllib.request

# Note the "?" between api.php and the query parameters.
pages = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500&cmtype=subcat")
data = json.load(pages)
query = data['query']
category = query['categorymembers']
for x in category:
    print(x['title'])

You can do the same for the pages in a category by using cmtype=page instead of cmtype=subcat. Thanks to Nemo for trying to help me out!
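A single request like the one above returns at most 500 members, so larger categories need follow-up requests. The sketch below shows one way to follow the API's continuation tokens. The function names (fetch_json, category_members) and the User-Agent string are made up for this example, and the fetch argument exists only so the loop can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # The User-Agent value is a placeholder invented for this sketch.
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-example/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def category_members(category, member_type="subcat", fetch=fetch_json):
    """Return all member titles of `category`, following continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": member_type,  # "subcat", "page" or "file"
        "cmlimit": "500",
        "format": "json",
    }
    titles = []
    while True:
        data = fetch(params)
        titles.extend(m["title"] for m in data["query"]["categorymembers"])
        if "continue" not in data:
            return titles
        # Copy the continuation tokens (e.g. cmcontinue) into the next request.
        params = {**params, **data["continue"]}
```

Calling category_members("Category:Class-based programming languages") should then list the subcategories, and passing member_type="page" should list the articles instead.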

How to get the list of all wikipedia categories containing an article?

The Categories query module of the API does that. Use something like

…?action=query&prop=categories&titles=…
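As a rough illustration, the query string above could be assembled and its reply unpacked as follows. The helper names categories_url and extract_categories are hypothetical, the endpoint is assumed to be the English Wikipedia one used elsewhere on this page, and the parsing assumes the API's default JSON layout, where pages are keyed by page ID.

```python
import urllib.parse


def categories_url(title, api="https://en.wikipedia.org/w/api.php"):
    """Build a prop=categories query URL for one article title."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "cllimit": "max",  # ask for as many categories per request as allowed
        "format": "json",
    }
    return api + "?" + urllib.parse.urlencode(params)


def extract_categories(data):
    """Map each page title in a decoded reply to its category titles."""
    result = {}
    for page in data["query"]["pages"].values():
        # Pages without categories have no "categories" key at all.
        result[page["title"]] = [c["title"] for c in page.get("categories", [])]
    return result
```

The decoded reply can come from json.load on the HTTP response, as in the urllib.request example earlier on this page.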

How to find subcategories and subpages on wikipedia using pywikibot?

How can I get the subcategories and (sub)pages of a given category?

First, you have to use the Category class instead of the Page class. You create it in much the same way:

>>> import pywikibot
>>> site = pywikibot.Site("en", "wikipedia")
>>> cat = pywikibot.Category(site, 'Masculine_given_names')

The Category class has additional methods; refer to the documentation for further information and the available parameters. The categoryinfo property, for example, gives a short overview of the category's content:

>>> cat.categoryinfo
{'size': 1425, 'pages': 1336, 'files': 0, 'subcats': 89}

In this case the category has 1425 entries: 1336 pages and 89 subcategories.

To get all subcategories, use the subcategories() method:

>>> gen = cat.subcategories()

Note that this is a generator. As shown below, it yields all of the subcategories reported by categoryinfo above:

>>> len(list(gen))
89
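One consequence of getting a generator back is that it can be consumed only once. The sketch below uses a stand-in generator with made-up category names, not pywikibot itself, to show that behaviour.

```python
import itertools


def fake_subcategories():
    """Stand-in for cat.subcategories(); yields made-up names."""
    for name in ("Category:A", "Category:B", "Category:C"):
        yield name


gen = fake_subcategories()
first_two = list(itertools.islice(gen, 2))  # takes two items, leaves the rest
remaining = list(gen)                       # consumes what is left
after = list(gen)                           # the generator is now exhausted
```

If the results are needed more than once, store them in a list first or call the method again to get a fresh generator.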

To get all pages (articles), use the articles() method, e.g.

>>> gen = cat.articles()

Guess how many entries the corresponding list will have.

Finally there is a method to get all members of the category which includes pages, files and subcategories called members():

>>> gen = cat.members()
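To collect the articles of a category together with those of its subcategories, the two methods above can be combined into a depth-limited walk. The following is a minimal sketch, not pywikibot's own mechanism: walk_category is a hypothetical helper that only assumes each object offers title(), articles() and subcategories(). As far as I know, these pywikibot methods also accept a recurse argument, which may make a manual walk unnecessary; check the documentation.

```python
def walk_category(cat, max_depth=1):
    """Yield article titles under `cat`, descending at most `max_depth` levels."""
    seen_cats = set()
    seen_articles = set()
    stack = [(cat, 0)]
    while stack:
        current, depth = stack.pop()
        name = current.title()
        if name in seen_cats:
            continue  # categories can contain each other, so guard against loops
        seen_cats.add(name)
        for page in current.articles():
            title = page.title()
            if title not in seen_articles:
                # An article can sit in several subcategories; report it once.
                seen_articles.add(title)
                yield title
        if depth < max_depth:
            for sub in current.subcategories():
                stack.append((sub, depth + 1))
```

With the cat object created earlier, list(walk_category(cat, max_depth=2)) would gather titles from the category, its subcategories and their subcategories. Large categories can take many requests, so a small max_depth is a sensible starting point.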

Scraping Wikipedia Subcategories (Pages) with Multiple Depths?

There is a tool called PetScan, hosted by Wikimedia Labs. Type the category title, select the depth you want to reach, and run the query: https://petscan.wmflabs.org/

For details on how it works, see https://meta.m.wikimedia.org/wiki/PetScan/en

Mediawiki API: How do I list all subcategories of a category?

As the documentation mentions, you need to add cmtype=subcat to your query:

https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics&cmtype=subcat


