Programmatically Searching Google in Python Using Custom Search

This is possible. The setup is not very straightforward, but the end result is that you can search the entire web from Python with a few lines of code.

There are 3 main steps in total, plus a bonus 4th step that runs the search.

1st step: get Google API key

The pygoogle page states:

Unfortunately, Google no longer supports the SOAP API for search, nor
do they provide new license keys. In a nutshell, PyGoogle is pretty
much dead at this point.

You can use their AJAX API instead. Take a look here for sample code:
http://dcortesi.com/2008/05/28/google-ajax-search-api-example-python-code/

... but you actually can't use the AJAX API either. You have to get a Google API key (see https://developers.google.com/api-client-library/python/guide/aaa_apikeys ). For simple experimental use, I suggest a "server key".
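Once you have the key, it is usually safer to keep it out of the source file. The following is a minimal sketch of one way to do that, assuming the key is stored in an environment variable. The variable name GOOGLE_API_KEY and the helper load_api_key are illustrative choices, not something Google prescribes.

```python
import os


def load_api_key(env_var="GOOGLE_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            "Set the {} environment variable to your Google API key".format(env_var)
        )
    return key


# Usage (after `export GOOGLE_API_KEY=...` in your shell):
#     my_api_key = load_api_key()
```

The value it returns can be used wherever the code further down expects my_api_key.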

2nd step: set up a Custom Search Engine so that you can search the entire web

Indeed, the old API is not available. The best new API available is Custom Search. It appears to support searching only within specific domains; however, after following this SO answer you can search the whole web:

  1. From the Google Custom Search homepage ( http://www.google.com/cse/ ), click Create a Custom Search Engine.
  2. Type a name and description for your search engine.
  3. Under Define your search engine, in the Sites to Search box, enter at least one valid URL (for now, just put www.anyurl.com to get past this screen; more on this later).
  4. Select the CSE edition you want and accept the Terms of Service, then click Next. Select the layout option you want, and then click Next.
  5. Click any of the links under the Next steps section to navigate to your Control panel.
  6. In the left-hand menu, under Control Panel, click Basics.
  7. In the Search Preferences section, select Search the entire web but emphasize included sites.
  8. Click Save Changes.
  9. In the left-hand menu, under Control Panel, click Sites.
  10. Delete the site you entered during the initial setup process.

This approach is also recommended by Google: https://support.google.com/customsearch/answer/2631040

3rd step: install Google API client for Python

Run pip install google-api-python-client. More info here:

  • repo: https://github.com/google/google-api-python-client
  • more info: https://developers.google.com/api-client-library/python/apis/customsearch/v1
  • complete docs: https://api-python-client-doc.appspot.com/

4th step (bonus): do the search

So, after setting this up, you can follow the code samples from a few places:

  • simple example: https://github.com/google/google-api-python-client/blob/master/samples/customsearch/main.py

  • cse() function docs: https://google-api-client-libraries.appspot.com/documentation/customsearch/v1/python/latest/customsearch_v1.cse.html

and end up with this:

from googleapiclient.discovery import build
import pprint

my_api_key = "Google API key"
my_cse_id = "Custom Search Engine ID"

def google_search(search_term, api_key, cse_id, **kwargs):
    # Build a client for the Custom Search API and run a single query.
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    # 'items' is missing from the response when there are no results.
    return res.get('items', [])

results = google_search(
    'stackoverflow site:en.wikipedia.org', my_api_key, my_cse_id, num=10)
for result in results:
    pprint.pprint(result)

After some tweaking you could write some functions that behave exactly like your snippet, but I'll skip this step here.
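What that tweaking looks like depends on the shape your own code expects. As one possible sketch, the hypothetical helper summarize_results below reduces each result item to its title, link and snippet fields, which are keys of the result items returned by the Custom Search API. The sample item is made up for illustration and is not real API output.

```python
def summarize_results(items):
    """Keep only the title, link and snippet of each result item."""
    return [
        {
            "title": item.get("title"),      # None if the key is missing
            "link": item.get("link"),
            "snippet": item.get("snippet"),
        }
        for item in items
    ]


# Made-up item, shaped like one entry of the 'items' list.
sample_items = [
    {
        "kind": "customsearch#result",
        "title": "Stack Overflow - Wikipedia",
        "link": "https://en.wikipedia.org/wiki/Stack_Overflow",
        "snippet": "Stack Overflow is a question-and-answer website...",
    }
]

for row in summarize_results(sample_items):
    print(row["title"], "->", row["link"])
```

In real use you would pass the list returned by google_search instead of sample_items.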

How to find Google Search results via a python package

Yes, the Google search API has been deprecated, and with it pygoogle, which was a wrapper around that API. At the top of the search API page, there's a warning, along with:

We encourage you to investigate the Custom Search API, which may
provide an alternative solution.


But using the Custom Search API to search the whole web isn't very straightforward. Here are 2 detailed guides I found (SO answers):

  1. Programmatically searching google in Python using custom search
  • 1st step: get Google API key.
  • 2nd step: set up a Custom Search Engine so that you can search the entire web.
  • 3rd step: install Google API client for Python.
  • 4th step (bonus): do the search.

So, after setting this up, you can follow the code samples from a few
places:

  • simple example: https://github.com/google/google-api-python-client/blob/master/samples/customsearch/main.py
  • cse() function docs: https://google-api-client-libraries.appspot.com/documentation/customsearch/v1/python/latest/customsearch_v1.cse.html


  2. What are the alternatives now that the Google web search API has been deprecated?

Yes, Google Custom Search has now replaced the old Search API, but you
can still use Google Custom Search to search the entire web, although
the steps are not obvious from the Custom Search setup.

To create a Google Custom Search engine that searches the entire web:

  1. From the Google Custom Search homepage ( http://www.google.com/cse/ ), click Create a Custom Search Engine.
  2. Type a name and description for your search engine.
  3. Under Define your search engine, in the Sites to Search box, enter at least one valid URL (for now, just put www.anyurl.com to get past this screen; more on this later).
  4. Select the CSE edition you want and accept the Terms of Service, then click Next. Select the layout option you want, and then click Next.
  5. Click any of the links under the Next steps section to navigate to your Control panel.
  6. In the left-hand menu, under Control Panel, click Basics.
  7. In the Search Preferences section, select Search the entire web but emphasize included sites.
  8. Click Save Changes.
  9. In the left-hand menu, under Control Panel, click Sites.
  10. Delete the site you entered during the initial setup process.

Google Custom Search is not entirely free. Pricing:

  • Custom Search Engine (free). For CSE users, the API provides 100 search queries per day for free. If you need more, you may sign up for billing in the API Console. Additional requests cost $5 per 1000 queries, up to 10k queries per day.
  • Google Site Search (paid). For detailed information on GSS usage limits and quotas, please check GSS pricing options.
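To get a feel for these numbers, here is a rough sketch that estimates the daily cost from the figures quoted above. It assumes that the 100 free queries count toward the 10k daily cap and that billing is prorated per query rather than charged in blocks of 1000. Neither detail is stated in the pricing text, so treat the result as an estimate only. The function name daily_cost_usd is made up for this example.

```python
FREE_QUERIES_PER_DAY = 100      # free quota quoted above
PRICE_PER_1000_USD = 5.0        # price of additional queries quoted above
MAX_QUERIES_PER_DAY = 10000     # daily cap quoted above


def daily_cost_usd(queries):
    """Estimate the cost of `queries` searches made in one day."""
    if queries < 0:
        raise ValueError("queries must not be negative")
    if queries > MAX_QUERIES_PER_DAY:
        raise ValueError("more than the daily cap of 10k queries")
    # Only queries beyond the free quota are billed (assumed prorated).
    billable = max(0, queries - FREE_QUERIES_PER_DAY)
    return billable * PRICE_PER_1000_USD / 1000


for n in (50, 100, 1100):
    print(n, "queries/day ->", daily_cost_usd(n), "USD")
```

Under these assumptions, a day with 1100 queries has 1000 billable queries.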

get more than 10 results Google custom search API

Unfortunately, it is not possible to receive more than 10 results per request from the Google Custom Search API. However, if you want more results, you can make multiple calls, increasing your start parameter by 10 each time.

See this link: https://developers.google.com/custom-search/v1/using_rest#query-params
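The paging idea can be sketched as follows. The helper search_many is hypothetical. It takes any callable search_page(query, start, num) that returns one page of results, so the paging logic can be tried without an API key. The fake_page function below only simulates a search engine with 37 results.

```python
def search_many(search_page, query, total, page_size=10):
    """Collect up to `total` results by requesting pages of at most 10."""
    results = []
    # The `start` parameter is 1-based: 1, 11, 21, ...
    for start in range(1, total + 1, page_size):
        num = min(page_size, total - start + 1)
        page = search_page(query, start=start, num=num)
        results.extend(page)
        if len(page) < num:
            break  # a short page means there are no more results
    return results


def fake_page(query, start, num):
    """Stand-in for a real search call: pretends there are 37 results."""
    data = ["result {}".format(i) for i in range(1, 38)]
    return data[start - 1:start - 1 + num]


print(len(search_many(fake_page, "python", 25)))
```

With a function like google_search(search_term, api_key, cse_id, **kwargs), you could pass a small wrapper such as lambda q, start, num: google_search(q, my_api_key, my_cse_id, start=start, num=num) as search_page. Keep in mind that every page is a separate request and counts against your daily quota.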

Problem with searching multiple keywords using google custom search API

I don't have API keys to run this code, but I see a few mistakes:

When you use

for items in filteredList:

then you get a word from the list, not its index, so you can't compare it with a number.

To get a number you would use

for items in range(len(filteredList)):

But instead of this version, it is better to use the first version and then use items instead of filteredList[items] in

results = google_search(items, my_api_key, my_cse_id, num=5)

If you choose the version with range(len(filteredList)), then don't add 1 to items: you would get the numbers 1..6 instead of 0..5, so you would skip the first element filteredList[0] and never search for the first word. Later you would try to get filteredList[6], which doesn't exist in the list, and that produces your error message.

for word in filteredList:

    results = google_search(word, my_api_key, my_cse_id, num=5)
    print(results)

    for result in results:
        # New dict for every result, so entries in newDictList don't share one object.
        newDict = dict()
        for (key, value) in result.items():
            if key in keyValList:
                newDict[key] = value
        newDictList.append(newDict)

print(newDictList)

BTW: you have to create newDict = dict() on every iteration of the loop, otherwise every entry you append refers to the same dictionary.


BTW: the standard print() and pprint.pprint() only send text to the screen and always return None, so you can't assign the displayed text to a variable. If you need to format text, use string formatting instead.
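A small illustration of that difference, using a made-up result dictionary: print() returns None, while pprint.pformat() and str.format() return strings you can keep in a variable.

```python
import pprint

# Made-up data, only to show the return values.
result = {"title": "Example", "link": "https://example.com"}

returned = print(result)          # shows the dict on screen
print(returned)                   # None: print() has nothing to return

text = pprint.pformat(result)     # same layout as pprint.pprint(), but as a string
line = "{title} -> {link}".format(**result)   # plain string formatting

print(text)
print(line)
```

Both text and line can then be stored, written to a file or combined with other strings.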


EDIT: version with range(len(...)) which is not preferred in Python.

for index in range(len(filteredList)):

    results = google_search(filteredList[index], my_api_key, my_cse_id, num=5)
    print(results)

    for result in results:
        # New dict for every result, so entries in newDictList don't share one object.
        newDict = dict()
        for (key, value) in result.items():
            if key in keyValList:
                newDict[key] = value
        newDictList.append(newDict)

print(newDictList)

EDIT:

from googleapiclient.discovery import build
import requests

API_KEY = "AIzXXX"
CSE_ID = "013XXX"

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    # 'items' is missing from the response when there are no results.
    return res.get('items', [])

words = [
    'Semkir sistem',
    'Evrascon',
    'Baku Electronics',
    'Optimal Elektroniks',
    'Avtostar',
    'Improtex',
    # 'Wayback Machine'
]

filtered_results = list()

keys = ['cacheId', 'link', 'htmlTitle', 'htmlSnippet']

for word in words:
    items = google_search(word, API_KEY, CSE_ID, num=5)

    for item in items:
        #print(item.keys()) # to check if every item has the same keys. It seems some items don't have 'cacheId'

        row = dict() # row of data in final list with results
        for key in keys:
            row[key] = item.get(key) # None if there is no `key` in `item`
            #row[key] = item[key] # ERROR if there is no `key` in `item`

        # generate link to cached page
        if row['cacheId']:
            row['link_cache'] = 'https://webcache.googleusercontent.com/search?q=cache:{}:{}'.format(row['cacheId'], row['link'])
            # TODO: read HTML from `link_cache` and get full text.
            # Maybe module `newspaper` can be useful for some pages.
            # For other pages module `urllib.request` or `requests` can be needed.
            row['html'] = requests.get(row['link_cache']).text
        else:
            row['link_cache'] = None
            row['html'] = ''

        # Check for the word in title, snippet and page text. The word may use upper and lower case chars, so convert everything to lower case.
        # It may not work if the text uses native chars, e.g. Cyrillic.
        lower_word = word.lower()
        if (lower_word in row['htmlTitle'].lower()) or (lower_word in row['htmlSnippet'].lower()) or (lower_word in row['html'].lower()):
            filtered_results.append(row)
        else:
            print('SKIP:', word)
            print(' :', row['link'])
            print(' :', row['htmlTitle'])
            print(' :', row['htmlSnippet'])
            print('-----')

for item in filtered_results:
    print('htmlTitle:', item['htmlTitle'])
    print('link:', item['link'])
    print('cacheId:', item['cacheId'])
    print('link_cache:', item['link_cache'])
    print('part of html:', item['html'][:300])
    print('---')

How to query an advanced search with google customsearch API?

First you need to define a custom search engine as described here, then make sure your my_cse_id matches the Google API custom search engine (cx) ID, e.g.

cx='017576662512468239146:omuauf_lfve'

is a search engine which only searches for domains ending with .com.

Next we need our developerKey.

from googleapiclient.discovery import build
service = build("customsearch", "v1", developerKey=dev_key)

Now we can execute our search.

res = service.cse().list(q=search_term, cx=my_cse_id).execute()

We can add additional search parameters, such as language or country, by using the arguments described here, e.g.

res = service.cse().list(q="the best dog food", cx=my_cse_id, cr="countryUK", lr="lang_en").execute()

would search for "the best dog food" in English and restrict the results to sites from the UK.
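If you set these two parameters often, a tiny helper can build them for you. This is only a sketch: locale_params is a made-up name, and it simply follows the value pattern seen in the examples here ("country" plus a country code for cr, "lang_" plus a language code for lr) without checking that the API supports the given codes.

```python
def locale_params(country_code=None, language_code=None):
    """Build the optional `cr` and `lr` keyword arguments for a search call."""
    params = {}
    if country_code:
        params["cr"] = "country" + country_code     # e.g. "countryUK"
    if language_code:
        params["lr"] = "lang_" + language_code      # e.g. "lang_en"
    return params


print(locale_params("UK", "en"))
```

The returned dictionary can be unpacked into the call, for example service.cse().list(q=search_term, cx=my_cse_id, **locale_params("UK", "en")).execute().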


The following modified code worked for me. api_key was removed since it was never used.

from googleapiclient.discovery import build

my_cse_id = "012156694711735292392:rl7x1k3j0vy"
dev_key = "<Your developer key>"

def google_search(search_term, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=dev_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    # 'items' is missing from the response when there are no results.
    return res.get('items', [])

results = google_search('boxer dogs', my_cse_id, num=10, cr="countryCA", lr="lang_en")
for result in results:
    print(result.get('link'))

Output

http://www.aboxerworld.com/whiteboxerfaqs.htm
http://boxerrescueontario.com/?section=available_dogs
http://www.aboxerworld.com/abouttheboxerbreed.htm
http://m.huffpost.com/ca/entry/10992754
http://rawboxers.com/aboutraw.shtml
http://www.tanoakboxers.com/
http://www.mondlichtboxers.com/
http://www.tanoakboxers.com/puppies/
http://www.landosboxers.com/dogs/puppies/puppies.htm
http://www.boxerrescuequebec.com/

