Python - Download Images from Google Image Search

Download Images from google Image search not working

I found a way to fix the error by using the CLI instead of a Jupyter Notebook file. Here are the steps:

  1. The first issue is probably how to uninstall the dependencies. Since I had used python setup.py install, I had to uninstall them manually. Luckily, I found the question "python setup.py uninstall", whose highest-voted answer lays out the manual uninstall procedure.
  2. I then cloned the repo again into a new folder by running git clone https://github.com/Joeclinton1/google-images-download.git in the CLI (the leading ! is only needed inside a notebook cell), and moved into it with cd google-images-download.
  3. After that, I reinstalled the packages using pip install . and NOT python setup.py install.
  4. Use the CLI to download images by following the repo instructions at https://google-images-download.readthedocs.io/en/latest/examples.html.

This worked for me and will hopefully work for you. Note: I cloned the repo into a new folder on the desktop for ease. The downloaded images end up inside the same repo folder, i.e. google-images-download, under the Download folder.
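Step 4 can also be prepared from Python. This is only a minimal sketch: it assumes the package installed in step 3 provides the googleimagesdownload command with the --keywords and --limit options shown on the linked examples page, and build_download_command is a hypothetical helper name, not part of the library.

```python
import shlex


def build_download_command(keywords, limit):
    """Return the argument list for the googleimagesdownload CLI (hypothetical helper)."""
    return [
        "googleimagesdownload",
        "--keywords", ", ".join(keywords),
        "--limit", str(limit),
    ]


command = build_download_command(["polar bears", "beaches"], limit=5)
# Print a copy-pasteable version of the command for the terminal.
print(shlex.join(command))
# To run it from Python instead (needs network access and the installed package):
# import subprocess; subprocess.run(command, check=True)
```

Building the command as a list keeps keywords that contain spaces intact without manual quoting.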

How to download google image search results in Python

Use the Google Custom Search for what you want to achieve.
See @i08in's answer to "Python - Download Images from google Image search?"; it has a great description, script samples and library references.

Python: the right URL to download pictures from Google Image Search

I'll give you a hint ... start here:

https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=JULIE%20NEWMAR

Where JULIE and NEWMAR are your search terms.

That will return the json data you need ... you'll need to parse that using json.load or simplejson.load to get back a dict ... followed by diving into it to find first the responseData, then the results list which contains the individual items whose url you will then want to download.
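The traversal described above can be sketched offline. This assumes a response shaped exactly as described (a responseData object holding a results list whose items carry a url field). The sample payload and the helper name extract_image_urls are made up for illustration, and no request is sent to the endpoint.

```python
import json

# Made-up payload that mirrors the structure described above.
sample_response = """
{
  "responseData": {
    "results": [
      {"url": "http://example.com/one.jpg"},
      {"url": "http://example.com/two.jpg"}
    ]
  }
}
"""


def extract_image_urls(raw_json):
    """Parse the JSON text and collect the url of every result item."""
    data = json.loads(raw_json)
    # responseData may be missing or null, so fall back to an empty dict.
    response_data = data.get("responseData") or {}
    return [item["url"] for item in response_data.get("results", []) if "url" in item]


print(extract_image_urls(sample_response))
```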

Though I don't suggest in any way doing automated scraping of Google, since their (deprecated) API for this specifically says not to.

Scraping Google image search Python (requests, beautifulsoup)

You don't have to render JavaScript in order to get more images, as wishmaster mentioned. There is a URL parameter, ijn, for this: e.g. ijn=0 covers the first 100 images, ijn=1 takes you up to 200 images, and so on.
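As a rough illustration of the ijn parameter, the sketch below only builds the search URLs, one per page number, without sending any request. The helper name build_page_urls is hypothetical; the other query parameters (q, tbm, hl) are the ones used in the full example later in this answer.

```python
from urllib.parse import urlencode


def build_page_urls(query, pages):
    """Build one Google Images search URL per ijn page number (hypothetical helper)."""
    urls = []
    for page in range(pages):
        params = {"q": query, "tbm": "isch", "hl": "en", "ijn": str(page)}
        urls.append("https://www.google.com/search?" + urlencode(params))
    return urls


for url in build_page_urls("pexels cat", pages=2):
    print(url)
```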

To scrape the full-res image URL with requests and beautifulsoup you need to scrape data from the page source code via regex.

Find all <script> tags:

all_script_tags = soup.select('script')

Match images data via regex:

matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

Match desired images (full res size) via regex:

matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                   matched_images_data)

Extract and decode them using bytes() and decode():

for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
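To see the matching and the double decoding in isolation, here is a small offline sketch. The sample_data string is made up to imitate a doubly escaped image entry (it is not real Google output), and decode_twice is a hypothetical helper wrapping the two decode calls from the loop above.

```python
import re

# Made-up fragment imitating the escaped image entries found in the page source.
sample_data = r""",,["https://example.com/photo.jpg?cs\\u003dsrgb",800,600]"""

# Same pattern as above: a URL followed by two numbers inside square brackets.
pattern = r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]"


def decode_twice(escaped):
    """Apply the unicode-escape decoding two times, as in the loop above."""
    once = bytes(escaped, "ascii").decode("unicode-escape")
    return bytes(once, "ascii").decode("unicode-escape")


decoded_urls = [decode_twice(match) for match in re.findall(pattern, sample_data)]
print(decoded_urls)
```

The first pass turns the doubled backslash into a single one, and the second pass turns the remaining \u003d escape into a plain = character.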

If you need to save them, you have two easy options: urllib.request.urlretrieve or requests.

To save images via urllib.request.urlretrieve(url, filename) (more in-depth):

import urllib.request

# urlretrieve often fails with a 404 error unless a User-Agent header is sent
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
urllib.request.install_opener(opener)

# the folder path is optional; without it the file is saved in the current working directory
urllib.request.urlretrieve(original_size_img, 'LOCAL_FOLDER_NAME/YOUR_IMAGE_NAME.jpg')
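If installing a global opener feels too heavy, one alternative is to attach the header to each request. This is a sketch under that assumption, not the method used above: build_image_request and save_image are hypothetical helper names, and the User-Agent value is a shortened placeholder.

```python
import shutil
import urllib.request

# Shortened placeholder; any browser-like User-Agent string can go here.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_image_request(url):
    """Attach the User-Agent to this one request instead of installing a global opener."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def save_image(url, filename):
    """Stream the response body into filename."""
    with urllib.request.urlopen(build_image_request(url), timeout=10) as response:
        with open(filename, "wb") as out_file:
            shutil.copyfileobj(response, out_file)


request = build_image_request("https://example.com/cat.jpg")
# urllib stores header names with only the first letter capitalized.
print(request.get_header("User-agent"))
```

A call such as save_image(original_size_img, "cat.jpg") would then download a single image without changing the behaviour of other urllib calls in the program.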

To save images via requests (code taken from this answer):

import requests

url = "YOUR_IMG.jpg"
response = requests.get(url)
if response.status_code == 200:
    with open("/YOUR/PATH/TO_IMAGE/sample_img.jpg", 'wb') as f:
        f.write(response.content)

Code to scrape and download full-res images and full example in the online IDE:

import requests, lxml, re, json, urllib.request
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "pexels cat",
    "tbm": "isch",
    "hl": "en",
    "ijn": "0",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')

def get_images_data():

    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # these steps could be refactored into something more compact
    all_script_tags = soup.select('script')

    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try to json.loads() without json.dumps it will throw an error:
    # "Expecting property name enclosed in double quotes"
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')

        # after the first decoding, Unicode escapes are still present; the second pass decodes them
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # remove the previously matched thumbnails so full-resolution images are easier to match
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)

    # install the opener with a User-Agent once, before the download loop
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
    urllib.request.install_opener(opener)

    print('\nDownloading Google Full Resolution Images:')  # in order
    for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)

        # ------------------------------------------------
        # Download original images

        # print(f'Downloading {index} image...')

        # the Images folder must already exist
        urllib.request.urlretrieve(original_size_img, f'Images/original_size_img_{index}.jpg')

get_images_data()

-------------
'''
Google Images Metadata:
9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
pexels.com
https://www.pexels.com/search/cat/
...

Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR2cZsuRkkLWXOIsl9BZzbeaCcI0qav7nenDvvqi-YSm4nVJZYyljRsJZv6N5vS8hMNU_w&usqp=CAU
...

Full Resolution Images:
https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?cs=srgb&dl=pexels-evg-culture-1170986.jpg&fm=jpg
https://images.pexels.com/photos/3777622/pexels-photo-3777622.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
...
'''

Alternatively, you can achieve the same thing by using Google Images API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to deal with regex to match and extract the needed data from the page source. Instead, you only need to iterate over structured JSON and pick out what you want.

Code to integrate:

import os, urllib.request, json # json for pretty output
from serpapi import GoogleSearch

def get_google_images():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google",
        "q": "pexels cat",
        "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

    # -----------------------
    # Downloading images

    # install the opener with a User-Agent once, before the download loop
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
    urllib.request.install_opener(opener)

    for index, image in enumerate(results['images_results']):

        # print(f'Downloading {index} image...')

        # the SerpApi_Images folder must already exist
        urllib.request.urlretrieve(image['original'], f'SerpApi_Images/original_size_img_{index}.jpg')

get_google_images()

---------------
'''
[
...
{
"position": 100, # img number
"thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
"source": "pexels.com",
"title": "Close-up of Cat · Free Stock Photo",
"link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
"original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
"is_product": false
}
]
'''
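Iterating over that structured output could look like the sketch below. It works on a made-up list that reuses the field names from the sample output above. Whether real responses ever omit the original field is an assumption, and collect_originals is a hypothetical helper name.

```python
# Made-up results list that follows the field names in the sample output above.
images_results = [
    {"position": 1, "title": "Close-up of Cat", "original": "https://example.com/cat-1.jpeg"},
    {"position": 2, "title": "Entry without a full-size link"},
]


def collect_originals(results):
    """Return (position, original URL) pairs, skipping entries that lack an 'original' key."""
    return [(item["position"], item["original"]) for item in results if "original" in item]


print(collect_originals(images_results))
```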

P.S. I wrote a more in-depth blog post about how to scrape Google Images.

Disclaimer, I work for SerpApi.

Download and SAVE MANY Images from google Image search to a LOCAL FOLDER (Python)

You could define a function and use that function to repeat the task for each image url that you would like to write to disk:

import urllib.request

def image_request(url, file):
    response = urllib.request.urlopen(url)
    # open the file for writing in binary mode; the with block closes it afterwards
    with open(file, "wb") as fh:
        fh.write(response.read())

For example, if you had a list with urls you could loop over the list:

for i, url in enumerate(urllist):
    image_request(url, str(i) + ".jpg")
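When looping over many URLs, a single failed download would otherwise stop the whole run. The sketch below is one possible variation on the loop above, not the original approach: download_all is a hypothetical helper that creates the target folder, keeps the index-based file names, and skips URLs that cannot be fetched.

```python
import urllib.error
import urllib.request
from pathlib import Path


def download_all(urls, folder):
    """Save each URL as <index>.jpg inside folder and return the paths that were written."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, url in enumerate(urls):
        target = folder / f"{i}.jpg"
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                target.write_bytes(response.read())
        except (urllib.error.URLError, ValueError) as exc:
            # URLError covers unreachable or missing resources, ValueError malformed URLs.
            print(f"skipped {url}: {exc}")
            continue
        saved.append(target)
    return saved
```

For example, download_all(urllist, "images") would write 0.jpg, 1.jpg and so on into an images folder, leaving gaps in the numbering for any URL that failed.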

