App Engine Ignores Symlinks to Directories

Deploy Symfony project to App Engine - ERROR: Too many files

Increase the deployment verbosity using the --verbosity option for the gcloud app deploy command and you'll get the list of all the files uploaded. Then use the skip_files option in your app.yaml to specify the ones you want ignored:

Optional. The skip_files element specifies which files in the
application directory are not to be uploaded to App Engine. The value
is either a regular expression, or a list of regular expressions. Any
filename that matches any of the regular expressions is omitted from
the list of files to upload when the application is uploaded.
Filenames are relative to the project directory.

The skip_files element has the following default:

skip_files:
- ^(.*/)?#.*#$
- ^(.*/)?.*~$
- ^(.*/)?.*\.py[co]$
- ^(.*/)?.*/RCS/.*$
- ^(.*/)?\..*$

Note: specifying skip_files in your app.yaml overrides these defaults, so repeat any default patterns you still need alongside your own.
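To check locally which paths a set of skip_files patterns would exclude, something like the following rough sketch may help. It assumes each pattern is matched against the path relative to the project directory, as the quoted documentation describes; the helper name is_skipped and the extra ^tests/.*$ pattern are made up for illustration and are not part of any Google tooling.

```python
import re

# The default skip_files patterns quoted above.
DEFAULT_SKIP_FILES = [
    r"^(.*/)?#.*#$",
    r"^(.*/)?.*~$",
    r"^(.*/)?.*\.py[co]$",
    r"^(.*/)?.*/RCS/.*$",
    r"^(.*/)?\..*$",
]


def is_skipped(rel_path, patterns=DEFAULT_SKIP_FILES):
    """Return True if rel_path matches at least one skip pattern.

    Assumption: patterns are applied to the project-relative path.
    """
    return any(re.match(pattern, rel_path) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical extra pattern, added on top of the defaults.
    patterns = DEFAULT_SKIP_FILES + [r"^tests/.*$"]
    for path in ["src/main.py", ".git/config", "lib/mod.pyc", "tests/a.py"]:
        print(path, "skipped" if is_skipped(path, patterns) else "uploaded")
```

Keeping the defaults in the list and appending to it mirrors the note above about not losing them by accident.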

I may be wrong, but your project structure image suggests that your app code resides in the src directory. If so, I'd suggest moving the app.yaml file into it: the directory containing the app.yaml file being deployed is considered the top directory of the app/service, and its entire content is uploaded to GAE. You may need to adjust some paths after such a move, since GAE treats all app/service paths as relative to this top directory. If you need them, you can selectively symlink some files/directories from the project directory into the src directory; deployment follows symlinks, replacing them with their actual content.

Some related posts:

  • How to properly deploy node apps to GAE with secret keys?
  • gcloud app deploy : This deployment has too many files

Determine a file's path(s) relative to a directory, including symlinks

This, like many things, is more complex than it might appear on the surface.

Each entity in the file system points at an inode, which describes the content of the file. Entities are the things you see - files, directories, sockets, block devices, character devices, etc...

The content of a single "file" can be accessed via one or more paths - each of these paths is called a "hard link". Hard links can only point at files on the same filesystem, they cannot cross the boundary of a filesystem.

It is also possible for a path to address a "symbolic link", which can point at another path - that path doesn't have to exist, it can be another symbolic link, it can be on another filesystem, or it can point back at the original path producing an infinite loop.

It is impossible to locate all links (symbolic or hard) that point at a particular entity without scanning the entire tree.
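The difference between the two kinds of link can be seen with a small sketch like this one. It assumes a POSIX-like system where os.link and os.symlink are permitted; the function name link_demo and the file names are purely illustrative.

```python
import os
import tempfile


def link_demo(base):
    """Create a file, a hard link, a symlink and a dangling symlink in base."""
    original = os.path.join(base, "original")
    hard = os.path.join(base, "hard")
    soft = os.path.join(base, "soft")
    dangling = os.path.join(base, "dangling")

    with open(original, "w") as fh:
        fh.write("data")
    os.link(original, hard)          # second path to the same inode
    os.symlink("original", soft)     # separate inode that stores a path
    os.symlink("missing", dangling)  # the target does not have to exist

    return {
        # stat() follows symlinks, lstat() looks at the link itself
        "hard_same_inode": os.stat(hard).st_ino == os.stat(original).st_ino,
        "link_count": os.stat(original).st_nlink,
        "symlink_own_inode": os.lstat(soft).st_ino != os.stat(original).st_ino,
        "symlink_resolves": os.stat(soft).st_ino == os.stat(original).st_ino,
        "dangling_exists": os.path.exists(dangling),
        "dangling_lexists": os.path.lexists(dangling),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for key, value in link_demo(tmp).items():
            print(key, value)
```

Note that st_nlink only counts hard links; nothing in the inode records how many symlinks point at it, which is why a scan is needed.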


Before we get into this... some comments:

  1. See the end for some benchmarks. I'm not convinced that this is a significant issue, though admittedly this filesystem is on a 6-disk ZFS array, on an i7, so using a lower spec system will take longer...
  2. Given that this is impossible without calling stat() on every file at some point, you're going to struggle coming up with a better solution that isn't significantly more complex (such as maintaining an index database, with all the issues that introduces)

As mentioned, we must scan (index) the whole tree. I know it's not what you want to do, but it's impossible without doing this...

To do this, you need to collect inodes, not filenames, and review them after the fact... there may be some optimisation here, but I've tried to keep it simple to prioritise understanding.

The following function will produce this structure for us:

def get_map(scan_root):
    # this dict will have device IDs at the first level (major / minor) ...
    # ... and inodes IDs at the second level
    # each inode will have the following keys:
    #   - 'type'     the entity's type - i.e: dir, file, socket, etc...
    #   - 'links'    a list of all found hard links to the inode
    #   - 'symlinks' a list of all found symlinks to the inode
    # e.g: entities[2049][4756]['links'][0]     path to a hard link for inode 4756
    #      entities[2049][4756]['symlinks'][0]  path to a symlink that points at an entity with inode 4756
    entity_map = {}

    for root, dirs, files in os.walk(scan_root):
        root = '.' + root[len(scan_root):]
        for path in [ os.path.join(root, _) for _ in files ]:
            try:
                p_stat = os.stat(path)
            except OSError as e:
                if e.errno == 2:
                    print('Broken symlink [%s]... skipping' % ( path ))
                    continue
                if e.errno == 40:
                    print('Too many levels of symbolic links [%s]... skipping' % ( path ))
                    continue
                raise

            p_dev = p_stat.st_dev
            p_ino = p_stat.st_ino

            if p_dev not in entity_map:
                entity_map[p_dev] = {}
            e_dev = entity_map[p_dev]

            if p_ino not in e_dev:
                e_dev[p_ino] = {
                    'type': get_type(p_stat.st_mode),
                    'links': [],
                    'symlinks': [],
                }
            e_ino = e_dev[p_ino]

            if os.lstat(path).st_ino == p_ino:
                e_ino['links'].append(path)
            else:
                e_ino['symlinks'].append(path)

    return entity_map

I've produced an example tree that looks like this:

$ tree --inodes
.
├── [ 67687] 4 -> 5
├── [ 67676] 5 -> 4
├── [ 67675] 6 -> dead
├── [ 67676] a
│   └── [ 67679] 1
├── [ 67677] b
│   └── [ 67679] 2 -> ../a/1
├── [ 67678] c
│   └── [ 67679] 3
└── [ 67687] d
    └── [ 67688] 4

4 directories, 7 files

The output of this function is:

$ places
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{201: {67679: {'links': ['./a/1', './c/3'],
               'symlinks': ['./b/2'],
               'type': 'file'},
       67688: {'links': ['./d/4'], 'symlinks': [], 'type': 'file'}}}

If we are interested in ./c/3, then you can see that just looking at symlinks (and ignoring hard links) would cause us to miss ./a/1...

By subsequently searching for the path we are interested in, we can find all other references within this tree:

def filter_map(entity_map, filename):
    for dev, inodes in entity_map.items():
        for inode, info in inodes.items():
            if filename in info['links'] or filename in info['symlinks']:
                return info

$ places ./a/1
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{'links': ['./a/1', './c/3'], 'symlinks': ['./b/2'], 'type': 'file'}

The full source for this demo is below. Note that I've used relative paths to keep things simple, but it would be sensible to update this to use absolute paths. Additionally, any symlink that points outside the tree will not currently have a corresponding link... that's an exercise for the reader.
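As a starting point for the "points outside the tree" case, a check along these lines might do. It is only a sketch: the helper name points_outside is invented here, it relies on os.path.realpath to resolve the link, and on Windows os.path.commonpath raises ValueError when the two paths are on different drives.

```python
import os


def points_outside(scan_root, link_path):
    """Return True if link_path resolves to something outside scan_root."""
    root = os.path.realpath(scan_root)
    target = os.path.realpath(link_path)
    # commonpath compares whole path components, so a sibling such as
    # "/data/tree2" is not mistaken for being inside "/data/tree".
    return os.path.commonpath([root, target]) != root
```

Such links could then be recorded in a separate list rather than silently missing from the map.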

It might also be an idea to collect the data while you're filling the tree (if that's something that would work with your process)... you can use inotify to deal with this nicely - there's even a python module.

#!/usr/bin/env python3

import os, sys, stat
from pprint import pprint

def get_type(mode):
    if stat.S_ISDIR(mode):
        return 'directory'
    if stat.S_ISCHR(mode):
        return 'character'
    if stat.S_ISBLK(mode):
        return 'block'
    if stat.S_ISREG(mode):
        return 'file'
    if stat.S_ISFIFO(mode):
        return 'fifo'
    if stat.S_ISLNK(mode):
        return 'symlink'
    if stat.S_ISSOCK(mode):
        return 'socket'
    return 'unknown'

def get_map(scan_root):
    # this dict will have device IDs at the first level (major / minor) ...
    # ... and inodes IDs at the second level
    # each inode will have the following keys:
    #   - 'type'     the entity's type - i.e: dir, file, socket, etc...
    #   - 'links'    a list of all found hard links to the inode
    #   - 'symlinks' a list of all found symlinks to the inode
    # e.g: entities[2049][4756]['links'][0]     path to a hard link for inode 4756
    #      entities[2049][4756]['symlinks'][0]  path to a symlink that points at an entity with inode 4756
    entity_map = {}

    for root, dirs, files in os.walk(scan_root):
        root = '.' + root[len(scan_root):]
        for path in [ os.path.join(root, _) for _ in files ]:
            try:
                p_stat = os.stat(path)
            except OSError as e:
                if e.errno == 2:
                    print('Broken symlink [%s]... skipping' % ( path ))
                    continue
                if e.errno == 40:
                    print('Too many levels of symbolic links [%s]... skipping' % ( path ))
                    continue
                raise

            p_dev = p_stat.st_dev
            p_ino = p_stat.st_ino

            if p_dev not in entity_map:
                entity_map[p_dev] = {}
            e_dev = entity_map[p_dev]

            if p_ino not in e_dev:
                e_dev[p_ino] = {
                    'type': get_type(p_stat.st_mode),
                    'links': [],
                    'symlinks': [],
                }
            e_ino = e_dev[p_ino]

            if os.lstat(path).st_ino == p_ino:
                e_ino['links'].append(path)
            else:
                e_ino['symlinks'].append(path)

    return entity_map

def filter_map(entity_map, filename):
    for dev, inodes in entity_map.items():
        for inode, info in inodes.items():
            if filename in info['links'] or filename in info['symlinks']:
                return info

entity_map = get_map(os.getcwd())

if len(sys.argv) == 2:
    entity_info = filter_map(entity_map, sys.argv[1])
    pprint(entity_info)
else:
    pprint(entity_map)

I've run this on my system out of curiosity. It's a 6x disk ZFS RAID-Z2 pool on an i7-7700K with plenty of data to play with. Admittedly this will run somewhat slower on lower-spec systems...

Some benchmarks to consider:

  • A dataset of ~3.1k files and links in ~850 directories.
    This runs in less than 3.5 seconds, ~80ms on subsequent runs
  • A dataset of ~30k files and links in ~2.2k directories.
    This runs in less than 30 seconds, ~300ms on subsequent runs
  • A dataset of ~73.5k files and links in ~8k directories.
    This runs in approx 60 seconds, ~800ms on subsequent runs

Using simple maths, that's about 1140 stat() calls per second with an empty cache, or ~90k stat() calls per second once the cache has been filled - I don't think that stat() is as slow as you think it is!
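For anyone who wants to redo the arithmetic, here is a rough per-dataset cross-check. It makes simplifying assumptions that are not stated in the benchmarks: one entry counted per file or link, directories ignored, and the "less than" timings taken at their upper bound, so the rates are approximate lower bounds and will not reproduce the rounded figures above exactly.

```python
# (files and links, cold-cache seconds, warm-cache seconds) from the list above
BENCHMARKS = [
    (3_100, 3.5, 0.080),
    (30_000, 30.0, 0.300),
    (73_500, 60.0, 0.800),
]


def entries_per_second(entries, seconds):
    """Throughput in scanned entries per second."""
    return entries / seconds


def summarise(benchmarks=BENCHMARKS):
    """Return (cold rate, warm rate) per dataset, rounded to whole entries."""
    return [
        (round(entries_per_second(n, cold)), round(entries_per_second(n, warm)))
        for n, cold, warm in benchmarks
    ]


if __name__ == "__main__":
    for cold_rate, warm_rate in summarise():
        print(f"cold: ~{cold_rate}/s  warm: ~{warm_rate}/s")
```

The rates land in the same ballpark as the figures quoted above: roughly a thousand entries per second cold, and tens of thousands per second once cached.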

gcloud app deploy : This deployment has too many files

If you really have more than the 10000 files quota in the service you're trying to deploy then you might have to reduce the number accordingly.

Other things to try:

  • you might be able to get a quota increase, see Getting error on GAE: Max number of files and blobs is 10000
  • delete whatever files are not actually needed, or just skip them during deployment: see skip_files or, for more recent Cloud SDK versions, the .gcloudignore file.
  • if you have a lot of static files consider moving (some of) them to GCS instead, see Approaches for overcoming 10000 file limit on Google App Engine?
  • split the service into multiple smaller services - each with its own 10000 files limit.

Assuming you do not actually hit the files quota, the error usually indicates that you have looping/circular symlinks in your app directory. That could also explain a path like the one you mentioned in a comment to this post: https://stackoverflow.com/a/42425048/4495081. You just have to fix the offending symlink(s). Again, a simple/consistent directory structure could help prevent such issues.
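To hunt for such symlinks before deploying, a scan along these lines may be a useful starting point. It is a sketch, not part of gcloud: the function name find_problem_symlinks is invented, and it only reports two cases, namely links whose resolution fails with ELOOP and directory links that point back at one of their own ancestors.

```python
import errno
import os


def find_problem_symlinks(app_dir):
    """Return sorted (path, reason) pairs for symlinks likely to loop."""
    problems = []
    # os.walk does not follow symlinks by default, so the scan itself is safe.
    for root, dirs, files in os.walk(app_dir):
        real_root = os.path.realpath(root)
        for name in dirs + files:
            path = os.path.join(root, name)
            if not os.path.islink(path):
                continue
            try:
                os.stat(path)  # follows the link
            except OSError as exc:
                if exc.errno == errno.ELOOP:
                    problems.append((path, "loop"))
                continue  # broken links are ignored here
            target = os.path.realpath(path)
            # A directory link to an ancestor recurses forever when followed.
            if os.path.isdir(target) and os.path.commonpath([target, real_root]) == target:
                problems.append((path, "ancestor"))
    return sorted(problems)


if __name__ == "__main__":
    for path, reason in find_problem_symlinks("."):
        print(reason, path)
```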

Listing non symbolic link on Windows

There are some problems in your code:

  • you need delayed expansion because you are setting (writing) and expanding (reading) the variable count within the same parenthesised block of code (namely the for /F %%a loop);
  • in your for /F %%a loop you need to state the options "eol=| delims=" in order not to run into trouble with files whose names begin with ; (these would be ignored due to the default eol=; option) or contain white-space (you would receive only the portion before the first white-space because of the default delimiters SPACE and TAB and the default option tokens=1; see for /? for details);
  • dir /B returns file names only, so %%a actually points to files in the current directory rather than to C:\TEMP\; to fix that, simply change to that directory first by cd;
  • to capture the output of a command (line) and assign it to a variable, use another for /F loop and set; this loop is going to iterate once only, because find /C returns only a single line; note the escaped pipe ^| below, which is required to not execute it immediately;
  • there is no comparison operator -EQU, you need to remove the - to check for equality;
  • it is a good idea to use the quoted set syntax as it is most robust against poisonous characters;
  • file and directory paths should generally be quoted since they might contain token delimiters or other poisonous characters;

Here is the fixed script:

@echo off
setlocal EnableDelayedExpansion
pushd "C:\TEMP\" || exit /B 1
for /F "eol=| delims=" %%a in ('dir /B "."') do (
    for /F %%b in ('
        fsutil hardlink list "%%a" ^| find /C /V ""
    ') do (
        set "count=%%b"
    )
    if !count! EQU 1 del "%%a"
)
popd
endlocal

This can even be simplified:

@echo off
pushd "C:\TEMP\" || exit /B 1
for /F "eol=| delims=" %%a in ('dir /B "."') do (
    for /F %%b in ('
        fsutil hardlink list "%%a" ^| find /C /V ""
    ') do (
        if %%b EQU 1 del "%%a"
    )
)
popd

Since the inner for /F loop always iterates exactly once, we can move the if query inside it, which avoids the auxiliary variable, the only thing we needed delayed expansion for.
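If Python is available, the same test (exactly one hard link) can be sketched without parsing fsutil output. This is an illustrative alternative rather than a translation of the batch file: the function name is made up, and it only lists the matching names instead of deleting anything. os.lstat is used for the link count because os.DirEntry.stat() does not fill in st_nlink on Windows.

```python
import os


def files_with_single_link(directory):
    """Return the names of regular files in directory with one hard link."""
    result = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip directories and symlinks; only plain files are of interest.
            if not entry.is_file(follow_symlinks=False):
                continue
            if os.lstat(entry.path).st_nlink == 1:
                result.append(entry.name)
    return sorted(result)


if __name__ == "__main__":
    for name in files_with_single_link("."):
        print(name)
```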

why are some cygwin symlinks not visible from a cmd.exe session

The links are not visible because of their file attribute: files with the S (System) attribute are not listed by a plain dir in CMD, by DOS/Windows design.

From CMD (the output is in German; "Datei nicht gefunden" means "File Not Found") we have:

$ cmd
Microsoft Windows [Version 10.0.19041.450]
(c) 2020 Microsoft Corporation. Alle Rechte vorbehalten.

D:\cygwin64\bin>attrib zipinfo
S D:\cygwin64\bin\zipinfo

D:\cygwin64\bin>dir zipinfo
Datenträger in Laufwerk D: ist DATA
Volumeseriennummer: D603-FB6E

Verzeichnis von D:\cygwin64\bin

Datei nicht gefunden

D:\cygwin64\bin>dir /A:S zipinfo
Datenträger in Laufwerk D: ist DATA
Volumeseriennummer: D603-FB6E

Verzeichnis von D:\cygwin64\bin

19.06.2018 22:17 16 zipinfo
1 Datei(en), 16 Bytes
0 Verzeichnis(se), 542.542.495.744 Bytes frei
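The same attribute can be read from Python, as in this small sketch. It assumes a Windows Python, where os.stat results carry st_file_attributes; on other platforms that field is absent and the function simply returns False. The function name has_system_attribute is illustrative.

```python
import os
import stat


def has_system_attribute(path):
    """Return True if path carries the Windows System (S) attribute."""
    # lstat() so that a link is inspected itself rather than its target.
    attrs = getattr(os.lstat(path), "st_file_attributes", 0)
    return bool(attrs & stat.FILE_ATTRIBUTE_SYSTEM)
```

On the machine shown above, one would expect has_system_attribute(r"D:\cygwin64\bin\zipinfo") to be True, matching the attrib output.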

