Use AWS Glue Python with NumPy and Pandas Python Packages

I think the current answer is that you cannot. According to the AWS Glue documentation:

Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.

But even when I tried to include a library written in plain Python from S3, the Glue job failed because of an HDFS permission problem. If you find a way to solve this, please let me know as well.
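
Before uploading a library, it can help to check locally whether it is pure Python in the sense of the documentation quoted above. The following is a minimal sketch, not an official AWS check: the helper names has_compiled_files and is_pure_python are made up for illustration, and it assumes Python 3.8+ on your own machine (for importlib.metadata) with the library already installed there. It only inspects the files shipped by that one distribution, not its dependencies.

```python
import importlib.machinery
import importlib.metadata


def has_compiled_files(filenames):
    """Return True if any filename looks like a compiled extension module."""
    # EXTENSION_SUFFIXES is platform specific, so common suffixes from
    # other platforms are added explicitly.
    suffixes = tuple(importlib.machinery.EXTENSION_SUFFIXES) + (
        ".so", ".pyd", ".dll", ".dylib",
    )
    return any(str(name).endswith(suffixes) for name in filenames)


def is_pure_python(dist_name):
    """Best-effort check on a locally installed distribution.

    Hypothetical helper: raises PackageNotFoundError if the distribution
    is not installed, and ValueError if its file list is unavailable.
    """
    files = importlib.metadata.files(dist_name)
    if files is None:
        raise ValueError("no file list recorded for " + dist_name)
    return not has_compiled_files(files)

# Example call (the result depends on what is installed locally):
#   is_pure_python("pandas")
```

A False result suggests the library ships C extensions, which is the case the documentation says is not supported.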

Using Pandas in AWS Glue Python Shell Jobs

  1. Go to https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html#create-python-extra-library and check the section
    "To create a Python .egg or .whl file" for how to create the setup file for a Python shell job.
  2. In the setup.py file, add the line install_requires=['pandas==0.25.1']:

    from setuptools import setup

    setup(
        name="<module name>",
        version="0.1",
        packages=['<package name if any or ignore>'],
        # Declare pandas as a dependency of this package
        install_requires=['pandas==0.25.1'],
    )

I also wrote a small shell script that deploys a Python shell job without the manual steps: it creates the .egg file, uploads it to S3, and deploys via CloudFormation automatically.
You can find the code at https://github.com/fatangare/aws-python-shell-deploy

AWS Glue Python shell - Using multiple libraries

This question was already answered by gbeaven, but for some reason I am unable to mark it as the answer. The problem was fixed by separating the file paths in the additional Python modules setting with commas.
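
As a rough illustration of the comma-separated form, here is a minimal sketch. The bucket name, file names, and the helper join_library_paths are made up for the example, and it assumes the setting in question is the --extra-py-files job parameter, which takes complete S3 paths separated by commas.

```python
def join_library_paths(paths):
    """Join S3 paths into one comma-separated string with no spaces."""
    cleaned = [p.strip() for p in paths if p and p.strip()]
    return ",".join(cleaned)


# Hypothetical bucket and file names, for illustration only.
libraries = [
    "s3://my-bucket/libs/first_library-0.1-py3-none-any.whl",
    "s3://my-bucket/libs/second_library-0.1-py3-none-any.whl",
]

# Assumption: "--extra-py-files" is the job parameter the answer refers to.
default_arguments = {"--extra-py-files": join_library_paths(libraries)}
```

A dictionary like default_arguments could then be supplied as the job's default arguments, for example in a CloudFormation template or through the DefaultArguments parameter of boto3's create_job.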


