AWS Lambda is a convenient way to run and deploy simple functions (FaaS) which scale gracefully under load. However, for Python, importing external packages and dependencies can be a pain given the restricted environment that AWS Lambda provides.
One of our projects involved a massive PDF processing job: we had to extract the text from each PDF and then use it for further processing. We used the Python library “Camelot” to parse the PDFs and get the text out of them, and we parsed thousands of PDFs in parallel to reduce the total processing time. At the time, a single $20 DigitalOcean droplet hosted the main HTTP application and also processed all the PDFs. The problem we encountered was a CPU bottleneck: Camelot hogged CPU time and stalled the main HTTP server.
Camelot is an extensive library with a few external Linux dependencies, and it uses a lot of CPU while extracting text from PDFs. So we decided to shift the load to AWS Lambda, as it scales with request load and handles parallel requests very well. AWS Lambda lets us customize many of the specifications for a function, and it supports major programming languages like Python 2/3, NodeJS, Golang, Java and more. So we started packaging the Python code we had written for deployment on AWS Lambda. Camelot uses “pandas”, “OpenCV”, “numpy” and other such libraries, which brings the total package size to just shy of 250 MB. Camelot also depends on “Tkinter” and “Ghostscript”, which are system-level dependencies that must be installed before the library can be loaded.
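To make the packaging step concrete, here is a minimal sketch of what a Lambda entry point wrapping Camelot might look like. It is not the code from our project: the event shape (a `pdf_base64` field) and the names `make_handler` and `extract_tables_with_camelot` are assumptions made for illustration. Camelot is imported lazily so the module still loads where the library is missing.

```python
import base64
import os
import tempfile


def extract_tables_with_camelot(pdf_path):
    """Return each detected table as CSV text, using Camelot's lattice mode."""
    # Imported lazily so this module still loads where Camelot is not installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages="all", flavor="lattice")
    return [table.df.to_csv(index=False) for table in tables]


def make_handler(extract=extract_tables_with_camelot):
    """Build a Lambda-style handler around any `extract(pdf_path)` callable."""

    def handler(event, context):
        # Hypothetical event shape: {"pdf_base64": "<base64-encoded PDF bytes>"}
        pdf_bytes = base64.b64decode(event["pdf_base64"])

        # Write the PDF to the temp directory, since Camelot reads from a file path.
        fd, pdf_path = tempfile.mkstemp(suffix=".pdf")
        try:
            with os.fdopen(fd, "wb") as pdf_file:
                pdf_file.write(pdf_bytes)
            results = extract(pdf_path)
        finally:
            # Always clean up the temporary file, even if extraction fails.
            os.remove(pdf_path)

        return {"count": len(results), "results": results}

    return handler


# The function Lambda would be configured to invoke.
lambda_handler = make_handler()
```

Passing the extractor in as a parameter keeps the handler testable locally with a stub, without installing Camelot or its system dependencies.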
AWS Lambda hosts functions on an Amazon Linux image, which comes with a fixed set of preinstalled packages. To run Ghostscript in the Lambda environment, we needed to bundle a compatible Ghostscript binary with the code. Some of Camelot's features don’t support old Ghostscript binaries, and the Lambda image has old binaries preinstalled which cannot be updated. With the Ghostscript binary included, all of Camelot's dependencies together came to more than 250 MB, and AWS Lambda doesn’t support an unzipped deployment package larger than 250 MB. Camelot does run, but a few features, such as table text extraction from PDFs, throw an error. Those features rely on Ghostscript, which is an interpreter for PostScript and PDF files. Considering all these constraints, we decided not to use AWS Lambda.
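The size limit is easy to check before uploading anything. The following is a rough sketch, using only the standard library, that totals the files in a build directory and compares the result against the 250 MB unzipped limit. The function names and the idea of a single build directory are assumptions for illustration.

```python
import os

# 250 MB unzipped deployment package limit, expressed in bytes.
LAMBDA_UNZIPPED_LIMIT_BYTES = 250 * 1024 * 1024


def directory_size_bytes(root):
    """Sum the sizes of all regular files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_in_lambda(root, limit=LAMBDA_UNZIPPED_LIMIT_BYTES):
    """Return (fits, size_in_bytes) for the package directory `root`."""
    size = directory_size_bytes(root)
    return size <= limit, size
```

Calling `fits_in_lambda("build/")` on a hypothetical `build/` directory that holds the code, the Python packages and the bundled binaries would show whether the package has any chance of being accepted.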
We then had to find a more lightweight library for extracting text from PDFs. We came across “pdftotext”, which is faster than Camelot. We deployed it on the main server itself because it can process many PDFs in parallel. However, pdftotext cannot match the text extraction accuracy that Camelot provides in lattice mode. We hope AWS offers more support for massive libraries like Camelot so that such functions can be deployed. AWS should also provide more support for the external system-level libraries that are critical to the runtime of a FaaS.
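As an illustration of processing many PDFs in parallel on one server, here is a minimal sketch that assumes the `pdftotext` command-line tool from Poppler is installed and on the PATH. It is not necessarily how our deployment invoked pdftotext, and the names `extract_text` and `extract_many` are hypothetical. Threads are enough here because the heavy work happens in the child processes, and `max_workers` caps how many run at once so the HTTP server keeps some CPU for itself.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def extract_text(pdf_path):
    """Run the `pdftotext` CLI on one file and return the extracted text."""
    # "-" as the output file makes pdftotext write the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_many(pdf_paths, extract=extract_text, max_workers=4):
    """Extract text from many PDFs concurrently.

    Returns a dict mapping each path to its text, or to the exception
    raised for that file, so one bad PDF does not abort the whole batch.
    """
    paths = list(pdf_paths)

    def safe_extract(path):
        try:
            return extract(path)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(safe_extract, paths)))
```

A call such as `extract_many(["a.pdf", "b.pdf"], max_workers=2)` would run at most two pdftotext processes at a time.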
Ideally, to run such a FaaS, we need a dynamically scalable server setup that spins up servers/droplets according to load, divides the workload among them, and then scales back down gracefully. AWS Lambda already does that for us at a highly abstracted level, but we hope they improve the service to support code that is small yet heavy on dependencies.
Thanks for reading, have a productive day. Keep Coding!