Photo by yongzheng xu on Unsplash

Optimizing your Pandas, when you run out of scale 🐍

Gemma
3 min read · Jul 21, 2022

This is a list of notes and links to other articles and references for anyone trying to tackle scaling issues with Pandas. It's not a comprehensive document and is updated ad hoc for reference.

“Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language.”

Pandas is restrictive because most of its operations run on a single core at a time. If you need to run work across several cores in parallel, the data frameworks and optimizations below might be handy to consider.

“A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.”

Consider the following alternatives to pandas:

Polars, Dask (whose task scheduler is loosely similar to Airflow), or Vaex.

https://towardsdatascience.com/top-3-alternative-python-packages-for-pandas-d125627ce349

https://towardsdatascience.com/8-alternatives-to-pandas-for-processing-large-datasets-928fc927b08c

Using Modin to optimize your workloads: https://towardsdatascience.com/how-to-speed-up-pandas-with-modin-84aa6a87bcdb

https://github.com/modin-project/modin

Other things to speed up Pandas workloads:

Index Optimization

Vectorise Operations

Memory Optimization

Filter Optimization
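As a rough illustration of the four ideas above, here is a minimal sketch on made-up data. The column names (`order_id`, `country`, `quantity`, `price`) and the sizes are assumptions for the example only, and the actual savings will depend on your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: 100,000 orders (all names and values are made up).
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "order_id": np.arange(n),
    "country": rng.choice(["DE", "FR", "NL", "UK"], size=n),
    "quantity": rng.integers(1, 10, size=n),
    "price": rng.uniform(1.0, 100.0, size=n),
})

# Memory optimization: store repeated strings as categories and use
# smaller numeric types (float32 trades some precision for space).
before = df.memory_usage(deep=True).sum()
df["country"] = df["country"].astype("category")
df["quantity"] = pd.to_numeric(df["quantity"], downcast="unsigned")
df["price"] = df["price"].astype("float32")
after = df.memory_usage(deep=True).sum()

# Vectorized operation: work on whole columns instead of looping
# over rows or calling apply() with a Python function per row.
df["total"] = df["price"] * df["quantity"]

# Filter optimization: build one boolean mask and select only the
# columns you need, before doing any further (more expensive) work.
mask = (df["country"] == "NL") & (df["quantity"] >= 5)
subset = df.loc[mask, ["order_id", "total"]]

# Index optimization: put the lookup key in the index so repeated
# lookups by key do not have to scan a column each time.
indexed = df.set_index("order_id")
row = indexed.loc[12_345]

print(f"memory before: {before:,} bytes, after: {after:,} bytes")
print(f"rows after filtering: {len(subset):,}")
```

Measure before and after on your own workload; the articles linked below go into more depth on each technique.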

Also consider sharding (partitioning) the database and separating reads from writes (Command Query Responsibility Segregation, or CQRS), with write commands sent through an asynchronous queue.
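To make the sharding and read/write separation idea a bit more concrete, here is a toy sketch using only the Python standard library. The in-memory dictionaries stand in for database shards, and every name here (`shard_for`, `NUM_SHARDS`, the `user-…` keys) is hypothetical. In a real system the read side would typically be a separate replica or read model rather than the same store.

```python
import queue
import threading
import zlib

NUM_SHARDS = 4  # assumption: a fixed number of shards

def shard_for(key: str) -> int:
    """Stable hash, so the same key always maps to the same shard."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS

# Each dict stands in for one database shard.
shards = [dict() for _ in range(NUM_SHARDS)]

# Write commands go through a queue and are applied asynchronously.
commands = queue.Queue()

def writer() -> None:
    """Command side: apply queued writes one at a time."""
    while True:
        cmd = commands.get()
        if cmd is None:  # sentinel: stop the worker
            commands.task_done()
            break
        key, value = cmd
        shards[shard_for(key)][key] = value
        commands.task_done()

def read(key: str):
    """Query side: reads go straight to the shard that owns the key."""
    return shards[shard_for(key)].get(key)

worker = threading.Thread(target=writer, daemon=True)
worker.start()

for i in range(100):
    commands.put((f"user-{i}", {"visits": i}))

commands.join()     # wait until every queued write has been applied
commands.put(None)  # tell the worker to stop
worker.join()

print(read("user-42"))
print([len(s) for s in shards])
```

Because writes are applied asynchronously, a read issued before the queue drains may not see the latest write yet; that eventual consistency is the usual trade-off of this pattern.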

https://medium.com/bigdatarepublic/advanced-pandas-optimize-speed-and-memory-a654b53be6c2

https://towardsdatascience.com/how-to-make-your-pandas-operation-100x-faster-81ebcd09265c

Data Pipelines — Some notes on open source solutions

For building data pipelines, two of the most popular open source solutions are Hadoop and Spark.

One of the pitfalls of open source solutions like Apache Spark and Apache Hadoop is scalability: “A subtler, if equally critical, problem is the way companies’ data centre deployments of Apache Hadoop and Apache Spark directly tie together the compute and storage resources in the same servers, creating an inflexible model where they must scale in lockstep. This means that almost any on-premises environment pays for high amounts of under-used disk capacity, processing power, or system memory, as each workload has different requirements for these components.”

https://d1.awsstatic.com/whitepapers/amazon_emr_migration_guide.pdf

Data Analytics — Spark — AWS — Old (2016) but much is still relevant.

https://www.youtube.com/watch?v=Mxr408U_gqo&t=2s

AWS recommends separating your data storage (in S3) from your data processing to save on costs. For scalability, you can also use Lambda functions to trigger processing at certain times of day, depending on your performance requirements.

https://guyernest.medium.com/architecting-a-successful-modern-data-analytics-platform-in-the-cloud-f090b7a04696

DataFrames have largely replaced RDDs as Spark's main user-facing API; however, if you need to know what an RDD is:

“RDD was the primary user-facing API in Spark since its inception. At the core, an RDD is an immutable distributed collection of elements of your data, partitioned across nodes in your cluster that can be operated in parallel with a low-level API that offers transformations and actions.”

DataFrames in Spark are resilient: if a node fails, Spark recomputes the lost partitions from their lineage.
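The recompute-on-failure idea can be shown with a small toy model in plain Python. This is not Spark's API: `ToyDataset`, `lose_partition` and the rest are hypothetical names made up for this sketch. It only shows the principle that an immutable, partitioned dataset which remembers its transformations can rebuild a lost partition from the source data.

```python
class ToyDataset:
    """Toy model of lineage-based recovery (not Spark's actual API)."""

    def __init__(self, source_partitions, transforms=None):
        # The source data is split into partitions and never modified.
        self._source = [list(p) for p in source_partitions]
        # The lineage: the ordered list of functions applied so far.
        self._transforms = list(transforms or [])
        # Computed results per partition; None means "not available".
        self._cache = [None] * len(self._source)

    def map(self, fn):
        # A transformation returns a new dataset with a longer lineage.
        return ToyDataset(self._source, self._transforms + [fn])

    def _compute(self, i):
        # Rebuild partition i by replaying the lineage on its source rows.
        rows = self._source[i]
        for fn in self._transforms:
            rows = [fn(x) for x in rows]
        return rows

    def collect(self):
        # An action: compute any missing partitions, then gather results.
        out = []
        for i in range(len(self._source)):
            if self._cache[i] is None:
                self._cache[i] = self._compute(i)
            out.extend(self._cache[i])
        return out

    def lose_partition(self, i):
        # Simulate a node failure that drops one computed partition.
        self._cache[i] = None


data = ToyDataset([[1, 2], [3, 4], [5, 6]])
transformed = data.map(lambda x: x * 2).map(lambda x: x + 1)

first = transformed.collect()
transformed.lose_partition(1)   # pretend the node holding partition 1 died
second = transformed.collect()  # only partition 1 is recomputed

print(first)
print(second == first)
```

Only the lost partition is rebuilt; the other partitions are reused as they are.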

Hadoop versus Spark use cases and summary comparison from IBM

Hadoop is most effective for scenarios that involve the following:

  • Processing big data sets in environments where data size exceeds available memory
  • Batch processing with tasks that exploit disk read and write operations
  • Building data analysis infrastructure with a limited budget
  • Completing jobs that are not time-sensitive
  • Historical and archive data analysis

Spark use cases

Spark is most effective for scenarios that involve the following:

  • Dealing with chains of parallel operations by using iterative algorithms
  • Achieving quick results with in-memory computations
  • Analyzing stream data in real time
  • Graph-parallel processing to model data
  • All ML applications

Updates to follow, folks…


Gemma

Business Developer, programmer, solution architect, runner, swimmer, a culture and tech nerd. Busy building new solutions in emerging technologies.