This is a list of notes and links to other articles and references for anyone trying to tackle scaling issues with Pandas. It is not a comprehensive document and is updated ad hoc for reference.
“Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language.”
Pandas is restrictive in that it runs on a single core at a time. If you need to run work across several cores in parallel, the data frameworks and optimization approaches below may be worth considering.
“A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.”
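For example, a minimal pandas DataFrame looks like this (the column names and values are purely illustrative):

```python
import pandas as pd

# A small illustrative DataFrame: each row is an observation, each column a named field.
df = pd.DataFrame({
    "city": ["London", "Paris", "Berlin"],
    "population_m": [8.9, 2.1, 3.6],
})

# Typical spreadsheet-like operations: filter rows and add a derived column.
large_cities = df[df["population_m"] > 3]
df["population"] = df["population_m"] * 1_000_000
print(df)
```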
Consider the following alternatives to pandas:
Polars, Dask (which, like Airflow, schedules work as a graph of tasks), or Vaex; a short Dask sketch follows the links below.
https://towardsdatascience.com/top-3-alternative-python-packages-for-pandas-d125627ce349
https://towardsdatascience.com/8-alternatives-to-pandas-for-processing-large-datasets-928fc927b08c
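To give a rough feel for one of the alternatives, here is a minimal Dask sketch that reads a set of CSV files in parallel across cores; the file pattern and column names are placeholders, not taken from any of the articles above:

```python
import dask.dataframe as dd

# Dask splits the data into partitions and schedules the work across cores,
# so the familiar pandas-style groupby runs in parallel.
ddf = dd.read_csv("data/events-*.csv")            # hypothetical file pattern
result = ddf.groupby("user_id")["amount"].mean()  # lazy: builds a task graph
print(result.compute())                           # .compute() triggers execution
```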
Using Modin to optimize your workloads: https://towardsdatascience.com/how-to-speed-up-pandas-with-modin-84aa6a87bcdb
https://github.com/modin-project/modin
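Modin's documented usage pattern is to swap the import and leave the rest of the pandas code unchanged; the CSV path below is a placeholder:

```python
# Instead of `import pandas as pd`, import Modin's pandas API.
# Modin distributes the work across all available cores (using Ray or Dask underneath).
import modin.pandas as pd

df = pd.read_csv("large_dataset.csv")  # hypothetical file
print(df.describe())
```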
Other things to speed up Pandas workloads (a short sketch follows the list):
- Index Optimization
- Vectorized Operations
- Memory Optimization
- Filter Optimization
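A minimal sketch of a few of these ideas in plain pandas; the column names, dtypes, and values are illustrative only:

```python
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "user_id": np.arange(n),
    "country": np.random.choice(["UK", "FR", "DE"], size=n),
    "amount": np.random.rand(n) * 100,
})

# Memory optimization: downcast numerics, use categoricals for low-cardinality strings.
df["user_id"] = pd.to_numeric(df["user_id"], downcast="unsigned")
df["country"] = df["country"].astype("category")

# Vectorized operation instead of a Python-level loop or .apply().
df["amount_vat"] = df["amount"] * 1.2

# Index optimization: set and sort an index once, then use fast label-based lookups.
df = df.set_index("user_id").sort_index()
row = df.loc[42]

# Filter optimization: filter early (and on the categorical) before heavier work.
uk_total = df.loc[df["country"] == "UK", "amount"].sum()
print(uk_total)
```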
Consider also sharding (partitioning) the database and separating reads from writes (Command and Query Responsibility Segregation, or CQRS), with write commands handled via an asynchronous queue.
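As a loose illustration of the CQRS idea (not tied to any particular database or queue technology), writes are expressed as commands on an asynchronous queue while reads hit a separate read model:

```python
import queue
import threading
from dataclasses import dataclass

@dataclass
class CreateOrder:
    """A write-side 'command'; in a real system this would go to a message broker."""
    order_id: int
    amount: float

command_queue: "queue.Queue[CreateOrder]" = queue.Queue()
read_model: dict = {}  # stands in for a read-optimized store or replica

def command_worker() -> None:
    # The write side consumes commands asynchronously and updates the read model.
    while True:
        cmd = command_queue.get()
        read_model[cmd.order_id] = cmd.amount
        command_queue.task_done()

threading.Thread(target=command_worker, daemon=True).start()

# Writers only enqueue commands; readers only query the read model.
command_queue.put(CreateOrder(order_id=1, amount=99.5))
command_queue.join()
print(read_model.get(1))
```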
https://medium.com/bigdatarepublic/advanced-pandas-optimize-speed-and-memory-a654b53be6c2
https://towardsdatascience.com/how-to-make-your-pandas-operation-100x-faster-81ebcd09265c
Data Pipelines — Some notes on open source solutions
For building Data Pipelines some of the most popular open source solutions are Hadoop and Spark.
One of the pitfalls of open source solutions like Apache Spark and Apache Hadoop is scalability: “A subtler, if equally critical, problem is the way companies’ data centre deployments of Apache Hadoop and Apache Spark directly tie together the compute and storage resources in the same servers, creating an inflexible model where they must scale in lockstep. This means that almost any on-premises environment pays for high amounts of under-used disk capacity, processing power, or system memory, as each workload has different requirements for these components.”
https://d1.awsstatic.com/whitepapers/amazon_emr_migration_guide.pdf
Data Analytics — Spark — AWS — Old (2016) but much is still relevant.
https://www.youtube.com/watch?v=Mxr408U_gqo&t=2s
AWS recommends separating your data storage (in S3) from your data processing to save on costs. For scalability, you can also use Lambdas to trigger processing at certain times of day, depending on your performance requirements.
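A hedged sketch of that Lambda idea, assuming an already-running EMR cluster and a scheduled (e.g. EventBridge) trigger; the cluster ID, script path, and bucket name are placeholders:

```python
import boto3

emr = boto3.client("emr")

def handler(event, context):
    # Runs on a schedule and submits a Spark step to an existing EMR cluster,
    # so the data stays in S3 and compute only runs while the step does.
    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
        Steps=[{
            "Name": "nightly-aggregation",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://my-bucket/jobs/aggregate.py",  # placeholder script in S3
                ],
            },
        }],
    )
    return response["StepIds"]
```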
DataFrames have largely replaced RDDs; however, if you need to know what an RDD is:
“RDD was the primary user-facing API in Spark since its inception. At the core, an RDD is an immutable distributed collection of elements of your data, partitioned across nodes in your cluster that can be operated in parallel with a low-level API that offers transformations and actions.”
DataFrames in Spark are resilient: if a node fails, Spark recomputes the lost partitions on the remaining nodes.
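A minimal PySpark sketch contrasting the two APIs (the data is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-vs-rdd").getOrCreate()

# Low-level RDD API: explicit transformations and actions on raw elements.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(rdd.reduceByKey(lambda x, y: x + y).collect())

# DataFrame API: named columns, and Spark's optimizer plans the execution.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()
```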
Hadoop versus Spark use cases and summary comparison from IBM
Hadoop is most effective for scenarios that involve the following:
- Processing big data sets in environments where data size exceeds available memory
- Batch processing with tasks that exploit disk read and write operations
- Building data analysis infrastructure with a limited budget
- Completing jobs that are not time-sensitive
- Historical and archive data analysis
Spark use cases
Spark is most effective for scenarios that involve the following:
- Dealing with chains of parallel operations by using iterative algorithms
- Achieving quick results with in-memory computations
- Analyzing streaming data in real time
- Graph-parallel processing to model data
- Machine learning applications
Updates to follow folks…