Polars is a fast DataFrame library for efficient data manipulation in Python. The library requires Python 3.7 or higher. Polars operations are not in-place; each operation returns a new DataFrame.
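A minimal sketch of that behavior with the Polars Python API (column names and data are illustrative; newer releases spell the grouping method group_by, older ones groupby):

    import polars as pl

    df = pl.DataFrame({"city": ["Oslo", "Lima", "Oslo"], "temp": [5, 22, 7]})

    # Each call returns a new DataFrame; df itself is left untouched.
    result = (
        df.filter(pl.col("temp") > 6)
          .group_by("city")
          .agg(pl.col("temp").mean())
    )
    print(df)      # original data, unchanged
    print(result)  # new DataFrame holding the aggregated values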
Apache Spark is an open-source cluster-computing framework for data-intensive workloads. Running in memory, it can be up to two orders of magnitude faster than Hadoop MapReduce. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext in the driver program.
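A minimal PySpark sketch of the driver creating a SparkContext that coordinates the work (the app name and local master URL are placeholders for illustration):

    from pyspark import SparkConf, SparkContext

    # The driver program builds a SparkContext, which schedules work on the executors.
    conf = SparkConf().setAppName("example-app").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize(range(1, 1001))
    total = rdd.map(lambda x: x * 2).sum()  # work is distributed across executor processes
    print(total)
    sc.stop()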
Avro is a language-independent, schema-based data serialization library. It uses JSON to define the structure of the data and is best suited for Big Data processing in Hadoop and Kafka. Schemas are kept in .avsc files, and Avro data files store the schema alongside the data as metadata.
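A small sketch using the fastavro package (one of several Avro libraries for Python; the record schema and file name are made up). The same JSON schema would normally live in a .avsc file:

    import fastavro

    # JSON schema describing the record structure (equivalent to a .avsc file).
    schema = {
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"},
        ],
    }
    parsed = fastavro.parse_schema(schema)

    records = [{"name": "Ada", "age": 36}, {"name": "Linus", "age": 52}]
    with open("users.avro", "wb") as out:
        # The schema is written into the file as metadata alongside the records.
        fastavro.writer(out, parsed, records)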
The spark-submit script launches applications on clusters through a uniform interface. Applications must be packaged as assembly jars containing the application code and its dependencies; Spark and Hadoop dependencies are provided by the cluster manager at runtime. Python applications can ship .py, .zip, or .egg files via the --py-files option.
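An example invocation (the master URL, script name, and deps.zip archive are placeholders, not part of any real deployment):

    ./bin/spark-submit \
      --master spark://master-host:7077 \
      --py-files deps.zip \
      my_app.py /data/input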
Spark standalone mode can run on a cluster or on a single machine. Security features such as authentication are not enabled by default. The cluster can be started manually or with the provided launch scripts, which connect to the workers over SSH and assume password-less access by default.
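A sketch of the manual and scripted start-up paths (the host name is a placeholder; older releases name the worker script start-slave.sh and the config file conf/slaves):

    # Manual start: master first, then a worker pointed at it
    ./sbin/start-master.sh
    ./sbin/start-worker.sh spark://master-host:7077

    # Scripted start: reads conf/workers and SSHes into each machine,
    # which is why password-less SSH access is assumed by default
    ./sbin/start-all.sh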
The Spark Web UI monitors application status and resource consumption. Application code consists of transformations and actions: transformations are lazy instructions recorded by the driver, while actions trigger actual execution.
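A minimal PySpark illustration of that lazy/eager split (the app name is arbitrary):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "lazy-eval-demo")

    rdd = sc.parallelize(range(10))
    squared = rdd.map(lambda x: x * x)            # transformation: only recorded, nothing runs
    evens = squared.filter(lambda x: x % 2 == 0)  # still lazy

    print(evens.collect())                        # action: triggers the actual computation
    sc.stop()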