Spark Streaming useful links

• The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended.

• A Spark job represents a set of transformations triggered by an individual action; you can monitor that job from the Spark UI.
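The first point above, a stream viewed as a continuously appended table, can be sketched in plain Python. This is only a toy model and does not use the Spark API: the class `UnboundedTable` and its method `append_batch` are hypothetical names made up for illustration. Each arriving batch is appended as new rows, and a running count per key is updated from the new rows only, which is roughly how the result of a streaming aggregation evolves.

```python
from collections import Counter


class UnboundedTable:
    """Toy model of a stream seen as an ever-growing table (hypothetical, not a Spark class)."""

    def __init__(self):
        self.rows = []           # the "input table": every row seen so far
        self.counts = Counter()  # the "result table": running count per key

    def append_batch(self, batch):
        """Append newly arrived rows and update the result incrementally."""
        self.rows.extend(batch)
        self.counts.update(batch)  # only the new rows are processed
        return dict(self.counts)


table = UnboundedTable()
first = table.append_batch(["cat", "dog"])   # first batch arrives
second = table.append_batch(["dog", "owl"])  # second batch is appended to the same table
print(first)
print(second)
```

After the second batch, the input table holds all four rows and the count for "dog" has grown to 2, without re-reading the first batch.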

By default, Spark outputs 200 shuffle partitions. Let’s set this value to 5 to reduce the number of output partitions from the shuffle:

spark.conf.set("spark.sql.shuffle.partitions", "5")
flightData2015.sort("count").take(2)
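To get a feel for why this setting matters on small data, here is a simplified sketch in plain Python. It is not Spark's implementation: the function `shuffle_partition` and the sample rows are hypothetical, and it uses hash partitioning by key, which is only one of the schemes a shuffle can use. The point is that the configured number fixes how many output partitions exist, however few rows there are.

```python
import zlib


def shuffle_partition(rows, num_partitions):
    """Assign each (key, value) row to a bucket by a stable hash of its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        # crc32 is used only because it is deterministic across runs
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


# Hypothetical sample rows: (destination country, flight count)
rows = [
    ("Ireland", 344), ("Romania", 15), ("Croatia", 1),
    ("Egypt", 15), ("India", 62), ("Singapore", 1),
]

default_layout = shuffle_partition(rows, 200)  # mimics the default of 200
reduced_layout = shuffle_partition(rows, 5)    # mimics the setting of 5

empty_by_default = sum(1 for p in default_layout if not p)
print(len(default_layout), len(reduced_layout))
```

With only six rows, at least 194 of the 200 default partitions stay empty, so on a small dataset most of them would be scheduling overhead with no data to process.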

Not your Father’s Database

Parallelize questions:

1. Should we parallelize a DataFrame like we parallelize a Seq before training?