• The core idea of Structured Streaming is to treat a live data stream as a table that is being continuously appended to.
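A minimal pure-Python analogy of that mental model (not Spark itself): each micro-batch appends rows to an ever-growing logical table, and the query result is re-evaluated over the whole table.

```python
# Analogy: the stream is an unbounded table; each batch appends rows,
# and the "query" (here a sum) is recomputed over the appended table.
table = []  # the unbounded input table

def process_batch(rows):
    table.extend(rows)   # the stream appends new rows
    return sum(table)    # the query result reflects the whole table

assert process_batch([1, 2]) == 3
assert process_batch([3]) == 6  # result grows with the table
```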

• A Spark job represents the set of transformations triggered by an individual action, and you can monitor that job from the Spark UI.
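The transformation/action split can be illustrated with a plain-Python analogy: like Spark transformations, a generator pipeline does no work until a terminal operation (the "action") consumes it.

```python
# Analogy (not Spark): transformations are lazy like generator pipelines;
# nothing executes until an "action" consumes the pipeline.
calls = []

def track(x):
    calls.append(x)   # record that work actually happened
    return x * 2

pipeline = (track(x) for x in range(3))  # "transformation": nothing runs yet
assert calls == []                        # no work done before the action
result = list(pipeline)                   # "action": forces evaluation
assert result == [0, 2, 4]
assert calls == [0, 1, 2]
```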


By default, Spark produces 200 shuffle partitions. Let's set this value to 5 to reduce the number of output partitions from the shuffle:

spark.conf.set("spark.sql.shuffle.partitions", "5")
flightData2015.sort("count").take(2)

https://data-flair.training/blogs/apache-spark-map-vs-flatmap/
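The map-vs-flatMap distinction from the link above, sketched in plain Python (no Spark needed): map produces exactly one output element per input, while flatMap flattens each returned iterable into the result.

```python
# map vs flatMap semantics, emulated with list comprehensions.
lines = ["a b", "c"]

mapped = [line.split(" ") for line in lines]            # map: one list per line
flat = [w for line in lines for w in line.split(" ")]   # flatMap: flattened words

assert mapped == [["a", "b"], ["c"]]
assert flat == ["a", "b", "c"]
```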

http://xinhstechblog.blogspot.com/2016/04/spark-window-functions-for-dataframes.html

https://stackoverflow.com/questions/48302090/how-to-aggregate-over-1-hour-windows-cumulatively-within-a-day-in-pyspark
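The cumulative-within-a-day idea from the link above, sketched in plain Python under assumed sample data: group hourly counts by day, then take a running sum within each day (what a cumulative window over a day partition computes).

```python
from itertools import accumulate, groupby

# (day, hour, count) rows, sorted by day as groupby requires.
events = [("d1", 1, 5), ("d1", 2, 3), ("d2", 1, 7)]

cumulative = {}
for day, group in groupby(events, key=lambda e: e[0]):
    counts = [c for _, _, c in group]
    cumulative[day] = list(accumulate(counts))  # running sum within the day

assert cumulative == {"d1": [5, 8], "d2": [7]}
```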

https://stackoverflow.com/questions/33878370/how-to-select-the-first-row-of-each-group
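The "first row of each group" pattern from the link above, emulated in plain Python: keep, per group key, the row that ranks first under some ordering (what a `row_number() == 1` window filter does in Spark).

```python
# Pick the top-valued row per key, emulating row_number() over a
# partition ordered by value descending, filtered to row 1.
rows = [("a", 3), ("a", 9), ("b", 4)]

best = {}
for key, value in rows:
    if key not in best or value > best[key]:
        best[key] = value

assert best == {"a": 9, "b": 4}
```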

https://stackoverflow.com/questions/36926856/spark-sql-how-to-append-new-row-to-dataframe-table-from-another-table?rq=1

https://stackoverflow.com/questions/52412643/generating-monthly-timestamps-between-two-dates-in-pyspark-dataframe
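The monthly-timestamps idea from the link above, sketched in plain Python (the PySpark answers typically build the range and explode it; this shows the underlying month arithmetic; `month_starts` is a name chosen here for illustration).

```python
from datetime import date

def month_starts(start, end):
    """Return the first-of-month dates covering the months from start to end."""
    y, m = start.year, start.month
    out = []
    while (y, m) <= (end.year, end.month):  # tuple comparison: year, then month
        out.append(date(y, m, 1))
        m += 1
        if m == 13:  # roll over into the next year
            y, m = y + 1, 1
    return out

assert month_starts(date(2018, 11, 5), date(2019, 2, 1)) == [
    date(2018, 11, 1), date(2018, 12, 1), date(2019, 1, 1), date(2019, 2, 1),
]
```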

Parallelize questions:

1. Should we parallelize a DataFrame the way we parallelize a Seq before training?