Spark Streaming useful links


• Structured Streaming treats a live data stream as a table that is being continuously appended to.

• A Spark job represents the set of transformations triggered by a single action, and you can monitor that job from the Spark UI.

By default, Spark produces 200 shuffle partitions. Let's set this value to 5 to reduce the number of output partitions from the shuffle:

```
spark.conf.set("spark.sql.shuffle.partitions", "5")
flightData2015.sort("count").take(2)
```

Not your Father’s Database https://databricks.com/session/not-your-fathers-database-how-to-use-apache-spark-properly-in-your-big-data-architecture/

https://data-flair.training/blogs/apache-spark-map-vs-flatmap/

http://xinhstechblog.blogspot.com/2016/04/spark-window-functions-for-dataframes.html

https://stackoverflow.com/questions/48302090/how-to-aggregate-over-1-hour-windows-cumulatively-within-a-day-in-pyspark

https://stackoverflow.com/questions/33878370/how-to-select-the-first-row-of-each-group

https://stackoverflow.com/questions/36926856/spark-sql-how-to-append-new-row-to-dataframe-table-from-another-table?rq=1

https://stackoverflow.com/questions/52412643/generating-monthly-timestamps-between-two-dates-in-pyspark-dataframe

Parallelize questions:

1. Should we parallelize a DataFrame the way we parallelize a Seq before training?
2. https://elbauldelprogramador.com/en/how-to-convert-column-to-vectorudt-densevector-spark/