In this post you will learn how to set up and configure HDP (Hortonworks Data Platform) to work with Python 3.

Set up Miniconda (Python 3.6.5)

Changing the default Anaconda Python environment: https://stackoverflow.com/questions/28436769/how-to-change-default-anaconda-python-environment/28460907#28460907

Installing Python on CentOS

Fix for R package installation issue

Good PySpark tutorial

Real-Time Kafka Data Ingestion into HBase via PySpark

Changing default python version to 3.6

I would suggest using 'alternatives' instead. As super-user (root), run the following. Start by registering python2 as an alternative, then python3, then choose between them:

alternatives --install /usr/bin/python python /usr/bin/python2 50
alternatives --install /usr/bin/python python /usr/bin/python3.5 60
alternatives --config python

Creating and activating environment:

python3.6 -m venv gavel_env

Go to the /opt/gavel/environments folder and run the following command to activate the environment:

source gavel_env/bin/activate

Set up python3.6 as the Spark default environment:

export SPARK_PRINT_LAUNCH_COMMAND=1
export PYSPARK_PYTHON=python3.6
export SPARK_HISTORY_OPTS= (set to hdfs file)

https://blog.cloudera.com/blog/2013/09/how-to-use-the-hbase-thrift-interface-part-1/
https://www.ibm.com/support/knowledgecenter/en/SSPT3X_4.2.5/com.ibm.swg.im.infosphere.biginsights.product.doc/doc/bi_spark_tables.html

Where I left off:

https://thrift.apache.org/docs/install/centos

Jupyter Notebook

In a command prompt, run the command below to open Jupyter Notebook in the specific folder. Use SHIFT + ENTER to execute the code.

jupyter notebook

Conda create virtual environment

You can create a virtual environment on Windows with the command below (note the `create` subcommand):

conda create --name virtual_env_name package_name
conda create --name siva numpy

Running PySpark on Spark

http://tech.magnetic.com/2016/03/pyspark-carpentry-how-to-launch-a-pyspark-job-with-yarn-cluster.html
https://stackoverflow.com/questions/31450828/oozie-job-wont-run-if-using-pyspark-in-sparkaction/32334531

Adding dependencies for PySpark to run in Spark

https://stackoverflow.com/questions/36461054/i-cant-seem-to-get-py-files-on-spark-to-work
http://tech.magnetic.com/2016/03/pyspark-carpentry-how-to-launch-a-pyspark-job-with-yarn-cluster.html
https://stackoverflow.com/questions/28739729/spark-submit-not-working-when-application-jar-is-in-hdfs

Spark submit for PySpark with --py-files

spark-submit --master yarn --deploy-mode client \
  --name "Edelman heatmap" \
  --jars phoenix-spark-4.13.1-HBase-1.2.jar,phoenix-4.13.1-HBase-1.2-client.jar \
  --conf spark.pyspark.virtualenv.enabled=true \
  --conf spark.pyspark.virtualenv.type=native \
  --conf spark.pyspark.virtualenv.requirements=requirements.txt \
  --conf spark.pyspark.virtualenv.bin.path=/opt/gavel3/environments/gavel_env/bin/virtualenv \
  --conf spark.pyspark.python=/usr/bin/python3.6 \
  --py-files dependencies.zip \
  Heatmap.py 101_TICKETING_TRANSACTIONS 172.19.2.13:12181

Running PySpark on YARN

spark-submit --master yarn --deploy-mode cluster \
  --queue default \
  --num-executors 20 --executor-memory 1G --executor-cores 2 --driver-memory 1G \
  Heatmap.py 101_TICKETING_TRANSACTIONS 172.19.2.13:12181

Installing the Winepi and estnltk packages

https://estnltk.github.io/estnltk/1.4.1/index.html
https://github.com/estnltk/estnltk

Spark submit from docker

https://dzone.com/articles/running-apache-spark-applications-in-docker-contai

Python installation


pip install cryptography==2.3.1
pip install matplotlib==3.0.2
pip install mlxtend==0.14.0
pip install nltk==3.3
pip install numpy==1.15.3
pip install pandas==0.23.4
pip install pytz==2018.5
pip install requests==2.20.0
pip install rpy2==2.9.4
pip install scikit-learn==0.20.0
pip install scipy==1.1.0
pip install tzlocal==1.5.1
pip install urllib3==1.24
pip install pyspark==2.4.0
pip install phoenixdb==0.6
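After installing, a quick sanity check can confirm the packages actually import. This is a small sketch (not part of the original setup); it only reports which packages are missing instead of failing outright, and the list below mirrors the pip installs above (scikit-learn imports as `sklearn`):

```python
# Post-install sanity check: report which of the pinned packages fail to import.
# Adjust PACKAGES to match your environment.
import importlib

PACKAGES = ["numpy", "pandas", "nltk", "scipy", "sklearn", "pyspark", "phoenixdb"]

def missing_packages(names):
    """Return the subset of names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

if __name__ == "__main__":
    print("missing:", missing_packages(PACKAGES))
```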

Accessing the previous row of a DataFrame

Replace based on condition

https://stackoverflow.com/questions/45011320/how-to-get-data-of-previous-row-in-apache-spark

Spark DataFrame replace values from map: https://stackoverflow.com/questions/32000646/extract-column-values-of-dataframe-as-list-in-apache-spark

HDP Spark on YARN: https://hortonworks.com/tutorial/setting-up-a-spark-development-environment-with-python/

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_spark-component-guide/content/run-spark2-sample-apps.html

Set up PySpark, ORC, Hive on Python 3

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -P /tmp/

ln -s /opt/miniconda3/bin/python3.6 /usr/bin/python3

Killing a process based on its port: find the PID listening on the port with netstat, then kill it (36476 below is the PID taken from the netstat output).

 sudo netstat -lutnp | grep -w '4041'
 sudo netstat -lutnp | grep -w '4042'
 sudo netstat -lutnp | grep -w '4043'
 sudo kill -9 36476