In this post you will learn how to set up and configure HDP (Hortonworks Data Platform) to work with Python 3.

Setup Miniconda 3.6.5 version

Installing Python on CentOS

Fix for R package installation issue

Good Pyspark Tutorial

Real-Time Kafka Data Ingestion into HBase via PySpark

Changing default python version to 3.6

I would suggest using 'alternatives' instead. As super-user (root), run the following: start by registering python2 as an alternative, then python3, and pick the default.

alternatives --install /usr/bin/python python /usr/bin/python2 50
alternatives --install /usr/bin/python python /usr/bin/python3.5 60
alternatives --config python

Creating and activating environment:

python3.6 -m venv gavel_env

Go to the /opt/gavel/environments folder and run the following command to activate the environment.

source gavel_env/bin/activate

Set up python3.6 as the Spark default environment:

export SPARK_PRINT_LAUNCH_COMMAND=1
export PYSPARK_PYTHON=python3.6
export SPARK_HISTORY_OPTS=   # set to hdfs file

Where I left off:

Jupyter Notebook

From the command prompt, run the command below in the desired folder to open Jupyter Notebook. Use SHIFT + ENTER to execute a cell.

jupyter notebook

Conda create virtual environment

You can create a virtual environment on Windows with the command below:

conda create --name virtual_env_name package_name
conda create --name siva numpy

Running Pyspark on Spark

Adding dependencies for PySpark to run in Spark

Spark submit for PySpark with --py-files

spark-submit --master yarn --deploy-mode client \
  --name "Edelman heatmap" \
  --jars phoenix-spark-4.13.1-HBase-1.2.jar,phoenix-4.13.1-HBase-1.2-client.jar \
  --conf spark.pyspark.virtualenv.enabled=true \
  --conf spark.pyspark.virtualenv.type=native \
  --conf spark.pyspark.virtualenv.requirements=requirements.txt \
  --conf spark.pyspark.virtualenv.bin.path=/opt/gavel3/environments/gavel_env/bin/virtualenv \
  --conf spark.pyspark.python=/usr/bin/python3.6 \
  --py-files 101_TICKETING_TRANSACTIONS

Running Pyspark on Yarn

spark-submit --master yarn --deploy-mode cluster --queue default \
  --num-executors 20 --executor-memory 1G --executor-cores 2 --driver-memory 1G \
  101_TICKETING_TRANSACTIONS

Installing Winepi and estnltk packages

Spark submit from Docker

Python installation


pip install cryptography==2.3.1
pip install  matplotlib==3.0.2
pip install  mlxtend==0.14.0
pip install  nltk==3.3
pip install numpy==1.15.3
pip install pandas==0.23.4
pip install pytz==2018.5
pip install requests==2.20.0
pip install rpy2==2.9.4
pip install scikit-learn==0.20.0
pip install scipy==1.1.0
pip install tzlocal==1.5.1
pip install urllib3==1.24
pip install pyspark==2.4.0
pip install phoenixdb==0.6

Accessing previous Row of Dataframe

Replace based on condition

Spark Dataframe replace values from map:

HDP Spark Yarn

Setup PySpark, ORC, and Hive on Python 3

wget -P /tmp/

ln -s /opt/miniconda3/bin/python3.6 /usr/bin/python3

Killing the process based on its port:

sudo netstat -lutnp | grep -w '4041'
sudo netstat -lutnp | grep -w '4042'
sudo netstat -lutnp | grep -w '4043'
# kill the PID reported in the netstat output (36476 here)
sudo kill -9 36476