Hortonworks Data Platform: PySpark on Python 3

In this post you will learn how to set up and configure Hortonworks Data Platform (HDP) to work with Python 3.

Step 1: Using Conda to Install Python 3.6

```
# Prerequisites from the EPEL repository
sudo yum install epel-release
sudo yum install R -y
export LD_LIBRARY_PATH=/usr/local/lib

# Download and run the Miniconda3 installer; when prompted, install to
# /opt/miniconda3 so the symlinks below resolve
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -P /tmp/
bash /tmp/Miniconda3-latest-Linux-x86_64.sh

# Pin Python 3.6.5, then update conda itself and re-apply the pin
conda install python=3.6.5
conda list
conda update conda
conda install python=3.6.5

sudo yum install readline-devel

# Make the conda interpreter the system-wide python3/pip (writing to
# /usr/bin needs root)
sudo ln -s /opt/miniconda3/bin/python3.6 /usr/bin/python3
sudo ln -s /opt/miniconda3/bin/pip /usr/bin/pip
```
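Before moving on, it is worth confirming that the symlinked interpreter actually resolves to Python 3; a quick check, assuming the symlinks above were created:

```shell
# Fail loudly if python3 is not a Python 3 interpreter
python3 -c 'import sys; assert sys.version_info.major == 3, sys.version'
python3 --version
```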

Step 2: Installing Dependencies

1. Create `requirements.txt` in the `/tmp` folder:

```
numpy==1.15.3
pandas==0.23.4
pyspark==2.4.0
```

2. Install the dependencies by running:

```
pip install -r /tmp/requirements.txt
```
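A small loop can confirm that each pinned package is importable by the interpreter Spark will use; it prints the installed version, or MISSING for anything pip did not install (the package list mirrors `requirements.txt` above):

```shell
# Report the installed version of each pinned package, or MISSING
for pkg in numpy pandas pyspark; do
  python3 -c "import $pkg; print('$pkg', $pkg.__version__)" 2>/dev/null \
    || echo "$pkg MISSING"
done
```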

Step 3: Setting Python 3.6 as Spark's Default Environment

```
export SPARK_PRINT_LAUNCH_COMMAND=1
export PYSPARK_PYTHON=python3.6
# Point SPARK_HISTORY_OPTS at the history server's HDFS event-log
# directory, e.g.:
# export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///spark2-history/"
```
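Exports in a shell session are lost on logout; to make the setting permanent for every Spark job, the same variables can go into `spark-env.sh`. The path below is the usual location on an Ambari-managed HDP cluster and may differ on yours; `PYSPARK_DRIVER_PYTHON` is added so the driver and the executors agree on the interpreter:

```shell
# /etc/spark2/conf/spark-env.sh  (typical HDP location; adjust to your cluster)
export PYSPARK_PYTHON=python3.6
export PYSPARK_DRIVER_PYTHON=python3.6
```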