Install Anaconda in Cloudera

In Cloudera Manager, go to the Parcels tab under Hosts.

Go to Configuration and add another remote parcel repository URL for Anaconda:

https://repo.continuum.io/pkgs/misc/parcels/

Click Save Changes.

The Anaconda parcel will now appear in the list, and you can download it and distribute it to your Hadoop cluster.

After the distribution process is done, you can run your PySpark job with Anaconda's Python interpreter by setting PYSPARK_PYTHON in front of the spark-submit command.

PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda-2019.10/bin/python spark-submit count.py
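
If you launch jobs from a Python script rather than typing the command by hand, the same idea can be sketched roughly as below. This is only an illustration: pyspark_submit_command is a made-up helper name, not part of Spark or Cloudera, and the parcel directory is simply the one from the command above, so adjust it to the version you actually distributed. The sketch only builds and prints the command; it does not start a job.

```python
import os
import posixpath
import shlex


def pyspark_submit_command(parcel_dir, script, extra_args=()):
    """Return (env, argv) for running spark-submit with the parcel's Python.

    Hypothetical helper for illustration; parcel_dir is the Anaconda parcel
    directory on the cluster hosts.
    """
    # The interpreter shipped with the parcel lives under <parcel_dir>/bin/python.
    python_bin = posixpath.join(parcel_dir, "bin", "python")

    # Copy the current environment and point PySpark at that interpreter.
    env = dict(os.environ)
    env["PYSPARK_PYTHON"] = python_bin

    # Any extra spark-submit options go before the script name.
    argv = ["spark-submit", *extra_args, script]
    return env, argv


# Assumed path: taken from the command above, change it to match your parcel version.
env, argv = pyspark_submit_command("/opt/cloudera/parcels/Anaconda-2019.10", "count.py")

# Show the equivalent shell command line.
print("PYSPARK_PYTHON=" + env["PYSPARK_PYTHON"], " ".join(shlex.quote(a) for a in argv))
```

To actually launch the job from the script, you could pass the result to subprocess.run(argv, env=env, check=True), assuming spark-submit is on the PATH of the host you run it from.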

Hope it helps.
