PySpark Kernel for Jupyter with Python 2.7 on Windows

To run PySpark in a Jupyter notebook, follow these steps:

  1. Download Spark and extract it to C:\spark\spark-2.2.0-bin-hadoop2.7\. Do not forget to edit conf/spark-defaults.conf to set the default Spark parameters.
  2. Install Anaconda2, e.g. to C:\Anaconda2
  3. Create a folder C:\Anaconda2\share\jupyter\kernels\pyspark22\ and put a file kernel.json into it with the following contents:
{
  "display_name": "PySpark 2.2 with Python 2",
  "language": "python",
  "argv": [
    "C:\\Anaconda2\\python.exe",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "CAPTURE_STANDARD_OUT": "true",
    "CAPTURE_STANDARD_ERR": "true",
    "SEND_EMPTY_OUTPUT": "false",
    "SPARK_HOME": "C:\\spark\\spark-2.2.0-bin-hadoop2.7",
    "PYTHONPATH": "C:\\spark\\spark-2.2.0-bin-hadoop2.7\\python\\;C:\\spark\\spark-2.2.0-bin-hadoop2.7\\python\\lib\\py4j-0.10.4-src.zip",
    "PYTHONSTARTUP": "C:\\spark\\spark-2.2.0-bin-hadoop2.7\\python\\pyspark\\shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master local[*] pyspark-shell"
  }
}
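
If you would rather not type the escaped Windows paths by hand, the file can also be generated. The following is only a sketch: build_kernel_spec is a hypothetical helper of mine, not part of Jupyter or Spark, and the paths passed to it are the ones assumed in the steps above. It uses ntpath so the backslashes come out right no matter where the script runs.

```python
import json
import ntpath  # Windows path rules, independent of the OS running this script


def build_kernel_spec(spark_home, python_exe, py4j_zip):
    """Hypothetical helper: return the kernel.json contents as a dict."""
    spark_python = ntpath.join(spark_home, "python")
    py4j_path = ntpath.join(spark_python, "lib", py4j_zip)
    return {
        "display_name": "PySpark 2.2 with Python 2",
        "language": "python",
        "argv": [python_exe, "-m", "ipykernel_launcher",
                 "-f", "{connection_file}"],
        "env": {
            "CAPTURE_STANDARD_OUT": "true",
            "CAPTURE_STANDARD_ERR": "true",
            "SEND_EMPTY_OUTPUT": "false",
            "SPARK_HOME": spark_home,
            # Windows separates PYTHONPATH entries with ';'
            "PYTHONPATH": ";".join([spark_python, py4j_path]),
            "PYTHONSTARTUP": ntpath.join(spark_python, "pyspark", "shell.py"),
            "PYSPARK_SUBMIT_ARGS": "--master local[*] pyspark-shell",
        },
    }


# Paths below are the ones assumed in this post; adjust them to your install.
spec = build_kernel_spec(
    spark_home="C:\\spark\\spark-2.2.0-bin-hadoop2.7",
    python_exe="C:\\Anaconda2\\python.exe",
    py4j_zip="py4j-0.10.4-src.zip",
)
kernel_json = json.dumps(spec, indent=2, sort_keys=True)
print(kernel_json)
```

Save the printed text as kernel.json in the kernel folder from step 3. json.dumps takes care of doubling the backslashes.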

However, as soon as you run jupyter notebook and create a notebook with this kernel, you will hit this error:

[E 19:16:23.072 NotebookApp] Unhandled error in API request
Traceback (most recent call last):
File "C:\Anaconda2\lib\site-packages\notebook\base\handlers.py", line 516, in wrapper
result = yield gen.maybe_future(method(self, *args, **kwargs))
File "C:\Anaconda2\lib\site-packages\tornado\gen.py", line 1055, in run
value = future.result()
File "C:\Anaconda2\lib\site-packages\tornado\concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "C:\Anaconda2\lib\site-packages\tornado\gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "C:\Anaconda2\lib\site-packages\notebook\services\sessions\handlers.py", line 75, in post
type=mtype))
File "C:\Anaconda2\lib\site-packages\tornado\gen.py", line 1055, in run
value = future.result()
File "C:\Anaconda2\lib\site-packages\tornado\concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "C:\Anaconda2\lib\site-packages\tornado\gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "C:\Anaconda2\lib\site-packages\notebook\services\sessions\sessionmanager.py", line 79, in create_session
kernel_id = yield self.start_kernel_for_session(session_id, path, name, type, kernel_name)
File "C:\Anaconda2\lib\site-packages\tornado\gen.py", line 1055, in run
value = future.result()
File "C:\Anaconda2\lib\site-packages\tornado\concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "C:\Anaconda2\lib\site-packages\tornado\gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "C:\Anaconda2\lib\site-packages\notebook\services\sessions\sessionmanager.py", line 92, in start_kernel_for_session
self.kernel_manager.start_kernel(path=kernel_path, kernel_name=kernel_name)
File "C:\Anaconda2\lib\site-packages\tornado\gen.py", line 1055, in run
value = future.result()
File "C:\Anaconda2\lib\site-packages\tornado\concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "C:\Anaconda2\lib\site-packages\tornado\gen.py", line 307, in wrapper
yielded = next(result)
File "C:\Anaconda2\lib\site-packages\notebook\services\kernels\kernelmanager.py", line 94, in start_kernel
super(MappingKernelManager, self).start_kernel(**kwargs)
File "C:\Anaconda2\lib\site-packages\jupyter_client\multikernelmanager.py", line 110, in start_kernel
km.start_kernel(**kwargs)
File "C:\Anaconda2\lib\site-packages\jupyter_client\manager.py", line 257, in start_kernel
**kw)
File "C:\Anaconda2\lib\site-packages\jupyter_client\manager.py", line 203, in _launch_kernel
return launch_kernel(kernel_cmd, **kw)
File "C:\Anaconda2\lib\site-packages\jupyter_client\launcher.py", line 138, in launch_kernel
proc = Popen(cmd, **kwargs)
File "C:\Anaconda2\lib\subprocess.py", line 390, in __init__
errread, errwrite)
File "C:\Anaconda2\lib\subprocess.py", line 640, in _execute_child
startupinfo)
TypeError: environment can only contain strings

The problem is that C:\Anaconda2\Lib\site-packages\jupyter_client\kernelspec.py reads kernel.json with the json module, which in Python 2 returns every string as unicode, while C:\Anaconda2\Lib\site-packages\jupyter_client\launcher.py calls

proc = Popen(cmd, **kwargs)

which fails if kwargs['env'] contains a unicode key or value.
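
To see the mismatch outside of Jupyter, here is a small standalone sketch. native_str_env is a hypothetical helper name of mine, not a jupyter_client function. It parses an env block the way json does and then converts every key and value to the native str type that Popen expects. Under Python 3 there is nothing to convert, so the function simply copies the dict.

```python
import json
import sys

PY2 = sys.version_info[0] == 2


def native_str_env(env):
    """Hypothetical helper: copy env with all keys and values as native str."""
    fixed = {}
    for key, value in env.items():
        if PY2:
            # Python 2 only: json yields unicode, Popen wants byte strings.
            # Strict ASCII encoding raises on non-ASCII instead of dropping it.
            if not isinstance(key, str):
                key = key.encode("ascii")
            if not isinstance(value, str):
                value = value.encode("ascii")
        fixed[key] = value
    return fixed


# A cut-down env block, parsed the same way kernel.json is parsed.
spec = json.loads(
    r'{"env": {"SPARK_HOME": "C:\\spark\\spark-2.2.0-bin-hadoop2.7"}}'
)
env = native_str_env(spec["env"])
print(all(type(k) is str and type(v) is str for k, v in env.items()))
```

Unlike the 'ignore' error handler, the strict encode used here fails loudly if a path contains non-ASCII characters, which I find easier to debug than a silently shortened path.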

Therefore, you need to modify C:\Anaconda2\Lib\site-packages\jupyter_client\launcher.py so that the environment is converted just before the Popen call:

    try:
        # Ihor Bobak:  fix to convert all env keys and values to str
        # (Python 2 only: keys() returns a list, so [:] makes a snapshot copy)
        klist = kwargs['env'].keys()[:]
        for key in klist:
            value = kwargs['env'][key]
            if isinstance(key, unicode) or isinstance(value, unicode):
                # 'ignore' silently drops any non-ASCII characters
                newkey = key.encode('ascii','ignore')
                newvalue = value.encode('ascii','ignore')
                del kwargs['env'][key]
                kwargs['env'][newkey] = newvalue
        # regular code
        proc = Popen(cmd, **kwargs)
    except Exception as exc:
        msg = (
            "Failed to run command:\n{}\n"
            "    PATH={!r}\n"
            "    with kwargs:\n{!r}\n"
        )
        # ... the rest of the original except block stays unchanged
