
.. _using_gpu:

=============
Using the GPU
=============

One of the Theano's design goals is to specify computations at an
abstract level, so that the internal function compiler has a lot of flexibility
about how to carry out those computations.  One of the ways we take advantage of
this flexibility is in carrying out calculations on an Nvidia graphics card when
there is a CUDA-enabled device in your computer.

Setting up CUDA
----------------

The first thing you'll need for Theano to use your GPU is Nvidia's
GPU-programming toolchain.  You should install at least the CUDA driver and the CUDA Toolkit, as 
`described here <http://www.nvidia.com/object/cuda_get.html>`_.  The CUDA
Toolkit installs a folder on your computer with subfolders *bin*, *lib*,
*include*, and some more too.  (Sanity check: The *bin* subfolder should contain an *nvcc*
program which is the compiler for GPU code.)  This folder is called the *cuda
root* directory.
On Linux or OS-X >= 10.4, you must add the 'lib' subdirectory (and/or 'lib64' subdirectory if you have a 64-bit
computer) to your ``$LD_LIBRARY_PATH`` environment variable.


Making Theano use CUDA
----------------------

You must tell Theano where the cuda root folder is, and there are three ways
to do it.
Any one of them is enough.

* Define a $CUDA_ROOT environment variable to equal the cuda root directory, as in ``CUDA_ROOT=/path/to/cuda/root``, or
* add a ``cuda.root`` flag to :envvar:`THEANO_FLAGS`, as in ``THEANO_FLAGS='cuda.root=/path/to/cuda/root'``, or
* add a [cuda] section to your .theanorc file containing the option ``root = /path/to/cuda/root``.

Once that is done, the only thing left is to change the ``device`` option to name the GPU device in your
computer.
For example: ``THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=gpu'``.
You can also set the device option in the .theanorc file's ``[global]`` section.

* If your computer has multiple gpu and use 'device=gpu', the driver select the one to use (normally gpu0)
    * You can use the program nvida-smi to change that policy.
* You can choose one specific gpu by giving device the one of those values: gpu0, gpu1, gpu2, or gpu3.  
    * If you have more than 4 devices you are very lucky but you'll have to modify theano's *configdefaults.py* file and define more gpu devices to choose from.
* Using the 'device=gpu*' theano flag make theano fall back to the cpu if their is a problem with the gpu. 
  You can use the flag 'force_device=True' to have theano raise an error when we can't use the gpu.

.. note::
    There is a compatibility issue affecting some Ubuntu 9.10 users, and probably anyone using
    CUDA 2.3 with gcc-4.4.  Symptom: errors about "__sync_fetch_and_add" being undefined.
    **Solution 1:** make gcc-4.3 the default gcc
    (http://pascalg.wordpress.com/2010/01/14/cuda-on-ubuntu-9-10linux-mint-helena/)
    **Solution 2:** make another gcc (e.g. gcc-4.3) the default just for nvcc. 
    Do this by making a directory (e.g. ``$HOME/.theano/nvcc-bindir``) and
    installing two symlinks in it: one called gcc pointing to gcc-4.3 (or lower) and one called
    g++ pointing to g++-4.3 (or lower).  Then add
    ``compiler_bindir = /path/to/nvcc-bindir`` to the ``[nvcc]`` section of your ``.theanorc``
    (`libdoc_config`).
    

Putting it all Together
-------------------------

To see if your GPU is being used, cut and paste the following program into a
file and run it.

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_using_gpu.test_using_gpu_1

.. code-block:: python

    from theano import function, config, shared, sandbox
    import theano.tensor as T
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], T.exp(x))
    t0 = time.time()
    for i in xrange(iters):
        r = f()
    print 'Looping %d times took'%iters, time.time() - t0, 'seconds'
    print 'Result is', r
    print 'Used the','cpu' if numpy.any( [isinstance(x.op,T.Elemwise) for x in f.maker.env.toposort()]) else 'gpu'

The program just computes the exp() of a bunch of random numbers.
Note that we use the `shared` function to
make sure that the input `x` are stored on the graphics device.

If I run this program (in thing.py) with device=cpu, my computer takes a little over 7 seconds,
whereas on the GPU it takes just over 0.4 seconds.  Note that the results are close but not
identical!  The GPU will not always produce the exact same floating-point numbers as the CPU.
As a point of reference, a loop that calls ``numpy.exp(x.value)`` also takes about 7 seconds.

.. code-block:: text

    $ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python thing.py 
    Looping 1000 times took 7.17374897003 seconds
    Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753 1.62323285]

    $ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python thing.py 
    Using gpu device 0: GeForce GTX 285
    Looping 1000 times took 0.418929815292 seconds
    Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761 1.62323296]

Note that for now GPU operations in Theano require floatX to be float32 (see below also).

Returning a handle to device-allocated data
-------------------------------------------

The speedup is not greater in the example above because the function is
returning its result as a numpy ndarray which has already been copied from the
device to the host for your convenience.  This is what makes it so easy to swap in device=gpu, but
if you don't mind being less portable, you might prefer to see a bigger speedup by changing
the graph to express a computation with a GPU-stored result.  The gpu_from_host
Op means "copy the input from the host to the gpu" and it is optimized away
after the T.exp(x) is replaced by a GPU version of exp().

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_using_gpu.test_using_gpu_2

.. code-block:: python

    from theano import function, config, shared, sandbox
    import theano.tensor as T
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
    t0 = time.time()
    for i in xrange(iters):
        r = f()
    print 'Looping %d times took'%iters, time.time() - t0, 'seconds'
    print 'Result is', r
    print 'Numpy result is', numpy.asarray(r)
    print 'Used the','cpu' if numpy.any( [isinstance(x.op,T.Elemwise) for x in f.maker.env.toposort()]) else 'gpu'

The output from this program is

.. code-block:: text

    Using gpu device 0: GeForce GTX 285
    Looping 100 times took 0.185714006424 seconds
    Result is <CudaNdarray object at 0x3e9e970>
    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761 1.62323296]

Here we've shaved off about 50% of the run-time by simply not copying the
resulting array back to the host.
The object returned by each function call is now not a numpy array but a
"CudaNdarray" which can be converted to a numpy ndarray by the normal
numpy casting mechanism.


Running the GPU at Full Speed
------------------------------

To really get maximum performance in this simple example, we need to use an :class:`Out`
instance to tell Theano not to copy the output it returns to us.  Theano allocates memory for
internal use like a working buffer, but by default it will never return a result that is
allocated in the working buffer.  This is normally what you want, but our example is so simple
that it has the un-wanted side-effect of really slowing things down.

.. 
    TODO:
    The story here about copying and working buffers is misleading and potentially not correct
    ... why exactly does borrow=True cut 75% of the runtime ???

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_using_gpu.test_using_gpu_3
.. code-block:: python

    from theano import function, config, shared, sandbox, Out
    import theano.tensor as T
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], 
            Out(sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)),
                borrow=True))
    t0 = time.time()
    for i in xrange(iters):
        r = f()
    print 'Looping %d times took'%iters, time.time() - t0, 'seconds'
    print 'Result is', r
    print 'Numpy result is', numpy.asarray(r)
    print 'Used the','cpu' if numpy.any( [isinstance(x.op,T.Elemwise) for x in f.maker.env.toposort()]) else 'gpu'

Running this version of the code takes just under 0.05 seconds, over 140x faster than
the CPU implementation!

.. code-block:: text

    Using gpu device 0: GeForce GTX 285
    Looping 100 times took 0.0497219562531 seconds
    Result is <CudaNdarray object at 0x31eeaf0>
    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761 1.62323296]

This version of the code ``using borrow=True`` is slightly less safe because if we had saved
the `r` returned from one function call, we would have to take care and remember that its value might
be over-written by a subsequent function call.  Although borrow=True makes a dramatic difference in this example,
be careful!  The advantage of
borrow=True is much weaker in larger graphs, and there is a lot of potential for making a
mistake by failing to account for the resulting memory aliasing.


What can be accelerated on the GPU?
------------------------------------

The performance characteristics will change as we continue to optimize our
implementations, and vary from device to device, but to give a rough idea of
what to expect right now:

* Only computations 
  with float32 data-type can be accelerated. Better support for float64 is expected in upcoming hardware but
  float64 computations are still relatively slow (Jan 2010).  
* Matrix
  multiplication, convolution, and large element-wise operations can be
  accelerated a lot (5-50x) when arguments are large enough to keep 30
  processors busy.  
* Indexing,
  dimension-shuffling and  constant-time reshaping will be equally fast on GPU
  as on CPU.
* Summation 
  over rows/columns of tensors can be a little slower on the GPU than on the CPU
* Copying 
  of large quantities of data to and from a device is relatively slow, and
  often cancels most of the advantage of one or two accelerated functions on
  that data.  Getting GPU performance largely hinges on making data transfer to
  the device pay off.


Tips for improving performance on GPU
--------------------------------------

* Consider 
  adding ``floatX = float32`` to your .theanorc file if you plan to do a lot of
  GPU work.
* Prefer  
  constructors like 'matrix' 'vector' and 'scalar' to 'dmatrix', 'dvector' and
  'dscalar' because the former will give you float32 variables when
  floatX=float32.
* Ensure 
  that your output variables have a float32 dtype and not float64.  The
  more float32 variables are in your graph, the more work the GPU can do for
  you.
* Minimize 
  tranfers to the GPU device by using shared 'float32' variables to store
  frequently-accessed data (see :func:`shared()<shared.shared>`).  When using
  the GPU, 'float32' tensor shared variables are stored on the GPU by default to
  eliminate transfer time for GPU ops using those variables.
* If you aren't happy with the performance you see, try building your functions with 
  mode='PROFILE_MODE'. This should print some timing information at program
  termination (atexit). Is time being used sensibly?   If an Op or Apply is
  taking more time than its share, then if you know something about GPU
  programming have a look at how it's implemented in theano.sandbox.cuda.
  Check the line like 'Spent Xs(X%) in cpu Op, Xs(X%) in gpu Op and Xs(X%) transfert Op'
  that can tell you if not enought of your graph is on the gpu or if their
  is too much memory transfert.

