
.. _introduction:


************
Introduction
************

Background Questionnaire
------------------------

 * Who has used Theano before?

   * What did you do with it?

 * Who has used Python? numpy? scipy? matplotlib?

 * Who has used IPython?

   * Who has used it as a distributed computing engine?

 * Who has done C/C++ programming?

 * Who has organized computation around a particular physical memory layout?

 * Who has used a multidimensional array of >2 dimensions?

 * Who has written a Python module in C before?

   * Who has written a program to *generate* Python modules in C?

 * Who has used a templating engine?

 * Who has programmed a GPU before?

   * Using OpenGL / shaders ?

   * Using CUDA (runtime? / driver?)

   * Using PyCUDA ?

   * Using OpenCL / PyOpenCL ?

   * Other?

 * Who has used Cython?


Python in one slide
-------------------

Features:

 * General-purpose high-level OO interpreted language
 
 * Emphasizes code readability
 
 * Comprehensive standard library
 
 * Dynamic type and memory management

 * builtin types: int, float, str, list, dict, tuple, object

Syntax sample:

.. code-block:: python

    a = {'a': 5, 'b': None}   # dictionary of two elements
    b = [1,2,3]               # list of three int literals

    def foo(b, c=3):          # function w default param c
        return a['a'] + b + c # note scoping, indentation



 * List comprehension: ``[i+3 for i in range(10)]``
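
As a small illustration of the features listed above, the following sketch
exercises a default parameter, name scoping, and list comprehensions. The
names ``scale`` and ``shift`` are made up for this example.

.. code-block:: python

    scale = 10                       # module-level (global) name

    def shift(values, offset=3):     # `offset` has a default value
        # `scale` is not a parameter, so Python looks it up in the
        # enclosing (module) scope; `values` and `offset` are local.
        return [scale * v + offset for v in values]

    default_result = shift([1, 2, 3])            # uses offset=3
    custom_result = shift([1, 2, 3], offset=0)   # overrides the default

    # A comprehension can also filter with a trailing `if`.
    squares_of_evens = [i * i for i in range(10) if i % 2 == 0]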

Numpy in one slide
------------------

 * Python floats are full-fledged objects on the heap

   * Not suitable for high-performance computing!

 * Numpy provides an N-dimensional numeric array in Python

   * Perfect for high-performance computing.

 * Numpy provides:

   * elementwise computations

   * linear algebra, Fourier transforms

   * pseudorandom numbers from many distributions

 * Scipy provides lots more, including:

   * more linear algebra

   * solvers and optimization algorithms

   * matlab-compatible I/O

   * I/O and signal processing for images and audio

Here are the properties of numpy arrays that you really need to know.

.. code-block:: python

    import numpy as np
    a = np.random.rand(3,4,5)
    a32 = a.astype('float32')

    a.ndim     # int: 3
    a.shape    # tuple: (3,4,5)
    a.size     # int: 60
    a.dtype    # np.dtype object: 'float64'
    a32.dtype  # np.dtype object: 'float32'

These arrays can be combined with numeric operators and standard mathematical
functions. Numpy has great documentation.
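
To make "combined with numeric operators" concrete, here is a minimal sketch
on a tiny array. The variable names are illustrative only; the calls used
(``np.arange``, ``reshape``, ``sum``, ``np.dot``, ``np.sqrt``) are standard
numpy.

.. code-block:: python

    import numpy as np

    a = np.arange(6, dtype='float64').reshape(2, 3)   # [[0,1,2],[3,4,5]]
    row = np.array([10.0, 20.0, 30.0])                # shape (3,)

    elementwise = a * 2 + 1     # operators apply to every element
    broadcast = a + row         # `row` is broadcast across both rows of `a`
    col_sums = a.sum(axis=0)    # reduce over the first axis
    product = np.dot(a, a.T)    # matrix product, shape (2, 2)
    roots = np.sqrt(a)          # math functions are applied elementwise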

Training an MNIST-ready classification neural network in pure numpy might look like this:

.. code-block:: python

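    import numpy as np

    batchsize = 50    # illustrative value
    lr = 0.01         # illustrative learning rate

    x = np.load('data_x.npy')   # inputs, assumed shape (n_examples, 784)
    y = np.load('data_y.npy')   # targets, assumed shape (n_examples, 10)
    w = np.random.normal(loc=0, scale=.1,
        size=(784, 500))
    b = np.zeros(500)
    v = np.zeros((500, 10))
    c = np.zeros(10)

    for i in range(1000):
        x_i = x[i*batchsize:(i+1)*batchsize]
        y_i = y[i*batchsize:(i+1)*batchsize]

        # forward pass
        hidin = np.dot(x_i, w) + b
        hidout = np.tanh(hidin)

        outin = np.dot(hidout, v) + c
        outout = (np.tanh(outin)+1)/2.0

        # squared error and its gradient w.r.t. the output
        g_outout = outout - y_i
        err = 0.5 * np.sum(g_outout**2)

        # backward pass; d/dz (tanh(z)+1)/2 == 2 * out * (1 - out)
        g_outin = g_outout * 2.0 * outout * (1.0 - outout)

        g_hidout = np.dot(g_outin, v.T)
        g_hidin = g_hidout * (1 - hidout**2)

        # gradient descent updates
        b -= lr * np.sum(g_hidin, axis=0)
        c -= lr * np.sum(g_outin, axis=0)
        w -= lr * np.dot(x_i.T, g_hidin)
        v -= lr * np.dot(hidout.T, g_outin)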
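
Hand-derived gradients for a network like this are easy to get wrong. One
common safeguard is a finite-difference check, sketched below on a tiny
network of the same form. The sizes, the seed, and the function names are
hypothetical choices for this example only.

.. code-block:: python

    import numpy as np

    rng = np.random.RandomState(0)
    x = rng.rand(4, 3)                        # 4 examples, 3 inputs
    y = rng.rand(4, 2)                        # 2 targets per example
    w = rng.normal(loc=0, scale=.1, size=(3, 5))
    b = np.zeros(5)
    v = rng.normal(loc=0, scale=.1, size=(5, 2))
    c = np.zeros(2)

    def loss(w):
        hidout = np.tanh(np.dot(x, w) + b)
        outout = (np.tanh(np.dot(hidout, v) + c) + 1) / 2.0
        return 0.5 * np.sum((outout - y) ** 2)

    def analytic_grad_w(w):
        # Backpropagation by hand, using the chain rule layer by layer.
        hidout = np.tanh(np.dot(x, w) + b)
        t = np.tanh(np.dot(hidout, v) + c)
        outout = (t + 1) / 2.0
        g_outin = (outout - y) * (1 - t ** 2) / 2.0
        g_hidin = np.dot(g_outin, v.T) * (1 - hidout ** 2)
        return np.dot(x.T, g_hidin)

    def numeric_grad_w(w, eps=1e-6):
        # Central differences, one weight at a time (slow but simple).
        grad = np.zeros_like(w)
        for idx in np.ndindex(*w.shape):
            w_plus, w_minus = w.copy(), w.copy()
            w_plus[idx] += eps
            w_minus[idx] -= eps
            grad[idx] = (loss(w_plus) - loss(w_minus)) / (2 * eps)
        return grad

    max_abs_diff = np.max(np.abs(analytic_grad_w(w) - numeric_grad_w(w)))

If ``max_abs_diff`` is not close to zero, the hand-written backward pass
most likely contains a mistake.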


What's missing?
---------------

 * Non-lazy evaluation (required by Python) hurts performance

 * Numpy is bound to the CPU

 * Numpy lacks symbolic or automatic differentiation

Here's how the algorithm above looks in Theano. It runs 15 times faster if
you have a GPU (I'm skipping some dtype details, which we'll come back to):

.. code-block:: python

    import numpy as np
    import theano as T
    import theano.tensor as TT

    batchsize = 50    # illustrative value
    lr = 0.01         # illustrative learning rate

    x = np.load('data_x.npy')
    y = np.load('data_y.npy')

    # symbol declarations
    sx = TT.matrix()
    sy = TT.matrix()
    w = T.shared(np.random.normal(loc=0, scale=.1,
        size=(784, 500)))
    b = T.shared(np.zeros(500))
    v = T.shared(np.zeros((500, 10)))
    c = T.shared(np.zeros(10))

    # symbolic expression-building (same model as the numpy version)
    hidout = TT.tanh(TT.dot(sx, w) + b)
    outout = (TT.tanh(TT.dot(hidout, v) + c) + 1) / 2.0
    err = 0.5 * TT.sum((outout - sy)**2)
    gw, gb, gv, gc = TT.grad(err, [w,b,v,c])

    # compile a fast training function
    train = T.function([sx, sy], err,
        updates={
            w:w - lr * gw,
            b:b - lr * gb,
            v:v - lr * gv,
            c:c - lr * gc})

    # now do the computations
    for i in range(1000):
        x_i = x[i*batchsize:(i+1)*batchsize]
        y_i = y[i*batchsize:(i+1)*batchsize]
        err_i = train(x_i, y_i)

    
Theano in one slide
-------------------

 * High-level domain-specific language tailored to numeric computation

 * Compiles most common expressions to C for CPU and GPU.

 * Limited expressivity means lots of opportunities for expression-level optimizations

   * No function calls -> global optimization

   * Strongly typed -> compiles to machine instructions

   * Array oriented -> parallelizable across cores

 * Expression substitution optimizations automatically draw
   on many backend technologies for best performance.

   * FFTW, MKL, ATLAS, Scipy, Cython, CUDA

   * Slower fallbacks always available

 * It used to have no/poor support for internal looping and conditional
   expressions, but these are now quite usable.
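
To give a feel for what "expression-level optimization" and symbolic
differentiation mean, here is a toy expression graph in plain Python. It is
only a sketch of the idea and says nothing about how Theano is implemented;
every class and function name in it is invented for this example.

.. code-block:: python

    class Const:
        def __init__(self, value):
            self.value = value

    class Var:
        def __init__(self, name):
            self.name = name

    class BinOp:
        def __init__(self, left, right):
            self.left, self.right = left, right

    class Add(BinOp):
        pass

    class Mul(BinOp):
        pass

    def grad(e, wrt):
        """Build a new expression for d(e)/d(wrt) using sum and product rules."""
        if isinstance(e, Const):
            return Const(0)
        if isinstance(e, Var):
            return Const(1 if e is wrt else 0)
        if isinstance(e, Add):
            return Add(grad(e.left, wrt), grad(e.right, wrt))
        return Add(Mul(grad(e.left, wrt), e.right),
                   Mul(e.left, grad(e.right, wrt)))

    def simplify(e):
        """Fold constants and remove additions of 0 and multiplications by 0 or 1."""
        if not isinstance(e, BinOp):
            return e
        l, r = simplify(e.left), simplify(e.right)
        if isinstance(l, Const) and isinstance(r, Const):
            if isinstance(e, Add):
                return Const(l.value + r.value)
            return Const(l.value * r.value)
        for a, other in ((l, r), (r, l)):
            if isinstance(a, Const):
                if isinstance(e, Add) and a.value == 0:
                    return other
                if isinstance(e, Mul) and a.value == 0:
                    return Const(0)
                if isinstance(e, Mul) and a.value == 1:
                    return other
        return type(e)(l, r)

    def show(e):
        if isinstance(e, Const):
            return str(e.value)
        if isinstance(e, Var):
            return e.name
        op = ' + ' if isinstance(e, Add) else ' * '
        return '(' + show(e.left) + op + show(e.right) + ')'

    def evaluate(e, env):
        if isinstance(e, Const):
            return e.value
        if isinstance(e, Var):
            return env[e.name]
        l, r = evaluate(e.left, env), evaluate(e.right, env)
        return l + r if isinstance(e, Add) else l * r

    x, y = Var('x'), Var('y')
    expr = Add(Mul(x, x), Mul(Const(3), y))     # x*x + 3*y
    raw = grad(expr, x)                         # unsimplified derivative
    optimized = simplify(raw)                   # same value, smaller graph

Because the whole expression is visible as data before anything is computed,
the derivative and the simplification are both ordinary graph rewrites.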
 

Project status
--------------

 * Mature: Theano has been developed and used since January 2008 (3.5 yrs old)
 * Driven over 40 research papers in the last few years
 * Core technology for a funded Silicon-Valley startup
 * Good user documentation
 * Active mailing list with participants from outside our lab
 * Many contributors (some from outside our lab)
 * Used to teach IFT6266 for two years
 * Used for research at Google and Yahoo.
 * Unofficial RPMs for Mandriva
 * Downloads (on June 8 2011, since last January): PyPI 780, MLOSS: 483, Assembla (`bleeding edge` repository): unknown


Why scripting for GPUs ?
------------------------

They *complement each other*:

- GPUs are everything that scripting/high level languages are not

  - Highly parallel
  - Very architecture-sensitive
  - Built for maximum FP/memory throughput
  - So hard to program that meta-programming is easier.

- CPU: largely restricted to control

  - Optimized for sequential code and low latency (rather than high throughput)
  - Tasks (1000/sec)
  - Scripting fast enough

Best of both: scripted CPU invokes JIT-compiled kernels on GPU.


How Fast are GPUs?
------------------

 - Theory:

   - Intel Core i7 980 XE (107Gf/s float64) 6 cores
   - NVIDIA C2050 (515 Gf/s float64, 1Tf/s float32) 448 cores
   - NVIDIA GTX580 (1.5Tf/s float32) 512 cores
   - GPUs are faster, cheaper, more power-efficient

 - Practice:

   - Depends on algorithm and implementation!
   - Reported speed improvements over CPU in the literature vary *widely* (.01x to 1000x)
   - Matrix-matrix multiply speedup: usually about 10-20x.
   - Convolution speedup: usually about 15x.
   - Elemwise speedup: slower or up to 100x (depending on operation and layout)
   - Sum: can be faster or slower depending on layout.

 - Benchmarking is delicate work...

   - How to control quality of implementation?

     - How much time was spent optimizing CPU vs GPU code?

   - Theano goes up to 100x faster on GPU because its CPU code uses only one core
   - Theano can be linked with multi-core capable BLAS (GEMM and GEMV)

 - If you see speedup > 100x, the benchmark is probably not fair.
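
The rule of thumb about suspicious speedups can be checked against the peak
figures quoted above with a little arithmetic. The sketch below uses only
those numbers, plus one labeled assumption about the CPU's float32 peak; the
helper ``looks_unfair`` is a made-up name for this example.

.. code-block:: python

    # Peak throughput figures quoted above, in GFLOP/s.
    peak_gflops = {
        'Core i7 980 XE (float64)': 107.0,
        'C2050 (float64)': 515.0,
        'C2050 (float32)': 1000.0,
        'GTX580 (float32)': 1500.0,
    }

    # Assumption (not from the figures above): the CPU's float32 peak is
    # about twice its float64 peak, since a SIMD register holds twice as
    # many float32 values as float64 values.
    cpu_f64 = peak_gflops['Core i7 980 XE (float64)']
    cpu_f32_assumed = 2 * cpu_f64

    # Theoretical peak ratios, GPU over CPU.
    ratio_f64 = peak_gflops['C2050 (float64)'] / cpu_f64
    ratio_f32 = peak_gflops['GTX580 (float32)'] / cpu_f32_assumed

    def looks_unfair(measured_speedup, threshold=100.0):
        """Flag a reported speedup above the rule-of-thumb threshold."""
        return measured_speedup > threshold

Under these assumptions the peak ratios are in the single digits, so a
measured speedup far beyond that probably reflects an unoptimized CPU
baseline as much as raw GPU throughput.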


Software for Directly Programming a GPU
---------------------------------------

Theano is a meta-programmer, so it doesn't really count.

 - CUDA: C extension by NVIDIA

   - Vendor-specific
   - Numeric libraries (BLAS, RNG, FFT) maturing.

 - OpenCL: multi-vendor version of CUDA

   - More general, standardized
   - Fewer libraries, less adoption.

 - PyCUDA: Python bindings to the CUDA driver interface

   - Python interface to CUDA
   - Memory management of GPU objects
   - Compilation of code for the low-level driver
   - Makes it easy to do GPU meta-programming from within Python

 - PyOpenCL: PyCUDA for OpenCL
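
GPU meta-programming from Python mostly amounts to generating kernel source
as a string and handing it to a compiler at run time. The sketch below shows
only the generation step, using the standard library's ``string.Template``,
so it runs without a GPU. The template and the helper
``make_elementwise_kernel`` are hypothetical names for this example.

.. code-block:: python

    from string import Template

    # CUDA C source for an elementwise kernel, with placeholders for the
    # kernel name, the element type, and the per-element expression.
    KERNEL_TEMPLATE = Template("""
    __global__ void ${name}(const ${dtype} *x, ${dtype} *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            out[i] = ${expr};
        }
    }
    """)

    def make_elementwise_kernel(name, dtype, expr):
        """Return CUDA C source that applies `expr` (in terms of x[i])."""
        return KERNEL_TEMPLATE.substitute(name=name, dtype=dtype, expr=expr)

    tanh_src = make_elementwise_kernel('tanh_f32', 'float', 'tanhf(x[i])')
    scale_src = make_elementwise_kernel('scale_f64', 'double', '2.0 * x[i]')

With PyCUDA installed, a source string like ``tanh_src`` could then be passed
to ``pycuda.compiler.SourceModule`` for compilation; that step is left out
here because it requires a CUDA-capable device.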
