
.. _theano:

******
Theano
******

Pointers
--------

- http://deeplearning.net/software/theano/
- Announcements mailing list: http://groups.google.com/group/theano-announce
- User mailing list: http://groups.google.com/group/theano-users
- Deep Learning Tutorials: http://www.deeplearning.net/tutorial/
- Installation: https://deeplearning.net/software/theano/install.html


Description
-----------

- Mathematical symbolic expression compiler
- Dynamic C/CUDA code generation
- Efficient symbolic differentiation
 
  - Theano computes derivatives of functions with one or many inputs.

- Speed and stability optimizations

  - Gives the right answer for ``log(1+x)`` even when ``x`` is really tiny (see the sketch after this list)
  
- Works on Linux, Mac and Windows
- Transparent use of a GPU

  - float32 only for now (working on other data types)
  - Doesn't work on Windows for now
  - On the GPU, data-intensive calculations are typically 6.5x to 44x faster; we have seen speedups of up to 140x.

- Extensive unit-testing and self-verification

  - Detects and diagnoses many types of errors
  
- On CPU, common machine learning algorithms are 1.6x to 7.5x faster than competitive alternatives

  - including specialized implementations in C/C++, NumPy, SciPy, and Matlab

- Expressions mimic NumPy's syntax & semantics
- Statically typed and purely functional
- Some sparse operations (CPU only)
- The project was started by James Bergstra and Olivier Breuleux
- For the past one to two years, I have replaced Olivier as the lead contributor
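
As a small illustration of the stability claim in the list above (a hedged sketch; it assumes the ``log(1+x)`` rewrite is applied as described):

.. code-block:: python

  import numpy
  import theano
  import theano.tensor as T

  x = T.dscalar("x")
  f = theano.function([x], T.log(1 + x))
  print f(1e-20)              # ~1e-20: Theano rewrites log(1+x) to a stable form
  print numpy.log(1 + 1e-20)  # 0.0: 1 + 1e-20 rounds to 1.0 in float64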

Simple example
--------------

>>> import theano
>>> a = theano.tensor.vector("a")      # declare symbolic variable
>>> b = a + a**10                      # build symbolic expression
>>> f = theano.function([a], b)        # compile function
>>> print f([0, 1, 2])                 # prints array([ 0., 2., 1026.])


==================================  ==================================
        Unoptimized graph                    Optimized graph
==================================  ==================================
.. image:: pics/f_unoptimized.png   .. image:: pics/f_optimized.png
==================================  ==================================

**Symbolic programming**

- Paradigm shift: people need to use it to understand it

Exercise 1
-----------

.. code-block:: python

  import theano
  a = theano.tensor.vector("a") # declare variable
  b = a + a**10                 # build symbolic expression
  f = theano.function([a], b)   # compile function
  print f([0, 1, 2])
  # prints array([ 0., 2., 1026.])
  
  theano.printing.pydotprint_variables(b, outfile="f_unoptimized.png", var_with_name_simple=True)
  theano.printing.pydotprint(f, outfile="f_optimized.png", var_with_name_simple=True)

Modify and execute the example to compute the expression ``a**2 + b**2 + 2*a*b`` (you will need a second input variable).

Real example
------------

**Logistic Regression**

- GPU-ready
- Symbolic differentiation
- Speed optimizations
- Stability optimizations

.. code-block:: python

  import numpy
  import theano
  import theano.tensor as T
  rng = numpy.random
  
  N = 400
  feats = 784
  D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
  training_steps = 10000
  
  # Declare Theano symbolic variables
  x = T.matrix("x")
  y = T.vector("y")
  w = theano.shared(rng.randn(feats), name="w")
  b = theano.shared(0., name="b")
  print "Initial model:"
  print w.get_value(), b.get_value()

  # Construct Theano expression graph
  p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b))     # Probability that target = 1
  prediction = p_1 > 0.5                    # The prediction thresholded
  xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy loss function
  cost = xent.mean() + 0.01*(w**2).sum()    # The cost to minimize
  gw,gb = T.grad(cost, [w,b])

  # Compile
  train = theano.function(
            inputs=[x,y],
            outputs=[prediction, xent],
            updates={w:w-0.1*gw, b:b-0.1*gb})
  predict = theano.function(inputs=[x], outputs=prediction)

  # Train
  for i in range(training_steps):
      pred, err = train(D[0], D[1])

  print "Final model:"
  print w.get_value(), b.get_value()
  print "target values for D:", D[1]
  print "prediction on D:", predict(D[0])


**Optimizations:**

.. code-block:: python

  p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b))
  # 1 / (1 + T.exp(var)) -> sigmoid(var)
  xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)
  # log(1-sigmoid(var)) -> -softplus(var)
  prediction = p_1 > 0.5
  cost = xent.mean() + 0.01*(w**2).sum()
  gw,gb = T.grad(cost, [w,b])

  train = theano.function(
            inputs=[x,y],
            outputs=[prediction, xent],
            # w-0.1*gw: GEMV with the dot in the grad
            updates={w:w-0.1*gw, b:b-0.1*gb})


Where are those optimizations applied? One way to check is sketched after this list.

- ``log(1+exp(x))``
- ``1 / (1 + T.exp(var))`` (sigmoid)
- ``log(1-sigmoid(var))`` (softplus, stabilization)
- GEMV (matrix-vector multiply from BLAS)
- Loop fusion
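
One way to see which rewrites fired is to print the compiled graph with ``theano.printing.debugprint`` (a minimal sketch; the exact printout depends on your Theano version):

.. code-block:: python

  import theano
  import theano.tensor as T

  x = T.vector("x")
  f = theano.function([x], 1 / (1 + T.exp(-x)))
  # After optimization the graph contains a single sigmoid node
  # instead of the neg/exp/add/div chain written above.
  theano.printing.debugprint(f)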


Theano flags
------------

Theano can be configured with flags. They can be defined in two ways:

- With an environment variable: ``THEANO_FLAGS="mode=ProfileMode,ProfileMode.profile_memory=True"``
- With a configuration file, which defaults to ``~/.theanorc``
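
For example, the same flags as in the environment variable above could be written to the configuration file instead (a hedged sketch; the flag-to-section mapping follows the usual INI layout of the Theano config file):

.. code-block:: cfg

  [global]
  mode = ProfileMode

  [ProfileMode]
  profile_memory = True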


Exercise 2
-----------

.. code-block:: python
    
    import numpy
    import theano
    import theano.tensor as T
    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats).astype(theano.config.floatX),
         rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = T.matrix("x")
    y = T.vector("y")
    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
    x.tag.test_value = D[0]
    y.tag.test_value = D[1]
    #print "Initial model:"
    #print w.get_value(), b.get_value()


    # Construct Theano expression graph
    p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b))     # Probability that target = 1
    prediction = p_1 > 0.5                    # The prediction: 0 or 1
    xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
    cost = xent.mean() + 0.01*(w**2).sum()    # The cost to minimize
    gw,gb = T.grad(cost, [w,b])

    # Compile expressions to functions
    train = theano.function(
                inputs=[x, y],
                outputs=[prediction, xent],
                updates={w: w-0.01*gw, b: b-0.01*gb},
                name="train")
    predict = theano.function(inputs=[x], outputs=prediction,
                name="predict")

    if any([node.op.__class__.__name__ == 'Gemv'
            for node in train.maker.env.toposort()]):
        print 'Used the cpu'
    elif any([node.op.__class__.__name__ == 'GpuGemm'
              for node in train.maker.env.toposort()]):
        print 'Used the gpu'
    else:
        print 'ERROR, not able to tell if theano used the cpu or the gpu'
        print train.maker.env.toposort()



    for i in range(training_steps):
        pred, err = train(D[0], D[1])
    #print "Final model:"
    #print w.get_value(), b.get_value()

    print "target values for D"
    print D[1]

    print "prediction on D"
    print predict(D[0])

    # Print the graph used in the slides
    theano.printing.pydotprint(predict,
                               outfile="pics/logreg_pydotprint_predic.png",
                               var_with_name_simple=True)
    theano.printing.pydotprint_variables(prediction,
                               outfile="pics/logreg_pydotprint_prediction.png",
                               var_with_name_simple=True)
    theano.printing.pydotprint(train,
                               outfile="pics/logreg_pydotprint_train.png",
                               var_with_name_simple=True)

Modify and execute the example to run on the CPU with ``floatX=float32``.

* You will need ``theano.config.floatX`` and ``ndarray.astype()`` (it accepts a dtype string such as ``"float32"``)

GPU
---

- Only 32 bit floats are supported (being worked on)
- Only 1 GPU per process
- Use the Theano flag ``device=gpu`` to tell Theano to use the GPU
  
  - Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one
  - Shared variables with float32 dtype are by default moved to the GPU memory space

- Use the Theano flag ``floatX=float32``

  - Be sure to use ``floatX`` (``theano.config.floatX``) in your code
  - Cast inputs before putting them into a shared variable
  - Casting pitfall: int32 mixed with float32 upcasts to float64 (see the sketch after this list)

    - A new casting mechanism is being developed
    - Insert manual casts in your code or use [u]int{8,16}
    - Insert a manual cast around the mean operator (it divides by the length, which is an int64!)
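
A minimal sketch of the pitfall and of a manual cast (``T.cast`` is part of Theano; the upcast behaviour is as described above):

.. code-block:: python

  import theano
  import theano.tensor as T

  v = T.fvector("v")   # float32 vector
  i = T.iscalar("i")   # int32 scalar

  print (v * i).dtype  # float64: int32 mixed with float32 upcasts
  print (v * T.cast(i, 'float32')).dtype  # float32: the cast avoids the upcast

  # The mean divides by the length (an int64), so cast the result
  # back to floatX to stay in float32 on the GPU:
  m = T.cast(v.mean(), theano.config.floatX)

Run your script with, e.g., ``THEANO_FLAGS=device=gpu0,floatX=float32 python file.py`` to combine these flags.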



Exercise 3
-----------

- Modify and execute the example of `Exercise 2`_ to run with floatX=float32 on GPU
- Time with: ``time python file.py``

Symbolic variables
------------------

- Number of dimensions

  - T.scalar, T.vector, T.matrix, T.tensor3, T.tensor4

- Dtype

  - T.[fdczbwil]vector (float32, float64, complex64, complex128, int8, int16, int32, int64)
  - T.vector defaults to the floatX dtype
  - floatX: a configurable dtype that can be float32 or float64

- Custom variables

  - All of the above are shortcuts for ``T.tensor(dtype, broadcastable=[False]*nd)``, as sketched below
  - Other dtypes: uint{8,16,32,64}, floatX
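
For instance (a hedged sketch using the generic constructor named above):

.. code-block:: python

  import theano
  import theano.tensor as T

  # These two declarations create the same kind of variable:
  a = T.matrix("a")
  b = T.tensor(theano.config.floatX, broadcastable=[False, False])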

Creating symbolic variables: broadcastability

  - Remember what I said about broadcasting?
  - How do you add a row to all rows of a matrix?
  - How do you add a column to all columns of a matrix?


- Broadcastability must be specified when the variable is created
- The only shortcuts with a broadcastable dimension are **T.row** and **T.col**
- For all the others, use ``T.tensor(dtype, broadcastable=[...])`` with one ``True`` or ``False`` per dimension (see the sketch below)
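
A hedged sketch answering the row question above with ``T.row`` (a matrix declared broadcastable along its first dimension; it assumes the default ``floatX=float64``):

.. code-block:: python

  import numpy
  import theano
  import theano.tensor as T

  m = T.matrix("m")
  r = T.row("r")  # shape (1, n); the first dimension broadcasts
  f = theano.function([m, r], m + r)
  print f(numpy.ones((2, 3)), numpy.arange(3.).reshape(1, 3))
  # r is added to every row of m:
  # array([[ 1., 2., 3.], [ 1., 2., 3.]])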


Differentiation details
-----------------------

>>> gw, gb = T.grad(cost, [w, b])

- T.grad works symbolically: it takes and returns Theano variables
- T.grad can be compared to a macro: it can be applied repeatedly
- T.grad accepts only scalar costs
- A simple recipe lets you compute vector-Jacobian and vector-Hessian products efficiently (see the sketch below)
- We are working on the missing optimizations to compute the full Jacobian, the full Hessian, and Jacobian-times-vector products efficiently
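
A hedged sketch of that recipe, using only ``T.grad``: differentiate a weighted sum of the outputs (for a vector-Jacobian product) or of the gradient (for a vector-Hessian product):

.. code-block:: python

  import theano
  import theano.tensor as T

  x = T.vector("x")
  v = T.vector("v")

  # v^T * Jacobian(y, x): differentiate the scalar (y * v).sum()
  y = x ** 2
  vJ = T.grad((y * v).sum(), x)

  # v^T * Hessian(cost, x): differentiate (gradient * v).sum()
  cost = (x ** 3).sum()
  g = T.grad(cost, x)
  vH = T.grad((g * v).sum(), x)

  f = theano.function([x, v], [vJ, vH])
  print f([1, 2], [1, 1])  # [array([ 2., 4.]), array([ 6., 12.])]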



Benchmarks
----------

Example:

- Multi-layer perceptron
- Convolutional Neural Networks
- Misc Elemwise operations

Competitors: NumPy + SciPy, MATLAB, EBLearn, Torch5, numexpr

- EBLearn, Torch5: specialized libraries written by practitioners specifically for these tasks
- numexpr: similar to Theano, a 'virtual machine' for elementwise expressions

**Multi-Layer Perceptron**:

A 60x784 matrix times a 784x500 matrix, tanh, times a 500x10 matrix, elementwise operations, then everything in reverse for backpropagation.

.. image:: pics/mlp.png

**Convolutional Network**: 

256x256 images convolved with six 7x7 filters, downsampled to 6x50x50, tanh,
convolution with sixteen 6x7x7 filters, elementwise tanh, matrix multiply,
softmax, then everything in reverse.

.. image:: pics/conv.png

**Elemwise**:

- All on CPU
- Solid blue: Theano
- Dashed red: numexpr (without MKL)

.. image:: pics/multiple_graph.png
