MiniFlow — Понимание того, как TensorFlow работает на высоком уровне.

Вдохновился курсом Udacity DL год назад :)

Это репост моего блога Github.io, который можно найти

MiniFlow
Веса входного слоя Выходной слой Смещение Функция активации прямого распространения — сигмовидная ошибка обратного распространения…dhiraa.github.io

Пожалуйста, используйте ссылку выше для хорошего форматированного материала. Простите за неудобства.

Я думаю, что среда имеет большую видимость, чем github.io, что является одной из причин этого быстрого взлома, чтобы поделиться своим обучением MiniFlow с другими.

Настоятельно рекомендую использовать ссылку выше, спасибо!

Я столкнулся с некоторыми проблемами с MathJax, и они будут исправлены, когда у меня будет достаточно времени, чтобы поиграть с поддержкой математических символов Latex в Medium :). время ;)

Настраивать

http://jupyter.readthedocs.io/en/latest/install.html

$ pip install jupyter
$ pip install numpy
$ pip install matplotlib

Машинное обучение

Под наблюдением
Без присмотра
Полууправляемый

---------------- 
                                           |                |
                                           v                |
Data -> Preprocessing -> Batching -> Model Training -> Evaluation -> Required Accuracy? -> Deploy

Пробная модель (под наблюдением)

(Input Features, Target(s)) -> Intialize Model      ->    Predict the Target ->      Calculate the Error
                               Parameters Randomly                       ^                            |                                                                              |                            |
                                                                         |                            |
                                                                         |                            |
                                                                         |                            |
                                                                          -------Update Parameters----

Начнем с демо

Жаргон нейронных сетей:

Входной слой
Выходной слой
Веса
Предвзятость
Прямое распространение
Функция активации — сигмовидная
Обратное распространение
Расчет ошибки
Регуляризация
Градиентный спуск

Контур:

Абстрагируйте простой нейронный узел/нейрон
С некоторыми свойствами ввода/вывода
Как сделать пас вперед
Как рассчитать градиенты и выполнить обратный проход
Создайте входной узел
Создайте некоторые операции, такие как Add, Multiply, Liner Transformation, Sigmoid, Mean Square Error.
Узнайте, как выполняется прямой проход во всей операции
Изучите математику, стоящую за обратным распространением (не волнуйтесь, даже мне страшно!)
Узнайте, как рассчитываются градиенты
Узнайте, как это реализовано
SGD — простой алгоритм адаптации весов для предсказания спуска.

Моделирование программного обеспечения против проектирования программного обеспечения

«Моделирование» — это описание того, что вы знаете. Хорошая модель делает правильные утверждения. Выражение научной теории или алгоритма в программном обеспечении.

Проектирование программного обеспечения — это процесс определения программных методов, функций, объектов, а также общей структуры и взаимодействия вашего кода, чтобы итоговая функциональность удовлетворяла требованиям ваших пользователей.

Реализация нейронного узла

import numpy as np
import matplotlib.pyplot as plot
%matplotlib inline
DEBUG = True
class Node(object):
    """
    Base class for nodes in the network.
    Should have following properties:
    1. Should hold its value
    2. Should know what are incoming nodes
    3. Should know to which node(s) it outputs the value
    4. Should hold the gradient calculated
    Arguments:
        `inbound_nodes`: A list of nodes with edges into this node.
    """
    def __init__(self, inbound_nodes=[]):
        """
        Node's constructor (runs when the object is instantiated). Sets
        properties that all nodes need.
        """
        self.name = "Node"
        
        # The eventual value of this node. Set by running
        # the forward() method.
        self.value = None
        
        # A list of nodes with edges into this node.
        # Just like input arguments to any function/method
        self.inbound_nodes = inbound_nodes
        # A list of nodes that this node outputs to.
        # Is it possible to know which node I am gonna send the result? Definelty NO!!!
        self.outbound_nodes = []
        
        # Keys are the inputs to this node and
        # their values are the partials of this node with
        # respect to that input.
        self.gradients = {}
        
        # Sets this node as an outbound node for all of
        # this node's inputs.
        # Hey there I am your output node, do send me your results, ok!
        for node in inbound_nodes:
            node.outbound_nodes.append(self)
    def forward(self):
        """
        Every node that uses this class as a base class will
        need to define its own `forward` method.
        """
        raise NotImplementedError
    def backward(self):
        """
        Every node that uses this class as a base class will
        need to define its own `backward` method.
        """
        raise NotImplementedError
class Input(Node):
    """
    A generic input into the network.
    """
    def __init__(self, name='Input'):
        # The base class constructor has to run to set all
        # the properties here.
        #
        # The most important property on an Input is value.
        # self.value is set during `topological_sort` later.
        Node.__init__(self)
        self.name = name
    # NOTE: Input node is the only node where the value
    # may be passed as an argument to forward().
    #
    # All other node implementations should get the value
    # of the previous node from self.inbound_nodes
    #
    # Example:
    # val0 = self.inbound_nodes[0].value
    def forward(self, value=None):
        # Overwrite the value if one is passed in.
        if(DEBUG) : print("\n----->Forward pass @ " ,self.name)
        if value is not None:
            self.value = value
            if(DEBUG) : print("w.r.t {} node of value: {} ".format(self.name, self.value))
            
    def backward(self):
        # An Input node has no inputs so the gradient (derivative)
        # is zero.
        # The key, `self`, is reference to this object.
        self.gradients = {self: 0}
        # Weights and bias may be inputs, so you need to sum
        # the gradient from output gradients.
        if(DEBUG) : print('\n')
        if(DEBUG) : print('=============================\n\tBP @ {}\n=============================\n'.format(self.name))
        if(DEBUG) : print('Initial Gradients:\n------------------')
        if(DEBUG) : print('W.r.t {}: \n------------\n{}'.format(self.name,self.gradients[self]))
            
        for n in self.outbound_nodes:
            grad_cost = n.gradients[self]
            
            if(DEBUG) : print('\n')
            if(DEBUG) : print('Getting ', n.name, 'gradient : \n<-----------------------------\n', grad_cost)
            if(DEBUG) : print('\n')
                
            self.gradients[self] += grad_cost * 1
            
        if(DEBUG) : print('Calculated Final Gradient:(Note: Calculated by next node in the graph!!!)\n----------------')
        if(DEBUG) : print('W.r.t ', self.name, ' : \n-------------\n', self.gradients[self])

ДАГ

Топологическая сортировка:

https://en.wikipedia.org/wiki/Топологическая_сортировка
http://www.geeksforgeeks.org/topological-sorting/ (о да, мне нужно сохранить это при подготовке к интервью!)
Чтобы определить свою сеть, вам нужно определить порядок операций для ваших узлов. Учитывая, что входные данные для одного узла зависят от выходных данных других, вам необходимо сгладить график таким образом, чтобы все входные зависимости для каждого узла были разрешены, прежде чем пытаться запустить его вычисление. Это метод, называемый топологической сортировкой.

T_DEBUG = True
def topological_sort(feed_dict):
    """
    Sort the nodes in topological order using Kahn's Algorithm.
    `feed_dict`: A dictionary where the key is a `Input` Node and the value is 
    the respective value feed to that Node.
    Returns a list of sorted nodes.
    """
    if T_DEBUG: print('-----> topological_sort')
    input_nodes = [n for n in feed_dict.keys()]
    G = {}
    nodes = [n for n in input_nodes]
    
    if T_DEBUG: print('Input Nodes:'); [print(n.name) for n in input_nodes]
        
    while len(nodes) > 0:
        n = nodes.pop(0)
        
        if T_DEBUG: print('Pop: ', n.name)
            
        if n not in G:
            if T_DEBUG: print('Adding: ', n.name, 'to the Graph')
            G[n] = {'in': set(), 'out': set()}
            
        for m in n.outbound_nodes:
            if m not in G: 
                if T_DEBUG: print('Adding: ', m.name, 'to the Graph')
                G[m] = {'in': set(), 'out': set()}
                
            G[n]['out'].add(m); 
            if T_DEBUG: print('Adding', n.name, '----->', m.name)
                
            G[m]['in'].add(n); 
            if T_DEBUG: print('Adding', m.name, '<-----', n.name)
                
            nodes.append(m)
            if T_DEBUG: print('Appending ', m.name)
          
    L = []
    S = set(input_nodes)
    if T_DEBUG: print('Input Nodes:'); [print(n.name) for n in S]
    while len(S) > 0:
        n = S.pop()
        if T_DEBUG: print('Pop: ', n.name)
        #Assign values to the input node
        if isinstance(n, Input):
            if T_DEBUG: print('Feeding value: ', feed_dict[n], ' =====>  ', n.name)
            n.value = feed_dict[n]
        L.append(n)
        if T_DEBUG: print('Adding ', n.name, 'to the sorted List')
        for m in n.outbound_nodes:
            G[n]['out'].remove(m)
            G[m]['in'].remove(n)
            if T_DEBUG: print('Removing', n.name, '----->', m.name)
            if T_DEBUG: print('Removing', m.name, '<-----', n.name)
            # if no other incoming edges add to S
            if len(G[m]['in']) == 0:
                if T_DEBUG: print('\nNo input nodes!!! Adding: ', m.name, 'to the Graph\n')
                S.add(m)
    
    if T_DEBUG: print('Sorted Nodes:\n'); [print(n.name) for n in L]
    
    if T_DEBUG: print('<------------------------------------ topological_sort')
        
    return L
def forward_pass(output_node, sorted_nodes):
    """
    Performs a forward pass through a list of sorted nodes.
    Arguments:
        `output_node`: A node in the graph, should be the output node (have no outgoing edges).
        `sorted_nodes`: A topologically sorted list of nodes.
    Returns the output Node's value
    """
    for n in sorted_nodes:
        n.forward()
    return output_node.value

Давайте определим наши операции

class Add(Node):
    def __init__(self, *inputs):
        Node.__init__(self, inputs)
        self.name = "Add_Op"
    def forward(self):
        """
        For reference, here's the old way from the last
        quiz. You'll want to write code here.
        """
        self.value = 0
        for i in range(len(self.inbound_nodes)):
            
            if(DEBUG) : print("Initial value of {} is {}".format(self.name, self.value))
                
            self.value +=  self.inbound_nodes[i].value
            
            if(DEBUG) : print("{}:{} ---> {}:{}".format(self.inbound_nodes[i].name, self.inbound_nodes[i].value, 
                                           self.name, self.value))
x, y, z = Input('x'), Input('y'), Input('z')
f = Add(x, y, z)
feed_dict = {x: 4, y: 5, z: 10}
graph = topological_sort(feed_dict)
addition_res = forward_pass(f, graph)
# should output 19
print("{} + {} + {} = {} (according to miniflow)".format(feed_dict[x], feed_dict[y], feed_dict[z], addition_res))
-----> topological_sort
Input Nodes:
x
y
z
Pop:  x
Adding:  x to the Graph
Adding:  Add_Op to the Graph
Adding x -----> Add_Op
Adding Add_Op <----- x
Appending  Add_Op
Pop:  y
Adding:  y to the Graph
Adding y -----> Add_Op
Adding Add_Op <----- y
Appending  Add_Op
Pop:  z
Adding:  z to the Graph
Adding z -----> Add_Op
Adding Add_Op <----- z
Appending  Add_Op
Pop:  Add_Op
Pop:  Add_Op
Pop:  Add_Op
Input Nodes:
z
y
x
Pop:  z
Feeding value:  10  =====>   z
Adding  z to the sorted List
Removing z -----> Add_Op
Removing Add_Op <----- z
Pop:  y
Feeding value:  5  =====>   y
Adding  y to the sorted List
Removing y -----> Add_Op
Removing Add_Op <----- y
Pop:  x
Feeding value:  4  =====>   x
Adding  x to the sorted List
Removing x -----> Add_Op
Removing Add_Op <----- x
No input nodes!!! Adding:  Add_Op to the Graph
Pop:  Add_Op
Adding  Add_Op to the sorted List
Sorted Nodes:
z
y
x
Add_Op
<------------------------------------ topological_sort
----->Forward pass @  z
----->Forward pass @  y
----->Forward pass @  x
Initial value of Add_Op is 0
x:4 ---> Add_Op:4
Initial value of Add_Op is 4
y:5 ---> Add_Op:9
Initial value of Add_Op is 9
z:10 ---> Add_Op:19
4 + 5 + 10 = 19 (according to miniflow)
T_DEBUG = False
class Mul(Node):
    def __init__(self, *inputs):
        Node.__init__(self, inputs)
        self.name = "Mul_Op"
    def forward(self):
        """
        For reference, here's the old way from the last
        quiz. You'll want to write code here.
        """
        self.value = 1
        for i in range(len(self.inbound_nodes)):
            if(DEBUG) : print("Initial value of {} is {}".format(self.name, self.value))
                
            self.value *=  self.inbound_nodes[i].value
            
            if(DEBUG) : print("{}:{} ---> {}:{}".format(self.inbound_nodes[i].name, self.inbound_nodes[i].value, 
                                           self.name, self.value))
            
        # x_value = self.inbound_nodes[0].value
        # y_value = self.inbound_nodes[1].value
        # self.value = x_value + y_value
x, y, z = Input('x'), Input('y'), Input('z')
f1 = Mul(x, y, z)
feed_dict = {x: 4, y: 5, z: 10}
graph = topological_sort(feed_dict)
product = forward_pass(f1, graph)
# should output 19
print("\n{} * {} * {} = {} (according to miniflow)".format(feed_dict[x], feed_dict[y], feed_dict[z], product))
----->Forward pass @  x
----->Forward pass @  z
----->Forward pass @  y
Initial value of Mul_Op is 1
x:4 ---> Mul_Op:4
Initial value of Mul_Op is 4
y:5 ---> Mul_Op:20
Initial value of Mul_Op is 20
z:10 ---> Mul_Op:200
4 * 5 * 10 = 200 (according to miniflow)

Линейное преобразование

Простой искусственный нейрон зависит от трех компонентов:

входы, xi
веса, с
уклон, б

Выход, y

, это просто взвешенная сумма входных данных плюс смещение.

y=∑ixiwi+b

Градиент:

∂y∂X=W

∂y∂W=X ∂y∂b=1

a = np.array([[1,2],[3,4]])
b = np.array([[5,6],[7,8]])
np.dot(a,b)
array([[19, 22],
       [43, 50]])
class Linear(Node):
    """
    Represents a node that performs a linear transform.
    """
    def __init__(self, X, W, b):
        # The base class (Node) constructor. Weights and bias
        # are treated like inbound nodes.
        Node.__init__(self, [X, W, b])
        self.name = "Linear_OP"
    def forward(self):
        """
        Performs the math behind a linear transform.
        """
        self.X = self.inbound_nodes[0]
        self.W = self.inbound_nodes[1]
        self.b = self.inbound_nodes[2]
        self.value = np.dot(self.X.value,self.W.value) + self.b.value
        
        
        if(DEBUG) :  print("\n----->Forward pass @ " ,self.name)
        if(DEBUG) : print("{}:\n{} * \n{}:\n{} + \n{}:\n{} =\n {}:\n{}".format(self.X.name,self.X.value,
                                                                  self.W.name, self.W.value,
                                                                  self.b.name, self.b.value,
                                                                  self.name, self.value))
    def backward(self):
        """
        Calculates the gradient based on the output values.
        """
        # Initialize a partial for each of the inbound_nodes.
        self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
        # Cycle through the outputs. The gradient will change depending
        # on each output, so the gradients are summed over all outputs.
        
        if(DEBUG) : print('\n')
        if(DEBUG) : print('=============================\n\tBP @ Linear\n=============================\n')
        if(DEBUG) : print('Initial Gradients:\n------------------')
        if(DEBUG) : print('W.r.t {}: \n---------------\n{}'.format(self.X.name, self.gradients[self.X]))
        if(DEBUG) : print('W.r.t {}: \n---------------\n{}'.format(self.W.name, self.gradients[self.W]))
        if(DEBUG) : print('W.r.t {}: \n---------------\n{}'.format(self.b.name, self.gradients[self.b]))
            
        for n in self.outbound_nodes:
            # Get the partial of the cost with respect to this node.
            # The out is mostly only one node, a activation function!(sigmoid here)
            grad_cost = n.gradients[self]
            
            if(DEBUG) : print('\n')
            if(DEBUG) : print('Getting ', n.name, 'gradient is : \n<-----------------------------\n', grad_cost)
            if(DEBUG) : print('\n')
                
            # Get the gradient for this node from next node and respective operation 
            # (mutliply/add) with each input of this node to set their respective gradients
            # Set the partial of the loss with respect to this node's inputs.
            self.gradients[self.X] += np.dot(grad_cost, self.W.value.T)
            # Set the partial of the loss with respect to this node's weights.
            self.gradients[self.W] += np.dot(self.X.value.T, grad_cost)
            # Set the partial of the loss with respect to this node's bias.
            self.gradients[self.b] += np.sum(grad_cost, axis=0, keepdims=False)
            
        if(DEBUG) : print('Calculated Final Gradient:\n----------------')
        if(DEBUG) : print('W.r.t ',self.X.name,': \n-------------\n', self.gradients[self.inbound_nodes[0]])
        if(DEBUG) : print('W.r.t ',self.W.name,': \n-------------\n', self.gradients[self.inbound_nodes[1]])
        if(DEBUG) : print('W.r.t ',self.b.name,': \n-------------\n', self.gradients[self.inbound_nodes[2]])

X, W, b = Input('X'), Input('W'), Input('b')
f = Linear(X, W, b)
X_ = np.array([[1., 2.], [3., 4.]])
W_ = np.array([[5., 6.], [7., 8.]])
b_ = np.array([-1., -1])
feed_dict = {X: X_, W: W_, b: b_}
graph = topological_sort(feed_dict)
output = forward_pass(f, graph)
print("\n\noutput: \n", output)
----->Forward pass @  X
----->Forward pass @  b
----->Forward pass @  W
----->Forward pass @  Linear_OP
X:
[[ 1.  2.]
 [ 3.  4.]] * 
W:
[[ 5.  6.]
 [ 7.  8.]] + 
b:
[-1. -1.] =
 Linear_OP:
[[ 18.  21.]
 [ 42.  49.]]

output: 
 [[ 18.  21.]
 [ 42.  49.]]

сигмовидная

https://en.wikipedia.org/wiki/Sigmoid_function

S(x)=11+e−x

Градиент: ∂S(x)∂x=S(x)∗(1−S(x))

∂стоимость∂S(x)=градусность

class Sigmoid(Node):
    """
    Represents a node that performs the sigmoid activation function.
    """
    def __init__(self, node):
        # The base class constructor.
        Node.__init__(self, [node])
        self.name = "Sigmoid_Op"
    def _sigmoid(self, x):
        """
        This method is separate from `forward` because it
        will be used with `backward` as well.
        `x`: A numpy array-like object.
        """
        return 1. / (1. + np.exp(-x))
    def forward(self):
        """
        Perform the sigmoid function and set the value.
        """
        if(DEBUG) : print("\n----->Forward pass @ " ,self.name)
        if(DEBUG) : print("Initial value of {} is {}".format(self.name, self.value))
        input_value = self.inbound_nodes[0].value
        self.value = self._sigmoid(input_value)
        if(DEBUG) : print("{}:\n{} ---> {}:\n{}".format(self.inbound_nodes[0].name, self.inbound_nodes[0].value, 
                                           self.name, self.value))
    def backward(self):
        """
        Calculates the gradient using the derivative of
        the sigmoid function.
        """
        # Initialize the gradients to 0.
        self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
        if(DEBUG) : print('\n')
        if(DEBUG) : print('=============================\n\tBP @ Sigmoid\n=============================\n')
        if(DEBUG) : print('Initial Gradients:\n------------------')
        if(DEBUG) : print('W.r.t ', self.inbound_nodes[0].name, ': \n----------------\n', self.gradients[self.inbound_nodes[0]])
        
        # Cycle through the outputs. The gradient will change depending
        # on each output, so the gradients are summed over all outputs.
        for n in self.outbound_nodes:
            # Get the partial of the cost with respect to this node.
            grad_cost = n.gradients[self] #For eg. get it from MSE
            
            if(DEBUG) : print('\n')
            if(DEBUG) : print('Getting ', n.name, 'gradient : \n<-----------------------------\n', grad_cost)
            if(DEBUG) : print('\n')
                
            sigmoid = self.value
            self.gradients[self.inbound_nodes[0]] += sigmoid * (1 - sigmoid) * grad_cost
            
        if(DEBUG) : print('Calculated Final Gradient:')
        if(DEBUG) : print('--------------------------')
        if(DEBUG) : print('W.r.t ',self.inbound_nodes[0].name,': \n-------------\n', self.gradients[self.inbound_nodes[0]])
X, W, b = Input('X'), Input('W'), Input('b')
f = Linear(X, W, b)
g = Sigmoid(f)
X_ = np.array([[-1., -2.], [-1, -2]])
W_ = np.array([[2., -3], [2., -3]])
b_ = np.array([-3., -5])
feed_dict = {X: X_, W: W_, b: b_}
graph = topological_sort(feed_dict)
output = forward_pass(g, graph)
"""
Output should be:
[[  1.23394576e-04   9.82013790e-01]
 [  1.23394576e-04   9.82013790e-01]]
"""
print("\n\noutput: \n", output)
output: 
 [[  1.23394576e-04   9.82013790e-01]
 [  1.23394576e-04   9.82013790e-01]]

MSE (затраты/убытки)

MSE(θ)=1NN∑i=1(f(yi|θ)−ai)2 $f(y_i \theta) — это функция, которая вычисляет y_i с параметрами\theta$ или весами, чему она уже научилась

Градиент:

Относительно y: 2N(y−a)

Относительно a: −2N(y−a)

class MSE(Node):
    def __init__(self, y, a):
        """
        The mean squared error cost function.
        Should be used as the last node for a network.
        """
        # Call the base class' constructor.
        Node.__init__(self, [y, a])
        self.name = "MSE_Op"
        
    def forward(self):
        """
        Calculates the mean squared error.
        """
        # NOTE: We reshape these to avoid possible matrix/vector broadcast
        # errors.
        #
        # For example, if we subtract an array of shape (3,) from an array of shape
        # (3,1) we get an array of shape(3,3) as the result when we want
        # an array of shape (3,1) instead.
        #
        # Making both arrays (3,1) insures the result is (3,1) and does
        # an elementwise subtraction as expected.
        if(DEBUG) : print("\n----->Forward pass @ " ,self.name)
        if(DEBUG) : print("Initial value of {} is {}".format(self.name, self.value))
            
        y = self.inbound_nodes[0].value.reshape(-1, 1)
        a = self.inbound_nodes[1].value.reshape(-1, 1)
        self.m = self.inbound_nodes[0].value.shape[0]
        # Save the computed output for backward.
        self.diff = y - a
        self.value = np.mean(np.square(self.diff))
        
        if(DEBUG) : print("{}:\n{} - \n{}:\n{} =\n {}:\n{}".format(self.inbound_nodes[0].name,y,
                                                                  self.inbound_nodes[1].name, a,
                                                                  self.name, self.value))
    def backward(self):
        """
        Calculates the gradient of the cost.
        This is the final node of the network so outbound nodes
        are not a concern.
        """
        if(DEBUG) : print('\n')
        if(DEBUG) : print('=============================\n\tBP @ MSE\n=============================\n')
        if(DEBUG) : print('Initial Gradients:\n------------------')
        if(DEBUG) : print('Nothing! Since this node will be the last node!!!\n')
        
        self.gradients[self.inbound_nodes[0]] = (2 / self.m) * self.diff
        self.gradients[self.inbound_nodes[1]] = (-2 / self.m) * self.diff #for eg. this goes back to Sigmoid
        
        if(DEBUG) : print('Calculated Final Gradient:\n----------------')
        if(DEBUG) : print('W.r.t ',self.inbound_nodes[0].name,': \n------------------\n', self.gradients[self.inbound_nodes[0]])
        if(DEBUG) : print('W.r.t ',self.inbound_nodes[1].name,': \n------------------\n', self.gradients[self.inbound_nodes[1]])
y, a = Input('y'), Input('a')
cost = MSE(y, a)
y_ = np.array([1, 2, 3])
a_ = np.array([4.5, 5, 10])
feed_dict = {y: y_, a: a_}
graph = topological_sort(feed_dict)
# forward pass
# forward_pass_mse(graph)
forward_pass(cost, graph)
"""
Expected output
23.4166666667
"""
print(cost.value)
----->Forward pass @  y
----->Forward pass @  a
----->Forward pass @  MSE_Op
Initial value of MSE_Op is None
y:
[[1]
 [2]
 [3]] - 
a:
[[  4.5]
 [  5. ]
 [ 10. ]] =
 MSE_Op:
23.416666666666668
23.4166666667

2. Обратное распространение

Частичная дифференциация
https://math.stackexchange.com/questions/45952/why-do-we-take-a-derivative
https://simple.wikipedia.org/wiki/Производные_(математика)
https://en.wikipedia.org/wiki/Partial_derivative
https://en.wikipedia.org/wiki/Chain_rule

Академия Хана:

Ссылки:

С ASCII-артом:

-2
  x--------
  -4       \ 
            \  -----  q 3
              |  +  | -----
            /  -----   -4   \
   5       /                 \
  y--------                   \  -----  -12
   -4                           |  *  | ----- f(x,y,z) = (x+y)z
                              /  -----    1
                             /
   -4                       /  
  z-------------------------
    3

Пусть f(x,y,z)=(x+y)z

быть некоторой функцией, которая вычисляет некоторую ошибку.

Например. значения: х=-2, у=5, г=-4

Пусть q=x+yЧастные производные…∂q∂x=1∂q∂y=1

Теперь f=qzЧастные производные…∂f∂q=z=−4∂f∂z=q=3

Сколько каждый из x, y, z вносит вклад в f,

i.e ∂f∂x ∂f∂y ∂f∂z

Мы знаем, что ∂f∂f=1

Цепное правило:
∂f∂y=∂f∂q∂q∂y=−4,1=−4

∂f∂x=∂f∂q∂q∂x=−4.1=−4

**Другой пример **

def forward_and_backward(graph):
    """
    Performs a forward pass and a backward pass through a list of sorted Nodes.
    Arguments:
        `graph`: The result of calling `topological_sort`.
    """
    # Forward pass
    for n in graph:
        n.forward()
    # Backward pass
    # see: https://docs.python.org/2.3/whatsnew/section-slices.html
    for n in graph[::-1]:
        n.backward()

Рассмотрим сеть с линейным узлом l1

, сигмовидный узел s и еще один линейный узел l2, за которым следует узел MSE для расчета стоимости, C

Мы можем видеть, что каждое из значений этих узлов течет вперед и в конечном итоге дает стоимость C

. Например, значение второго линейного узла l2 входит в стоимостной узел и определяет значение этого узла. Соответственно, изменение l2 вызовет изменение C

. Мы можем записать эту связь между изменениями в виде градиента,

∂C∂l2

Вот что значит градиент, это наклон, насколько вы меняете стоимость ∂C

при изменении l2, ∂l2

. Таким образом, узел с большим градиентом по отношению к стоимости будет способствовать большему изменению стоимости. Таким образом, мы можем возложить ответственность за стоимость на каждый узел. Чем больше градиент для узла, тем больше на него возлагают вину за окончательную стоимость. И чем больше виноват узел, тем больше мы будем обновлять его на шаге градиентного спуска.

Если мы хотим обновить один из весов с помощью градиентного спуска, нам понадобится градиент стоимости по отношению к этим весам. Давайте посмотрим, как мы можем использовать эту структуру, чтобы найти градиент для весов во втором слое, w2

. Мы хотим вычислить градиент C относительно w2

∂C∂w2

На графике видно, что w2

связан с l2, поэтому изменение в w2 создаст изменение в l2, которое затем создаст изменение в C. Мы можем возложить вину на w2, отправив градиент стоимости обратно через сеть. Сначала у вас есть, насколько l2 повлиял на C, затем насколько w2 повлиял на l2. Перемножив эти градиенты вместе, вы получите полную вину, приписываемую w2.

Умножение этих градиентов — это просто применение цепного правила: ∂C∂w2=∂C∂l2∗∂l2∂w2

Вы можете увидеть на графике w2

, l2 и C связаны вместе. Любое изменение w2 вызовет изменение l2, и величина этого изменения определяется градиентом ∂l2/∂w2. Теперь, поскольку l2 изменяется, это вызовет изменение стоимости C, и величина этого изменения определяется градиентом ∂C/∂l2. Вы можете думать о цепном правиле аналогично эффекту домино: изменение чего-либо в сети будет распространяться через нее, изменяя по пути другие узлы. Если вы думаете о цепном правиле как об обычных дробях, вы можете увидеть, что ∂l2 в знаменателе и числителе сокращаются, и вы получаете обратно ∂C/∂w2 (хотя это не совсем работает как обычные дроби, но помогает сохранить следим за вещами.) Хорошо, давайте проработаем градиент для w2. Во-первых, нам нужно знать градиент для l2

Стоимость C=1m∑x(y(x)−l2)2

А значение для второго линейного узла равно l2=w2.s+b2

где w2

, s и b2 — все векторы, а w2.s означает скалярное произведение w2 и s

∂C∂l2=∂∂l2[1m∑x(y(x)−l2)2]=−2m∑x(y(x)−l2)

∂l2∂w2=∂∂w2[w2.s+b2]=s

Собрав их вместе, вы получите градиент для w2.

∂C∂w2=−2m∑x(y(x)−l2)s

Это градиент, который вы используете в обновлении градиентного спуска для w2.

. Вы можете видеть, что мы сделали здесь, мы прошлись по графику и умножили все градиенты, которые мы нашли по пути. Теперь давайте углубимся и рассчитаем градиент для w1.

. Здесь мы используем тот же метод, что и раньше, просматривая график в обратном направлении.

Надеюсь, теперь понятно, как записать градиент для w1

просто глядя на график. Используя цепное правило, мы запишем градиенты для каждого узла, идущего в обратном направлении по графу, пока не дойдем до w1.

∂C∂w1=∂C∂l2∂l2∂s∂s∂l1∂l1∂w1

Теперь мы можем начать вычислять каждый градиент в этом выражении, чтобы получить градиент для w1.

∂l2∂s=∂∂s[w2.s+b2]=w2

Следующая часть — это градиент сигмовидной функции, s=f(l1)

. Поскольку мы используем здесь логистическую функцию, производную можно записать в терминах самой сигмоиды.

∂s∂l1=∂∂l1f(l1)=f(l1)(1−f(l1)) ∂l1∂w1=∂∂w1[w1.x+b1]=x

Сложив все это вместе, вы получите

∂C∂w1=−2m∑x(y(x)−l2).w2.f(l1).(1−f(l1)).x

Теперь мы можем видеть четкую закономерность. Чтобы найти градиент, вы просто умножаете градиенты для всех узлов перед ним в обратном направлении от стоимости. Это идея обратного распространения. Градиенты передаются по сети в обратном направлении и используются с градиентным спуском для обновления весов и смещений. Если узел имеет несколько исходящих узлов, вы просто суммируете градиенты от каждого узла.

T_DEBUG = False
DEBUG = True
#Placeholders
X = Input('X') # 2 x 2
y = Input('y') # 2 x 1
W1 = Input('W1') # 2 x 1
b = Input('b') # 1 x 1
#Graph Operations
f1 = Linear(X, W1, b) # 2 x 2 . 2 x 1 + 1 x 1 = 2 x 1 
activation_output = Sigmoid(f1) # 2 x 1

X_ = np.array([[-1., -2.], [-1, -2]])
W1_ = np.array([[2.], [3.]])
b_ = np.array([-3.])
y_ = np.array([1, 2])
if True: #For simple graph
    cost = MSE(y, activation_output) # 2 x 1
    feed_dict = {
        X: X_,
        y: y_,
        W1: W1_,
        b: b_,
    }
    graph = topological_sort(feed_dict)
    forward_and_backward(graph)
    # return the gradients for each Input
    gradients = [t.gradients[t] for t in [X, y, W1, b]]
else: #For the once explained above!
    W2 = Input('W2') 
    W2_ = np.array([[2.]]) # 1 x 1
    b2 = Input('b2') 
    b2_ = np.array([-3.]) # 1,
    
    f2 = Linear(activation_output, W2, b2) # 2 x 1 . 1 x 1 + 1 x 1 = 2 x 1
    cost = MSE(y, f2) # 2 x 1 - 2 x 1
    feed_dict = {
        X: X_,
        y: y_,
        W1: W1_,
        b: b_,
        W2: W2_,
        b2: b2_
    }
    graph = topological_sort(feed_dict)
    forward_and_backward(graph)
    # return the gradients for each Input
    gradients = [t.gradients[t] for t in [X, y, W1, b, W2, b2]]
 
"""
Expected output for case 1:
[array([[ -3.34017280e-05,  -5.01025919e-05],
       [ -6.68040138e-05,  -1.00206021e-04]]), 
 array([[ 0.9999833],
       [ 1.9999833]]), 
 array([[  5.01028709e-05],
       [  1.00205742e-04]]), 
 array([ -5.01028709e-05])
]
"""
print('\n\n', gradients)
----->Forward pass @  y
----->Forward pass @  b
----->Forward pass @  W1
----->Forward pass @  X
----->Forward pass @  Linear_OP
X:
[[-1. -2.]
 [-1. -2.]] * 
W1:
[[ 2.]
 [ 3.]] + 
b:
[-3.] =
 Linear_OP:
[[-11.]
 [-11.]]
----->Forward pass @  Sigmoid_Op
Initial value of Sigmoid_Op is None
Linear_OP:
[[-11.]
 [-11.]] ---> Sigmoid_Op:
[[  1.67014218e-05]
 [  1.67014218e-05]]
----->Forward pass @  MSE_Op
Initial value of MSE_Op is None
y:
[[1]
 [2]] - 
Sigmoid_Op:
[[  1.67014218e-05]
 [  1.67014218e-05]] =
 MSE_Op:
2.4999498960133932

=============================
	BP @ MSE
=============================
Initial Gradients:
------------------
Nothing! Since this node will be the last node!!!
Calculated Final Gradient:
----------------
W.r.t  y : 
------------------
 [[ 0.9999833]
 [ 1.9999833]]
W.r.t  Sigmoid_Op : 
------------------
 [[-0.9999833]
 [-1.9999833]]

=============================
	BP @ Sigmoid
=============================
Initial Gradients:
------------------
W.r.t  Linear_OP : 
----------------
 [[ 0.]
 [ 0.]]

Getting  MSE_Op gradient : 
<-----------------------------
 [[-0.9999833]
 [-1.9999833]]

Calculated Final Gradient:
--------------------------
W.r.t  Linear_OP : 
-------------
 [[ -1.67008640e-05]
 [ -3.34020069e-05]]

=============================
	BP @ Linear
=============================
Initial Gradients:
------------------
W.r.t X: 
---------------
[[ 0.  0.]
 [ 0.  0.]]
W.r.t W1: 
---------------
[[ 0.]
 [ 0.]]
W.r.t b: 
---------------
[ 0.]

Getting  Sigmoid_Op gradient is : 
<-----------------------------
 [[ -1.67008640e-05]
 [ -3.34020069e-05]]

Calculated Final Gradient:
----------------
W.r.t  X : 
-------------
 [[ -3.34017280e-05  -5.01025919e-05]
 [ -6.68040138e-05  -1.00206021e-04]]
W.r.t  W1 : 
-------------
 [[  5.01028709e-05]
 [  1.00205742e-04]]
W.r.t  b : 
-------------
 [ -5.01028709e-05]

=============================
	BP @ X
=============================
Initial Gradients:
------------------
W.r.t X: 
------------
0

Getting  Linear_OP gradient : 
<-----------------------------
 [[ -3.34017280e-05  -5.01025919e-05]
 [ -6.68040138e-05  -1.00206021e-04]]

Calculated Final Gradient:(Note: Calculated by next node in the graph!!!)
----------------
W.r.t  X  : 
-------------
 [[ -3.34017280e-05  -5.01025919e-05]
 [ -6.68040138e-05  -1.00206021e-04]]

=============================
	BP @ W1
=============================
Initial Gradients:
------------------
W.r.t W1: 
------------
0

Getting  Linear_OP gradient : 
<-----------------------------
 [[  5.01028709e-05]
 [  1.00205742e-04]]

Calculated Final Gradient:(Note: Calculated by next node in the graph!!!)
----------------
W.r.t  W1  : 
-------------
 [[  5.01028709e-05]
 [  1.00205742e-04]]

=============================
	BP @ b
=============================
Initial Gradients:
------------------
W.r.t b: 
------------
0

Getting  Linear_OP gradient : 
<-----------------------------
 [ -5.01028709e-05]

Calculated Final Gradient:(Note: Calculated by next node in the graph!!!)
----------------
W.r.t  b  : 
-------------
 [ -5.01028709e-05]

=============================
	BP @ y
=============================
Initial Gradients:
------------------
W.r.t y: 
------------
0

Getting  MSE_Op gradient : 
<-----------------------------
 [[ 0.9999833]
 [ 1.9999833]]

Calculated Final Gradient:(Note: Calculated by next node in the graph!!!)
----------------
W.r.t  y  : 
-------------
 [[ 0.9999833]
 [ 1.9999833]]

 [array([[ -3.34017280e-05,  -5.01025919e-05],
       [ -6.68040138e-05,  -1.00206021e-04]]), array([[ 0.9999833],
       [ 1.9999833]]), array([[  5.01028709e-05],
       [  1.00205742e-04]]), array([ -5.01028709e-05])]

3. SGD с бостонским набором данных

Градиентный спуск

x=x-скорость обучения∗∂стоимость∂x

Стохастический градиентный спуск Стохастический градиентный спуск (SGD) — это версия градиентного спуска, в которой при каждом прямом проходе пакет данных случайным образом выбирается из общего набора данных. Помните, мы говорили о размере партии ранее? Это размер партии. В идеале весь набор данных должен передаваться в нейронную сеть при каждом прямом проходе, но на практике это нецелесообразно из-за нехватки памяти. SGD — это аппроксимация градиентного спуска, чем больше пакетов обрабатывается нейронной сетью, тем лучше аппроксимация.

Наивная реализация SGD включает:

Произвольно выберите пакет данных из общего набора данных.
Запуск сети вперед и назад для расчета градиента (с данными из (1)).
Примените обновление градиентного спуска.
Повторяйте шаги 1–3 до тех пор, пока не произойдет сходимость или петля не будет остановлена другим механизмом (т. е. числом эпох).

Если все пойдет хорошо, потери в сети, как правило, должны иметь тенденцию к снижению, указывая на более полезные веса и смещения с течением времени.

Простой пример

f(x)=x4−3x3+2

И его производная f′(x)=4x3−9x2

# The gradient descent algorithm is applied to find a local minimum of the function f(x)=x^4−3x^3+2, with derivative
#f'(x)=4x^3−9x^2. Here is an implementation in the Python programming language.
# From calculation, it is expected that the local minimum occurs at x=9/4
cur_x = 6 # The algorithm starts at x=6
gamma = 0.01 # step size multiplier
precision = 0.00001
previous_step_size = cur_x
def df(x):
    return 4 * x**3 - 9 * x**2
while previous_step_size > precision:
    prev_x = cur_x
    cur_x += -gamma * df(prev_x)
    previous_step_size = abs(cur_x - prev_x)
print("The local minimum occurs at %f" % cur_x)
The local minimum occurs at 2.249965
def sgd_update(trainables, learning_rate=1e-2):
    """
    Updates the value of each trainable with SGD.
    Arguments:
        `trainables`: A list of `Input` Nodes representing weights/biases.
        `learning_rate`: The learning rate.
    """
    # Performs SGD
    #
    # Loop over the trainables
    for t in trainables:
        # Change the trainable's value by subtracting the learning rate
        # multiplied by the partial of the cost with respect to this
        # trainable.
        partial = t.gradients[t]
        t.value -= learning_rate * partial
T_DEBUG = False
DEBUG = False
"""
Have fun with the number of epochs!
Be warned that if you increase them too much,
logs will overshoot :)
"""
import numpy as np
from sklearn.datasets import load_boston
from sklearn.utils import shuffle, resample
# from miniflow import *
# Load data
data = load_boston()
X_ = data['data']
y_ = data['target']
# Normalize data
X_ = (X_ - np.mean(X_, axis=0)) / np.std(X_, axis=0)
n_features = X_.shape[1] # 13
n_hidden = 10
W1_ = np.random.randn(n_features, n_hidden)
b1_ = np.zeros(n_hidden)
W2_ = np.random.randn(n_hidden, 1)
b2_ = np.zeros(1)
# Neural network
X, y = Input(), Input()
W1, b1 = Input(), Input()
W2, b2 = Input(), Input()
l1 = Linear(X, W1, b1)
s1 = Sigmoid(l1)
l2 = Linear(s1, W2, b2)
cost = MSE(y, l2)
feed_dict = {
    X: X_,
    y: y_,
    W1: W1_,
    b1: b1_,
    W2: W2_,
    b2: b2_
}
epochs = 100
# Total number of examples
m = X_.shape[0] # 506
batch_size = 11
steps_per_epoch = m // batch_size
graph = topological_sort(feed_dict)
trainables = [W1, b1, W2, b2]
print("Total number of examples = {}".format(m))
total_loss = []
# Step 4
for i in range(epochs):
    loss = 0
    for j in range(steps_per_epoch):
        # Step 1
        # Randomly sample a batch of examples
        X_batch, y_batch = resample(X_, y_, n_samples=batch_size)
        # Reset value of X and y Inputs
        X.value = X_batch
        y.value = y_batch
        # Step 2
        forward_and_backward(graph)
        # Step 3
        sgd_update(trainables)
        loss += graph[-1].value
        
#     print("=======> Epoch: {}, Loss: {:.3f}".format(i+1, loss/steps_per_epoch))
    total_loss.append(loss/steps_per_epoch)
Total number of examples = 506
plot.plot(range(len(total_loss)), total_loss)
[<matplotlib.lines.Line2D at 0x7f350a0e24a8>]