def forward(self, x, target):
    """ forward propagation """
    assert x.dim() == 2, "dimension of input should be 2"
    exp_x = torch.exp(x)
    y = exp_x / exp_x.sum(1).unsqueeze(1).expand_as(exp_x)

    # parameter "target" is a LongTensor and denotes the labels of classes,
    # here we need to convert it into one-hot vectors
    t = torch.zeros(y.size()).type(y.type())
    for n in range(t.size(0)):
        t[n][target[n]] = 1

    output = (-t * torch.log(y)).sum() / y.size(0)
    # output should be a tensor, but the result of sum() is a Python float,
    # so wrap it back into a tensor
    output = torch.Tensor([output]).type(y.type())
    self.y = y  # save for backward
    self.t = t  # save for backward
    return output
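The forward pass saves `y` and `t`, but the backward pass itself is not shown in this excerpt. Since the gradient of softmax cross-entropy with respect to the input is `(y - t) / batch_size`, a matching backward could look like the sketch below (an assumption on my part, following the old-style `torch.autograd.Function` convention this forward appears to use, with the usual `from torch.autograd import Function, Variable` imports, where saved tensors live on `self` and `backward` receives the upstream gradient as a 1-element tensor):

def backward(self, grad_output):
    """ backward propagation: d(loss)/dx = (y - t) / batch_size """
    grad_input = (self.y - self.t) / self.y.size(0)
    # scale by the upstream gradient (usually 1.0 for a scalar loss)
    return grad_input * grad_output[0], None  # no gradient w.r.t. target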
To use the auto-grad scheme of PyTorch, we also define a plain function that performs the same forward computation as the softmax loss above:
def SoftmaxLossFunc(x, target):
    exp_x = torch.exp(x)
    y = exp_x / exp_x.sum(1).unsqueeze(1).expand_as(exp_x)

    t = torch.zeros(y.size()).type(y.data.type())
    for n in range(t.size(0)):
        t[n][target.data[n]] = 1
    t = Variable(t)  # wrap so the one-hot tensor can enter the autograd graph

    # the listing is truncated here; the loss must mirror forward() above
    output = (-t * torch.log(y)).sum() / y.size(0)
    return output
def test_softmax_loss_backward():
    """ analyse the difference between autograd and manual grad """
    # generate random testing data
    x_size = 3200
    x = torch.randn(x_size, x_size)  # use .cuda() for GPU mode
    x_var = Variable(x, requires_grad=True)  # convert tensor into Variable

    # a second copy of the input and random labels are needed for the
    # comparison (these definitions are omitted in the original listing)
    x_var_copy = Variable(x.clone(), requires_grad=True)
    target = (torch.rand(x_size) * x_size).long()  # random labels in [0, x_size)
    target_var = Variable(target)

    # compute output of softmax loss with the manual layer
    # ("SoftmaxLoss" is assumed to be the Function defined above)
    y = SoftmaxLoss()(x_var, target_var)
    # compute output of softmax loss via autograd
    y_hat = SoftmaxLossFunc(x_var_copy, target_var)

    # compute gradient of input data with two different methods
    y.backward()      # manual gradient
    y_hat.backward()  # auto gradient

    # compute difference of gradients
    grad_dist = (x_var.grad - x_var_copy.grad).data.abs().sum()
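The listing above does not show how the time-costing numbers in the logs below were measured. A plausible sketch (my reconstruction, not necessarily the post's actual harness) simply wraps each backward call with `time.time()`:

import time

start = time.time()
y.backward()
print('time of manual gradient: %s' % (time.time() - start))

start = time.time()
y_hat.backward()
print('time of auto gradient: %s' % (time.time() - start))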
outputs:
The distance between the gradient from our implementation and PyTorch's auto-gradient is about 1e-7 under 32-bit floating-point precision, and our backward operation is slightly faster than the baseline:
=====================================================
|===> testing softmax loss forward
distance between y_hat and y: 0.0

|===> testing softmax loss backward
y: Variable containing:
 8.5553
[torch.FloatTensor of size 1]

y_hat: Variable containing:
 8.5553
[torch.FloatTensor of size 1]

distance between x.grad and x_copy.grad: 1.11203504294e-07

|===> comparing time-costing
time of manual gradient: 1.13225889206
time of auto gradient: 1.40407109261
With 64-bit double precision, the gradient difference is reduced to about 1e-16. Notice that the outputs of the two softmax-loss functions now differ by about 1e-7, presumably because the manual forward wraps the scalar with `torch.Tensor([output])`, which creates a 32-bit FloatTensor before casting back to double. Again, our method is slightly faster:
=====================================================
|===> testing softmax loss forward
distance between y_hat and y: 2.31496107617e-07

|===> testing softmax loss backward
y: Variable containing:
 8.5468
[torch.DoubleTensor of size 1]

y_hat: Variable containing:
 8.5468
[torch.DoubleTensor of size 1]

distance between x.grad and x_copy.grad: 1.99762357071e-16

|===> comparing time-costing
time of manual gradient: 1.170181036
time of auto gradient: 2.39760398865
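For reference, switching to the 64-bit run above only requires casting the test data to double precision; a minimal sketch, assuming the rest of `test_softmax_loss_backward` is unchanged:

x = torch.randn(x_size, x_size).double()  # 64-bit test data
x_var = Variable(x, requires_grad=True)
x_var_copy = Variable(x.clone(), requires_grad=True)
# the integer class labels need no cast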