This is also why the machine translation model is called a conditional language model. Put more concretely, a translated sentence (e.g. "Jane is visiting Africa") is generated conditioned on the source sentence it was translated from. I find this framing illuminating: it captures what translation really is.
I want a glass of orange ______ → word indices 4343 9665 1 3852 6163 6257
"I" is represented as $ O_{4343} \longrightarrow E \longrightarrow e_{4343} $, "want" as $ O_{9665} \longrightarrow E \longrightarrow e_{9665} $, "a" as $ O_{1} \longrightarrow E \longrightarrow e_{1} $, and so on.
$ e_{x} $ is a 300-dimensional embedding vector. Feed all the embeddings into a neural network, then into a softmax that produces a 10,000-dimensional output. The hidden layer has parameters $w^{[1]}$, $b^{[1]}$; the softmax layer has $w^{[2]}$, $b^{[2]}$. The input dimension is 6 words × 300 dimensions per word = an 1800-dimensional input layer. We can also fix a smaller context window such as "a glass of orange __", which drops "I want".
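The forward pass above can be sketched in NumPy. This is a minimal illustration, not the course's actual implementation: the hidden size (128) and tanh activation are my own assumptions, and all parameters are random since no training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, window = 10_000, 300, 6
hidden_dim = 128  # illustrative hidden size; the notes do not fix this

# Randomly initialized parameters for the sketch
E = rng.standard_normal((vocab_size, embed_dim)) * 0.01       # embedding matrix
W1 = rng.standard_normal((hidden_dim, window * embed_dim)) * 0.01  # w[1]
b1 = np.zeros(hidden_dim)                                     # b[1]
W2 = rng.standard_normal((vocab_size, hidden_dim)) * 0.01     # w[2] (softmax)
b2 = np.zeros(vocab_size)                                     # b[2]

def predict_next_word(word_ids):
    """Look up embeddings, concatenate into an 1800-d vector, hidden layer, softmax."""
    e = E[word_ids].reshape(-1)   # (6 * 300,) = (1800,) concatenated input
    h = np.tanh(W1 @ e + b1)      # hidden layer
    z = W2 @ h + b2               # softmax logits over the 10,000-word vocabulary
    z -= z.max()                  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()            # probability distribution over next words

# "I want a glass of orange ___" as the six context word indices
probs = predict_next_word([4343, 9665, 1, 3852, 6163, 6257])
print(probs.shape)  # (10000,)
```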
The article then describes different ways of composing the context, with examples such as:
Original sentence: I want a glass of orange juice to go along with my cereal
Last 4 words (a glass of orange _)
4 words on left & right (a glass of orange _ to go along with)
Last 1 word (orange _)
The author notes that the choice of context depends on the application: if the goal is just to learn word embeddings, the simpler methods above are known to work quite well.
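The three context choices above can be made concrete with a small helper. This is my own illustrative sketch (the function name and mode labels are not from the course):

```python
# Hypothetical helper: extract the context for a target word under the
# three schemes listed above.
def make_context(tokens, target_idx, mode):
    if mode == "last4":      # last 4 words before the target
        return tokens[max(0, target_idx - 4):target_idx]
    if mode == "around4":    # 4 words on the left and 4 on the right
        return (tokens[max(0, target_idx - 4):target_idx]
                + tokens[target_idx + 1:target_idx + 5])
    if mode == "last1":      # only the immediately preceding word
        return tokens[target_idx - 1:target_idx]
    raise ValueError(mode)

sentence = "I want a glass of orange juice to go along with my cereal".split()
t = sentence.index("juice")  # the target word to predict

print(make_context(sentence, t, "last4"))    # ['a', 'glass', 'of', 'orange']
print(make_context(sentence, t, "around4"))  # ['a', 'glass', 'of', 'orange', 'to', 'go', 'along', 'with']
print(make_context(sentence, t, "last1"))    # ['orange']
```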
Landmark Detection, which the class covers only briefly: detecting the internal features of an object via key landmarks
Object Detection, which covers how to localize an object with a bounding box
Sliding Window
The main focus here is a review of YOLO (You Only Look Once). It is the core of this lesson: two videos cover it, and the programming assignment is built directly around YOLO, with a few other algorithms touched on in passing.
Starting point: divide the input image into a 19x19 grid (this simplifies the computation)
Input image (608, 608, 3)
The input image goes through a CNN, resulting in a (19,19,5,85) dimensional output.
After flattening the last two dimensions, the output is a volume of shape (19, 19, 425):
Each cell in a 19x19 grid over the input image gives 425 numbers.
425 = 5 x 85 because each cell contains predictions for 5 boxes, corresponding to 5 anchor boxes, as seen in lecture.
85 = 5 + 80, where 5 is because $(p_c, b_x, b_y, b_h, b_w)$ has 5 numbers, and 80 is the number of classes we'd like to detect
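The encoding above can be unpacked with a couple of reshapes and slices. A minimal NumPy sketch (the variable names are mine, and the volume here is random, not a real network output):

```python
import numpy as np

# Illustrative decoding of the (19, 19, 425) YOLO output volume described above.
rng = np.random.default_rng(1)
y = rng.standard_normal((19, 19, 425))   # stand-in for the network output

y = y.reshape(19, 19, 5, 85)             # 425 = 5 anchor boxes x 85 numbers each
pc          = y[..., 0]                  # objectness score p_c, shape (19, 19, 5)
box_coords  = y[..., 1:5]                # (b_x, b_y, b_h, b_w), shape (19, 19, 5, 4)
class_probs = y[..., 5:]                 # 80 class scores, shape (19, 19, 5, 80)

print(pc.shape, box_coords.shape, class_probs.shape)
# (19, 19, 5) (19, 19, 5, 4) (19, 19, 5, 80)
```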
You then select only a few boxes based on:
Score-thresholding: throw away boxes that have detected a class with a score less than the threshold
Non-max suppression: Compute the Intersection over Union and avoid selecting overlapping boxes
This gives you YOLO’s final output.
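The two filtering steps can be sketched in plain Python. This is a minimal greedy version under my own assumptions (corner-format boxes, illustrative thresholds), not the assignment's TensorFlow implementation:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def filter_boxes(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Score-thresholding followed by greedy non-max suppression."""
    # 1. Score-thresholding: drop low-confidence boxes
    keep = [i for i, s in enumerate(scores) if s >= score_threshold]
    # 2. Non-max suppression: take boxes best-first, skip heavy overlaps
    keep.sort(key=lambda i: scores[i], reverse=True)
    selected = []
    for i in keep:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in selected):
            selected.append(i)
    return selected

boxes  = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(filter_boxes(boxes, scores))  # [0, 2] — the overlapping lower-score box is dropped
```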
What you should remember:
YOLO is a state-of-the-art object detection model that is fast and accurate
It runs an input image through a CNN which outputs a 19x19x5x85 dimensional volume.
The encoding can be seen as a grid where each of the 19x19 cells contains information about 5 boxes.
You filter through all the boxes using non-max suppression. Specifically:
Score thresholding on the probability of detecting a class to keep only accurate (high probability) boxes
Intersection over Union (IoU) thresholding to eliminate overlapping boxes
Because training a YOLO model from randomly initialized weights is non-trivial and requires a large dataset as well as a lot of computation, we used previously trained model parameters in this exercise. If you wish, you can also try fine-tuning the YOLO model with your own dataset, though this would be a fairly non-trivial exercise.
# Use tf.image.non_max_suppression() to get the list of indices corresponding to boxes you keep
nms_indices = tf.image.non_max_suppression(boxes, scores, max_boxes, iou_threshold)
I have always felt I need a more systematic understanding of how TF's session `run` works.
# Run the session with the correct tensors and choose the correct placeholders in the feed_dict.
# You'll need to use feed_dict={yolo_model.input: ... , K.learning_phase(): 0}
out_scores, out_boxes, out_classes = sess.run([scores, boxes, classes],
                                              feed_dict={yolo_model.input: image_data,
                                                         K.learning_phase(): 0})
In general, initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing, and you might as well be training a neural network with $n^{[l]}=1$ for every layer, and the network is no more powerful than a linear classifier such as logistic regression.
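The symmetry argument can be checked numerically. A small sketch of my own (one tanh hidden layer, sigmoid output, a single example with label y = 1): with all-zero weights, every hidden unit computes the same activation and receives an identical gradient row, so no gradient step can make the units differ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))   # one input example with 3 features

W1 = np.zeros((4, 3)); b1 = np.zeros((4, 1))   # hidden layer, all zeros
W2 = np.zeros((1, 4)); b2 = np.zeros((1, 1))   # output layer, all zeros

# Forward pass: tanh hidden units, sigmoid output
a1 = np.tanh(W1 @ x + b1)
a2 = 1 / (1 + np.exp(-(W2 @ a1 + b2)))

# Backward pass for binary cross-entropy with label y = 1
dz2 = a2 - 1.0
dW2 = dz2 @ a1.T
dz1 = (W2.T @ dz2) * (1 - a1**2)
dW1 = dz1 @ x.T

# All hidden units share one activation value, and every row of dW1 is
# identical, so a gradient step keeps the rows of W1 equal: symmetry holds.
print(np.unique(a1))              # a single repeated value
print(np.allclose(dW1, dW1[0]))   # True: all gradient rows are the same
```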