Task Definition
Input: Single RGB Image
Output: A Set of detected objects, each containing
- Category Label (From fixed, known set of categories)
- Bounding box (four numbers: x, y, width, height)
Challenges (Compared to Image Classification)
- Multiple Outputs - Need to output variable number of objects/image
- Multiple Types of output - Need to predict “What” as well as “where”
- Large Images - Need more resolution to properly identify each object
Detecting Single objects
- We can use a architecture that produces two outputs
- One predicts the classification
- One predicts the box coordinates
- We use the Softmax Loss and L2 Loss, and use the weighted sum
- We use weighted sum since gradient descent needs a single loss
- MultiTask Loss
Detecting Multiple Objects: Sliding Window
- Add a “Background” Category to the Classifier
- Have a Sliding Window that slides over the image, and runs classification
- This is Too Computationally Heavy. Not good
Region Proposals
- Find a small set of boxes that are likely to cover all objects
- Often based on heuristics
- Used for R-CNN (Region Based CNN)
- What if region (bounding Box) doesn’t quite match with the object?
- Use MultiTask Loss such that Conv Net outputs Class and BBox
- Classify Each Region
- Bounding Box Regression
- Predict “Transform” to correct the Region of Interest