Lecture 15 Object Detection | Notion

Task Definition

Input: Single RGB Image

Output: A Set of detected objects, each containing

Category Label (From fixed, known set of categories)
Bounding box (four numbers: x, y, width, height)

Challenges (Compared to Image Classification)

Multiple Outputs - Need to output variable number of objects/image
Multiple Types of output - Need to predict “What” as well as “where”
Large Images - Need more resolution to properly identify each object

Detecting Single objects

We can use a architecture that produces two outputs
- One predicts the classification
- One predicts the box coordinates
We use the Softmax Loss and L2 Loss, and use the weighted sum
- We use weighted sum since gradient descent needs a single loss
- MultiTask Loss

Detecting Multiple Objects: Sliding Window

Add a “Background” Category to the Classifier
Have a Sliding Window that slides over the image, and runs classification
- This is Too Computationally Heavy. Not good

Region Proposals

Find a small set of boxes that are likely to cover all objects
Often based on heuristics
Used for R-CNN (Region Based CNN)
- What if region (bounding Box) doesn’t quite match with the object?
- Use MultiTask Loss such that Conv Net outputs Class and BBox
Classify Each Region
- Bounding Box Regression
  - Predict “Transform” to correct the Region of Interest