Identifying the temporal segments in a video that contain content relevant to a category or task is a difficult but interesting problem. This has applications in fine-grained video indexing and retrieval. Part of the difficulty in this problem comes from the lack of supervision since large-scale annotation of localized segments containing the content of interest is very expensive. In this paper, we propose to use the category assigned to an entire video as weak supervision to our model. Using such weak supervision, our model learns to do joint video level categorization and localization of content relevant to the category of the video. This can be thought of as providing both a classification label and an explanation in the form of the relevant regions of the video. Extensive experiments on a large scale data set show our model can achieve good localization performance without any direct supervision and can combine signals from multiple modalities like speech and vision.