Vitali Petsiuk vpetsiuk@bu.edu, Abir Das dasabir@bu.edu, Kate Saenko saenko@bu.edu
In this paper, the authors propose an approach called RISE that generates an importance map which indicates how salient each pixel is for the model’s prediction.
They measure the importance of an image region by perturbing it and observing the effect on the black box's decision. They estimate the importance of pixels by dimming them in random combinations, reducing their intensities down to zero. They first mask the image so that only a subset of pixels is preserved, and the black box then computes the confidence score for the masked image. The saliency map is computed as a weighted sum of the random masks, where the weights are the probability scores adjusted for the distribution of the random masks. They generate the masks with bilinear upsampling, which results in a smooth importance map.
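To make the procedure above concrete, here is a minimal NumPy sketch of how I read it. The function names, the grid size `cells`, the keep probability `p_keep`, and the `score_fn` callback standing in for the black box are all my own assumptions for illustration, not the authors' code. I also upsample the coarse grid directly to the image size, and I normalize by the number of masks times `p_keep` as one plausible reading of "adjusted for the distribution of the random masks".

```python
import numpy as np

def bilinear_upsample(grid, out_h, out_w):
    """Bilinearly resize a 2-D array to shape (out_h, out_w)."""
    in_h, in_w = grid.shape
    ys = np.linspace(0, in_h - 1, out_h)   # fractional source rows
    xs = np.linspace(0, in_w - 1, out_w)   # fractional source columns
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                # vertical blend weights
    wx = (xs - x0)[None, :]                # horizontal blend weights
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bot = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def random_masks(n_masks, h, w, cells=4, p_keep=0.5, rng=None):
    """Sample coarse binary grids and upsample them into smooth masks in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    masks = np.empty((n_masks, h, w))
    for i in range(n_masks):
        grid = (rng.random((cells, cells)) < p_keep).astype(float)
        masks[i] = bilinear_upsample(grid, h, w)
    return masks

def rise_saliency(image, score_fn, masks, p_keep=0.5):
    """Weighted sum of masks, each weighted by the black-box score of the masked image."""
    scores = np.array([score_fn(image * m[..., None]) for m in masks])
    # Dividing by N * p_keep corrects for how often a pixel is kept on average.
    return np.tensordot(scores, masks, axes=1) / (len(masks) * p_keep)

# Toy usage: a hypothetical "model" that only looks at the top-left quadrant.
rng = np.random.default_rng(0)
image = np.ones((16, 16, 3))
toy_score = lambda img: float(img[:8, :8].mean())
masks = random_masks(1000, 16, 16, rng=rng)
saliency = rise_saliency(image, toy_score, masks)
```

Only forward passes through `score_fn` are needed, which is what makes the approach applicable to black-box models. The cost is one model evaluation per mask.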
The main question this paper asks is how salient each pixel in an image is for the model’s prediction and, more specifically, whether that can be determined without using gradients or other internal network state, so that the method also works on black-box models.
I think this paper is mostly technically sound: the method is straightforward to implement, and the results show how capable RISE is at identifying important pixels, particularly in black-box models. I do have some concerns, though. The pointing game metric detailed in the paper feels a bit arbitrary: using only a single point to decide whether an example counts as a hit seems prone to noise, and it falls into the trap, described earlier in the paper, of using human-centered explanations to describe model decisions.
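To illustrate the single-point concern, here is a small sketch of the pointing game as I understand it: an example counts as a hit when the maximum of the saliency map lands inside the annotated object region. The function names and the boolean `object_mask` representation are my own assumptions, not something taken from the paper.

```python
import numpy as np

def pointing_game_hit(saliency, object_mask):
    """Return True if the single most salient pixel lies inside the object region."""
    y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
    return bool(object_mask[y, x])

def pointing_accuracy(saliency_maps, object_masks):
    """Fraction of examples whose peak saliency falls inside the annotation."""
    hits = [pointing_game_hit(s, m) for s, m in zip(saliency_maps, object_masks)]
    return sum(hits) / len(hits)

# Two nearly identical maps: the peak moves by one pixel, and the verdict flips.
object_mask = np.zeros((8, 8), dtype=bool)
object_mask[2:5, 2:5] = True          # hypothetical annotated object region

inside = np.zeros((8, 8))
inside[2, 2] = 1.0                    # peak on the corner of the region
outside = np.zeros((8, 8))
outside[1, 2] = 1.0                   # peak one pixel above the region

accuracy = pointing_accuracy([inside, outside], [object_mask, object_mask])
```

Because only the argmax matters, a one-pixel shift in the peak turns a hit into a miss no matter how much of the remaining saliency mass overlaps the object.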
The paper mentions applying RISE to image captioning, which relates to a project that I did at a startup on video captioning, where we tried to analyze a video for the key frames that describe a particular scene. If I were able to understand which parts of each key frame were important, that might give better insight into producing more accurate captions.