Model selection methods based on stochastic regularization have been widely used in deep learning due to their simplicity and effectiveness. The well-known Dropout method treats all units, visible or hidden, in the same way, ignoring any a priori information about grouping or structure. Such structure is present in multi-modal learning applications such as affect analysis and gesture recognition, where subsets of units may correspond to individual modalities. Here we describe Modout, a model selection method based on stochastic regularization that is particularly useful in the multi-modal setting. Unlike other forms of stochastic regularization, Modout can learn whether, and in which layer, to fuse two modalities, a choice usually treated as an architectural hyper-parameter by deep learning researchers and practitioners. Modout is evaluated on two real multi-modal datasets. The results indicate improved performance compared to other forms of stochastic regularization. On the Montalbano dataset, the fusion structure learned by Modout performs on par with a state-of-the-art, carefully hand-designed architecture.
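To make the contrast with unit-wise Dropout concrete, below is a minimal NumPy sketch of a modality-aware mask in the spirit of Modout. It is illustrative only: the fixed `p_fuse` probabilities, the always-kept intra-modal connections, and the single Bernoulli draw per modality pair are simplifying assumptions rather than the paper's exact formulation, in which the fusion probabilities are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_mask(num_units, p_drop):
    """Standard Dropout: every unit is kept or dropped independently."""
    return rng.random(num_units) > p_drop

def modout_mask(in_modality, out_modality, p_fuse):
    """Illustrative Modout-style mask over a weight matrix.

    in_modality[i]  -- modality id of input unit i
    out_modality[j] -- modality id of output unit j
    p_fuse[a, b]    -- probability of keeping connections from modality a
                       to modality b (fixed here for illustration; in Modout
                       these probabilities are learned)

    Intra-modal connections are always kept; one Bernoulli draw per ordered
    modality pair switches the whole block of cross-modal connections on or
    off, so training stochastically explores fused vs. unfused structures.
    """
    mask = np.ones((len(in_modality), len(out_modality)), dtype=bool)
    modalities = np.unique(np.concatenate([in_modality, out_modality]))
    for a in modalities:
        for b in modalities:
            if a == b:
                continue  # connections within a modality stay active
            keep = rng.random() < p_fuse[a, b]
            rows = np.where(in_modality == a)[0][:, None]
            cols = np.where(out_modality == b)[0][None, :]
            mask[rows, cols] = keep
    return mask

# Toy layer: four input and four output units, two per modality.
in_mod = np.array([0, 0, 1, 1])
out_mod = np.array([0, 0, 1, 1])
p_fuse = np.full((2, 2), 0.5)   # 50% chance of fusing in either direction
print(modout_mask(in_mod, out_mod, p_fuse).astype(int))
```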