Spatial audio is an essential part of virtual reality. Unlike synthesized signals, spatial audio captured in the real world may suffer from background noise, which degrades signal quality. Previous works have addressed this problem by proposing methods that attenuate the undesired signals while preserving the desired signals with minimal distortion, but these succeed only partially. Recently, methods that aim to preserve the desired signal in its entirety have been proposed, and in this work we study such methods based on time-frequency masking. Two masks were investigated: one in the spherical harmonics (SH) domain, and the other in the plane wave density (PWD) function domain, referred to here as the spatial domain. These two methods were compared with a low-end reference method that uses a single maximum directivity beamformer followed by a single-channel time-frequency mask. A subjective investigation was conducted to evaluate the performance of these methods. It showed that the spatial mask better preserves the desired sound field, while the SH mask better preserves the spatial cues of the residual noise.
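
As a rough illustration of what time-frequency masking means for multichannel signals, the sketch below multiplies the short-time Fourier transform (STFT) of every channel (for example, every SH coefficient signal) by one shared real-valued gain per time-frequency bin. The function names, the Wiener-like gain rule, and the gain floor are assumptions made for illustration only; they are not the masks studied in this work.

```python
import numpy as np

def wiener_like_mask(noisy_power, noise_power, floor=0.1):
    """Hypothetical gain rule (an assumption, not the paper's mask).

    Estimates the fraction of power in each time-frequency bin that is
    not noise, and limits the gain to the range [floor, 1].
    Both inputs are real arrays of shape (freqs, frames).
    """
    gain = 1.0 - noise_power / np.maximum(noisy_power, 1e-12)
    return np.clip(gain, floor, 1.0)

def apply_shared_mask(stft_channels, mask):
    """Apply one real-valued mask to every channel.

    stft_channels: complex array of shape (channels, freqs, frames)
    mask: real array of shape (freqs, frames), values in [0, 1]
    """
    mask = np.clip(mask, 0.0, 1.0)
    # Broadcasting applies the same gain to all channels in each bin.
    return stft_channels * mask[np.newaxis, :, :]
```

Because every channel in a given time-frequency bin is scaled by the same real gain, the amplitude ratios and phase differences between channels in that bin are unchanged.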