Consumer depth sensors are increasingly popular and are entering our daily lives, as marked by their recent integration in the iPhone X. However, they still suffer from heavy noise, which dramatically limits their applications. Although considerable progress has been made in reducing noise and boosting geometric detail, the problem remains far from solved because of its inherent ill-posedness and the real-time requirement. We propose a cascaded Depth Denoising and Refinement Network (DDRNet) to tackle this problem by leveraging multi-frame fused geometry and the accompanying high-quality color image through a joint training strategy. The classic rendering equation is carefully exploited in our network in an unsupervised manner. Experimental results indicate that our network achieves real-time denoising and refinement on various categories of static and dynamic scenes. Thanks to the effective decoupling of low- and high-frequency information in the cascaded network, we achieve superior performance over state-of-the-art techniques.