Online emotion diffusion is a compound process that involves interactions with multiple modalities. For instance, different behaviors influence the velocity and scale of emotion diffusion in online communities. Depicting and predicting massive online emotions helps to guide the trend of emotion evolution, thus avoiding unprecedented damage in crises. However, most existing work depicts and predicts online emotions with models that do not consider related modalities. An efficient modeling framework that improves performance by leveraging multi-modality knowledge and quantifies the interactions among different modalities is still lacking. In this paper, we develop a computational model to jointly depict online emotions and behaviors. By introducing a common structure, we can quantify how user emotions interact with the corresponding behaviors. To scale to large datasets, we propose a hierarchical optimization algorithm that accelerates the convergence of the model. Evaluation on a Sina Weibo dataset suggests that the proposed model lowers the prediction error rate by 69 percent. In addition, the proposed model helps to explain how user emotions influence consequent behaviors in extreme situations.