Time-series forecasting is an important and challenging task, especially when the forecast variable is tied to a person's health outcomes. Effective forecasting of a person's physical activity can support the design of adaptive behavioral interventions that keep users engaged and adherent to a prescribed routine. This research had two aims: to develop a lifestyle intervention system centered on time-series forecasting, and to develop a generic multimodal activity forecasting scheme that works with either early fusion or late fusion. We propose multimodal activity forecasting models based on long short-term memory (LSTM) networks that forecast a user's step count up to 24 hours in advance from past data. Our experiments compare multimodal with single-modal forecasting, as well as early fusion with late fusion. To evaluate our method, we conducted two user studies: one with 58 prediabetic veterans and one with 60 participants with obstructive sleep apnea. Multimodal early-fusion models achieve up to 22.7% and 19.1% lower mean absolute error (MAE) than single-modality models on the prediabetes and sleep datasets, respectively. Furthermore, in goal-based experiments, the early-fusion multimodal models forecast whether a person will reach their activity goal with 81% and 74% accuracy on the prediabetes and sleep datasets, respectively. These results indicate that multimodal forecasting with early fusion is a better choice than both multimodal forecasting with late fusion and single-modal forecasting.
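The difference between the two fusion strategies compared above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a least-squares linear forecaster over flattened lookback windows as a stand-in for the LSTM, a simple average of per-modality forecasts as the late-fusion rule, and synthetic data in which steps depend on the previous value of a second modality. The helper names (`windows`, `fit_linear`, `early_fusion`, `late_fusion`) are hypothetical.

```python
import numpy as np

def windows(features, target, lookback, horizon=1):
    """Slice (T, F) features into (N, lookback, F) windows; each label is the
    target value `horizon` steps after the end of its window."""
    starts = range(lookback, len(target) - horizon + 1)
    X = np.stack([features[t - lookback:t] for t in starts])
    y = np.array([target[t + horizon - 1] for t in starts])
    return X, y

def fit_linear(X, y):
    """Assumption: least squares stands in for the paper's LSTM.
    Each window is flattened and a bias column is appended."""
    design = lambda W: np.c_[W.reshape(len(W), -1), np.ones(len(W))]
    w, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return lambda W: design(W) @ w

def early_fusion(steps, other, lookback, horizon=1):
    """Early fusion: concatenate modalities at every time step, train ONE model."""
    X, y = windows(np.column_stack([steps, other]), steps, lookback, horizon)
    return fit_linear(X, y)(X), y

def late_fusion(steps, other, lookback, horizon=1):
    """Late fusion: train one model PER modality on the same step target,
    then average their forecasts (the averaging rule is an assumption)."""
    preds = []
    for modality in (steps, other):
        X, y = windows(modality[:, None], steps, lookback, horizon)
        preds.append(fit_linear(X, y)(X))
    return np.mean(preds, axis=0), y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    other = rng.normal(size=200)        # hypothetical second modality
    steps = np.r_[0.0, 3 * other[:-1]]  # toy steps driven by the previous value of `other`
    for name, fuse in (("early", early_fusion), ("late", late_fusion)):
        pred, y = fuse(steps, other, lookback=3)
        print(name, "in-sample MAE:", np.mean(np.abs(pred - y)))
```

On this toy data the early-fusion model sees both modalities in one input and can use the cross-modal dependency directly, while each late-fusion model sees only one modality and averaging dilutes the informative forecast. The printed errors are in-sample on synthetic data and are unrelated to the results reported in the studies.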