Lightweight multi-stage temporal inference network for video crowd counting