Recently I discovered, via Twitter, the paper σ-GPTs: A New Approach to Autoregressive Models by Arnaud Pannatier et al. It introduces a very creative twist on the standard autoregressive pretraining of transformers: by training autoregressively on shuffled token sequences, the model learns to make predictions out of order.

To do this, each token needs two position embeddings: one for the token's own position in the original (unshuffled) sequence, and one for the position of the token to be predicted next.
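
As a minimal sketch of what this double positional encoding might look like (the module and dimension names below are my own, not from the paper's code):

```python
import torch
import torch.nn as nn

class DoublePositionEmbedding(nn.Module):
    """Sketch of a sigma-GPT-style input embedding (illustrative, not the paper's code).

    Each shuffled token is embedded together with two positions:
    its own original position and the original position of the
    token the model should predict next.
    """

    def __init__(self, vocab_size: int, max_len: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos_current = nn.Embedding(max_len, d_model)  # this token's original position
        self.pos_target = nn.Embedding(max_len, d_model)   # position of the token to predict

    def forward(self, tokens, current_pos, target_pos):
        # tokens, current_pos, target_pos: (batch, seq_len) integer tensors
        return self.tok(tokens) + self.pos_current(current_pos) + self.pos_target(target_pos)
```

During training, one would draw a random permutation of the sequence, set current_pos to each shuffled token's original index and target_pos to the original index of the following token in the shuffled order, and keep the usual next-token cross-entropy loss.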

This simple change brings it closer to masked autoencoders and diffusion models, in the sense that the model can condition on partial observations and predict the rest. It also enables different inference strategies.
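
For instance, decoding missing positions in an arbitrary order could look roughly like this, assuming a hypothetical `model` with the interface sketched above that returns next-token logits at each step (greedy argmax just to keep it short):

```python
import torch

def sample_out_of_order(model, known_tokens, known_pos, missing_pos):
    """Hypothetical sketch: fill in missing positions one at a time,
    in any order, conditioning on everything known or generated so far."""
    tokens, positions = list(known_tokens), list(known_pos)
    for pos in missing_pos:  # the order of `missing_pos` is arbitrary
        logits = model(
            torch.tensor([tokens]),
            torch.tensor([positions]),
            # each input step predicts the position queried at the next step;
            # the last step queries the position we want to fill now
            torch.tensor([positions[1:] + [pos]]),
        )
        next_tok = int(torch.argmax(logits[0, -1]))
        tokens.append(next_tok)
        positions.append(pos)
    return tokens, positions
```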

Out-of-order inference can be useful in a reinforcement learning or planning context. Take, for example, upside-down reinforcement learning, where the problem is framed as simply predicting actions from past observations and a future goal.

A pretrained σ-GPT could be used elegantly in this context. If we start with a goal token, the rest of the tokens are automatically conditioned on the goal. This is very close to the maze example in the σ-GPT paper. I am not sure whether the actions and observations should be separate tokens, as in the Decision Transformer. They probably should. I hope to run some experiments soon.
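
Reusing the hypothetical `sample_out_of_order` helper above, the goal-first conditioning could be as simple as pinning the goal token to the final timestep and filling in everything before it (the token id and episode length here are placeholders):

```python
GOAL_TOKEN = 42   # placeholder id for the desired goal state
T = 16            # placeholder episode length

# The goal is fixed at the last timestep; every prediction we make
# afterwards, whatever order we decode in, is conditioned on it.
known_tokens = [GOAL_TOKEN]
known_pos = [T - 1]
missing = list(range(T - 1))

trajectory, order = sample_out_of_order(model, known_tokens, known_pos, missing)
```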