Scalable MySQL: Avoid offset for large tables
It’s fairly common to need to iterate through all the rows in a given table and perform an action on each. The usual way to do this is fetch rows from the database in batches using LIMIT, specifying an offset and number of rows to be returned.
SELECT * FROM `myBigTable` LIMIT :OFFSET, :ROW_COUNT;
After each batch of rows is processed, the offset is increased by the size of the batch and the process is repeated until all rows have been processed.
Now that’s not exactly rocket science, but there is a problem – as the offset increases, the time taken for the query to execute progressively increases, which can mean processing very large tables will take an extremely long time. The reason is because offset works on the physical position of rows in the table which is not indexed. So to find a row at offset x, the database engine must iterate through all the rows from 0 to x.
The obvious solution would be to use the primary key instead of offset:
SELECT * FROM `myBigTable` WHERE `id` > :OFFSET LIMIT :BATCH_SIZE;
The problem here is that MySQL doesn’t guarantee that rows appear in the table in the same order as the primary key, so you could end up processing some rows twice, and worse still, some rows not at all. Luckily this can be easily solved by using ORDER BY:
SELECT * FROM `myBigTable` WHERE `id` > :OFFSET ORDER BY `id` ASC;
This might seem a bit counterintuitive using ORDER BY when we want to speed things up, but remember we’re ordering on a uniquely indexed column which will be fast anyway, and compared to the alternative of iterating through each row using an offset, it’s a huge improvement.
Conclusion
The general rule of thumb is “never use offset in a limit clause”. For small tables you probably won’t notice any difference, but with tables with over a million rows you’re going to see huge performance increases.
Remember, this doesn’t just apply to iterating through tables but whenever offset is used. For example implementing pagination of large tables, if the first few pages load quickly but subsequent pages load progressively slower then it’s likely this is the cause (or you are missing some indices!)