Scalable MySQL: Avoid offset for large tables

It’s fairly common to need to iterate through all the rows in a given table and perform an action on each. The usual way to do this is fetch rows from the database in batches using LIMIT, specifying an offset and number of rows to be returned.

SELECT * FROM `myBigTable` LIMIT :OFFSET, :ROW_COUNT;

After each batch of rows is processed, the offset is increased by the size of the batch and the process is repeated until all rows have been processed.

Now that’s not exactly rocket science, but there is a problem – as the offset increases, the time taken for the query to execute progressively increases, which can mean processing very large tables will take an extremely long time. The reason is because offset works on the physical position of rows in the table which is not indexed. So to find a row at offset x, the database engine must iterate through all the rows from 0 to x.

The obvious solution would be to use the primary key instead of offset:

SELECT * FROM `myBigTable` WHERE `id` > :OFFSET LIMIT :BATCH_SIZE;

The problem here is that MySQL doesn’t guarantee that rows appear in the table in the same order as the primary key, so you could end up processing some rows twice, and worse still, some rows not at all. Luckily this can be easily solved by using ORDER BY:

SELECT * FROM `myBigTable` WHERE `id` > :OFFSET ORDER BY `id` ASC;

This might seem a bit counterintuitive using ORDER BY when we want to speed things up, but remember we’re ordering on a uniquely indexed column which will be fast anyway, and compared to the alternative of iterating through each row using an offset, it’s a huge improvement.

Conclusion

The general rule of thumb is “never use offset in a limit clause”. For small tables you probably won’t notice any difference, but with tables with over a million rows you’re going to see huge performance increases.

Remember, this doesn’t just apply to iterating through tables but whenever offset is used. For example implementing pagination of large tables, if the first few pages load quickly but subsequent pages load progressively slower then it’s likely this is the cause (or you are missing some indices!)

This entry was posted in High Scalability, MySQL and tagged , , , , , , , , . Bookmark the permalink.

8 Responses to Scalable MySQL: Avoid offset for large tables

  1. James C says:

    Hi Nick,
    Cheers for the useful post – I wasn’t aware of that so will definitely be using that in the future!

  2. Pingback: High Fibre Programming » Multitasking: PHP in parallel

  3. Jack Kanaska says:

    Did you forgot to add “LIMIT :BATCH_SIZE” in the final query?

  4. Nick says:

    Hi Jack,

    Yes, indeed I did – thanks for pointing that out! The final query should be:

    SELECT * FROM `myBigTable` WHERE `id` > :OFFSET ORDER BY `id` ASC LIMIT :BATCH_SIZE;

  5. Tibi Neagu says:

    Hi Nick,

    This is definitely a great solution, but what happens if you have lots of missing lines in the table? For example I have a table with 6M rows, but the biggest ID is around 9M.

    Normally I’d count the entries in the table and divide them by the batch size, to get an idea of how many times I need to run the query. But in this case, the last batch would start at about (let’s say) id=6M – and the IDs ranging from 6M to 9M would be ignored…

  6. Nick says:

    Hello.

    Just thinking off the top of my head, one solution might be first to find the range of values in the primary key:

    SELECT MIN(id), MAX(id) FROM myBigTable

    Then in your code you could start at the minimum value, incrementing by your chosen batch size on each iteration up until you reach the maximum value.

    In this case, if you have values missing in the primary key then the returned rows will be less than the batch size, but unless you have some really huge gaps then it shouldn’t be a problem.

  7. ben says:

    Hello;

    Didn’t work for me at all :(

    Below was much faster compared to your solution.

    SELECT `products_table`.`id`, `code`, `class`, `category`, `status`, `price`, `production_date`, `products_status`.`title` AS STATUS
    FROM (`products_table`)
    JOIN `products_status` ON `products_status`.`id` = `products_table`.`status`
    WHERE `class` = ‘4’
    AND `category` = ‘M’
    LIMIT 14900, 100

  8. Nick says:

    I guess it probably won’t work in all cases, a lot will depend on your table in terms of data and indices. You’d need to look at the explain plans to determine what’s really going on. Do you have an index on products_table.class? MySQL might have a problem using the two together, but as I say, the explain plan will tell you more.

Leave a Reply

Your email address will not be published. Required fields are marked *