Selecting Multiple Rows by Id, Is There a Faster Way Than Where In

Selecting multiple rows by ID, is there a faster way than WHERE IN

OK so I got it going really fast by defining a table type and then passing that type directly into the query and joining onto it.

in SQL

CREATE TYPE [dbo].[IntTable] AS TABLE(
    [value] [int] NULL
)

in code

DataTable dataTable = new DataTable("mythang");
dataTable.Columns.Add("value", typeof(Int32));

toSelect.ToList().ForEach(selectItem => dataTable.Rows.Add(selectItem));

using (SqlCommand command = new SqlCommand(
    @"SELECT * 
    FROM [dbo].[Entities] e 
    INNER JOIN @ids on e.id = value", con))
{
    var parameter = command.Parameters.AddWithValue("@ids", dataTable);
    parameter.SqlDbType = System.Data.SqlDbType.Structured;
    parameter.TypeName = "IntTable";

    using (SqlDataReader reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            results.Add(reader.GetInt32(0));
        }
    }
}

this produces the following results

Querying for 1 random keys (passed in table value) took 2ms
Querying for 1000 random keys (passed in table value) took 3ms
Querying for 2000 random keys (passed in table value) took 4ms
Querying for 3000 random keys (passed in table value) took 6ms
Querying for 4000 random keys (passed in table value) took 8ms
Querying for 5000 random keys (passed in table value) took 9ms
Querying for 6000 random keys (passed in table value) took 11ms
Querying for 7000 random keys (passed in table value) took 13ms
Querying for 8000 random keys (passed in table value) took 17ms
Querying for 9000 random keys (passed in table value) took 16ms
Querying for 10000 random keys (passed in table value) took 18ms

What is the best way to select multiple rows by ID in sql?

SELECT *
FROM `table`
where ID in (5263, 5625, 5628, 5621)

is probably better, but not faster.

Most efficient way to SELECT rows WHERE the ID EXISTS IN a second table

Summary:

IN and EXISTS performed similarly in all scenarios.. Below are the parameters used to validate..
Execution cost,Time:

Same for both and optimizer produced same plan.

Memory Grant:

Same for both queries

Cpu Time,Logical reads :

Exists seems to outperform IN by little bit margin in terms of CPU Time,though reads are same..

I ran each query 10 times each using below test data set..

A very large subquery result set (100000 rows)
Duplicate rows
Null rows

For all the above scenarios, both IN and EXISTS performed in identical manner.

Some info about Performance V3 database used for testing.
20000 customers having 1000000 orders, so each customer is randomly duplicated (in a range of 10 to 100) in the orders table.

Execution cost,Time:

Below is screenshot of both queries running. Observe each query relative cost.

Sample Image

Memory Cost:

Memory grant for the two queries is also same..I Forced MDOP 1 so as not to spill them to TEMPDB..

Sample Image

CPU Time ,Reads:

For Exists:

Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Customers'. Scan count 1, logical reads 109, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 1, logical reads 3855, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 469 ms,  elapsed time = 595 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

For IN:

(20000 row(s) affected)
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Customers'. Scan count 1, logical reads 109, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 1, logical reads 3855, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 547 ms,  elapsed time = 669 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

In each case, the optimizer is smart enough to rearrange the queries.

I tend to use EXISTS only though (my opinion). One use case to use EXISTS is when you don't want to return a second table result set.

Update as per queries from Martin Smith:

I ran the below queries to find the most effective way to get rows from the first table for which a reference exists in the second table.

SELECT DISTINCT c.*
FROM Customers c
JOIN Orders o ON o.custid = c.custid   

SELECT c.*
FROM Customers c
INNER JOIN (SELECT DISTINCT custid FROM Orders) AS o ON o.custid = c.custid

SELECT *
FROM Customers C
WHERE EXISTS(SELECT 1 FROM Orders o WHERE o.custid = c.custid)

SELECT *
FROM Customers c
WHERE custid IN (SELECT custid FROM Orders)

All the above queries share the same cost with the exception of 2nd INNER JOIN, Plan being the same for the rest.

Sample Image

Memory Grant:

This query

SELECT DISTINCT c.*
FROM Customers c
JOIN Orders o ON o.custid = c.custid

required memory grant of

Sample Image

This query

SELECT c.*
FROM Customers c
INNER JOIN (SELECT DISTINCT custid FROM Orders) AS o ON o.custid = c.custid

required memory grant of ..

Sample Image

CPU Time,Reads:

For Query :

SELECT DISTINCT c.*
FROM Customers c
JOIN Orders o ON o.custid = c.custid   

(20000 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 48, logical reads 1344, physical reads 96, read-ahead reads 1248, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 5, logical reads 3929, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Customers'. Scan count 5, logical reads 322, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 1453 ms,  elapsed time = 781 ms.

For Query:

SELECT c.*
FROM Customers c
INNER JOIN (SELECT DISTINCT custid FROM Orders) AS o ON o.custid = c.custid

(20000 row(s) affected)
Table 'Customers'. Scan count 5, logical reads 322, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 5, logical reads 3929, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 1499 ms,  elapsed time = 403 ms.

Is there a way to select the maximum row id in MySQL without scanning all rows with MAX()?

The ugly fix is to add an index on groupID, id

alter table `table` add index groupId_with_id_idx (groupId, id);
desc SELECT MAX(id) FROM table use index (groupId_with_id_idx) WHERE groupID=12345;
/* the execution plan should return "Select tables optimized away" */

What is Select tables optimized away?

Faster way of Selecting a row from 3 tables using same id#

You can do this in a single SQL statement using a sub query to find the ID.

SELECT
  address_contact.id,
  address_contact.lastname,
  address_contact.firstname,
  address_contact.primaryAddType,
  address_address.id,
  address_address.phone1,
  address_address.phone2,
  address_address.line2,
  address_email.id,
  address_email.email
FROM
  address_address
  LEFT JOIN address_contact ON address_address.id = address_contact.id 
  LEFT JOIN address_email   ON address_address.id = address_email.id                  
  WHERE address_contact.id = (
    SELECT id FROM address_contact ORDER BY lastUpdate DESC LIMIT 1
  )

What is the fastest way to select a row of a data frame with a specific ID value?

Using data.table, we can set key on a column which physically reorders the data by the column specified in increasing order, which allows us to subset then by using binary search.

library(data.table)
setkey(setDT(data1), id)[.(id1)]

Alternatively, from data.table versions 1.9.4+, subsets of the form DT[x == .] and DT[x %in% .] are both optimised internally to create an index automatically and then use binary search to subset on successive runs, which is incredibly fast (see benchmarks below).

setDT(data1)[id == id1] # internally optimised to generate index automatically

Have a look at this post for more info.

data

 set.seed(24)
 data1 <- data.frame(id= sample(1:6, 25, replace=TRUE), val=rnorm(25))
 id1 <- 5L

PS: setDT() converts the data.frame to data.table by reference.

Benchmarks

set.seed(29)
dat2 <- data.frame(id= sample(1:100, 1e8, replace=TRUE), val=rnorm(1e8))

# data.frame subset in base R
system.time(dat2[dat2$id == id1,])
#   user  system elapsed 
#  6.287   0.646   7.081 

# base R like syntax on data.table; create index and subset using binary search
system.time(setDT(dat2)[id == id1])
#  user  system elapsed 
# 0.646   0.232   0.889 
# successive runs are incredibly fast!
# 0.037   0.002   0.039
# 0.040   0.002   0.042 

# alternatively set key once 
system.time(setkey(setDT(dat2), id))
#  2.908   0.499   3.440 
# and use binary search explicitly
system.time(dat2[.(id1)])
#   user  system elapsed 
#  0.009   0.002   0.012

Select rows with same id but different value in another column

This ought to do it:

SELECT *
FROM YourTable
WHERE ARIDNR IN (
    SELECT ARIDNR
    FROM YourTable
    GROUP BY ARIDNR
    HAVING COUNT(*) > 1
)

The idea is to use the inner query to identify the records which have a ARIDNR value that occurs 1+ times in the data, then get all columns from the same table based on that set of values.

select rows in sql with latest date for each ID repeated multiple times

This question has been asked before. Please see this question.

Using the accepted answer and adapting it to your problem you get:

SELECT tt.*
FROM myTable tt
INNER JOIN
    (SELECT ID, MAX(Date) AS MaxDateTime
    FROM myTable
    GROUP BY ID) groupedtt 
ON tt.ID = groupedtt.ID 
AND tt.Date = groupedtt.MaxDateTime

Selecting Multiple Rows by Id, Is There a Faster Way Than Where In