Java: How to Retrieve More Than 1 Million Rows in a ResultSet

Java how to retrieve more than 1 million rows in a Resultset

Setting fetchSize alone might not be enough for the MySQL driver to start streaming the data from the database instead of loading everything at once. You could try:

stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                            ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
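A fuller sketch of the streaming setup, assuming MySQL Connector/J (the JDBC URL, credentials, table, and column names are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class StreamDemo {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details -- replace with your own.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/mydb", "user", "pass");
             Statement stmt = conn.createStatement(
                 ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Integer.MIN_VALUE is Connector/J's signal to stream rows
            // one at a time instead of buffering the whole result set.
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery("SELECT id FROM big_table")) {
                while (rs.next()) {
                    long id = rs.getLong(1);
                    // process each row without holding all rows in memory
                }
            }
        }
    }
}
```

One caveat with streamed result sets in Connector/J: the connection cannot run further statements until the streaming result set has been fully read or closed.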

Does a ResultSet load all data into memory or only when requested?

The Java ResultSet is a pointer (or cursor) to the results in the database. The ResultSet loads records in blocks from the database. So to answer your question: the data is fetched only when you request it, but in blocks.

If you need to control how many rows are fetched at once by the driver, you can use the setFetchSize(int rows) method on the ResultSet. This lets you control how large the retrieved blocks are.
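As a sketch (the connection, table name, and the concrete fetch sizes are placeholders), the fetch size can be set on the statement before the query and overridden on the result set afterwards:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class FetchSizeDemo {
    // Sketch only: assumes an already-open Connection to some database.
    static void readInBlocks(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.setFetchSize(1000);     // driver fetches 1000 rows per trip
            try (ResultSet rs = stmt.executeQuery("SELECT id FROM big_table")) {
                rs.setFetchSize(5000);   // override for the remaining trips
                while (rs.next()) {
                    // each rs.next() is served from the driver's current block;
                    // a new block is fetched only when that block is exhausted
                }
            }
        }
    }
}
```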

How many times does a Java ResultSet ask for data from database?

The Oracle JDBC "Result Set" documentation states:

Fetch Size

By default, when Oracle JDBC runs a query, it retrieves a result set of 10 rows at a time from the database cursor. This is the default Oracle row fetch size value. You can change the number of rows retrieved with each trip to the database cursor by changing the row fetch size value.

Standard JDBC also enables you to specify the number of rows fetched with each database round-trip for a query, and this number is referred to as the fetch size. In Oracle JDBC, the row-prefetch value is used as the default fetch size in a statement object. Setting the fetch size overrides the row-prefetch setting and affects subsequent queries run through that statement object.

Fetch size is also used in a result set. When the statement object runs a query, the fetch size of the statement object is passed to the result set object produced by the query. However, you can also set the fetch size in the result set object to override the statement fetch size that was passed to it.

Note:

Changes made to the fetch size of a statement object after a result set is produced will have no effect on that result set.

The result set fetch size, either set explicitly, or by default equal to the statement fetch size that was passed to it, determines the number of rows that are retrieved in any subsequent trips to the database for that result set. This includes any trips that are still required to complete the original query, as well as any refetching of data into the result set. Data can be refetched, either explicitly or implicitly, to update a scroll-sensitive or scroll-insensitive/updatable result set.

Setting the Fetch Size

The following methods are available in all Statement, PreparedStatement, CallableStatement, and ResultSet objects for setting and getting the fetch size:

void setFetchSize(int rows) throws SQLException
int getFetchSize() throws SQLException

To set the fetch size for a query, call setFetchSize on the statement object prior to running the query. If you set the fetch size to N, then N rows are fetched with each trip to the database.

After you have run the query, you can call setFetchSize on the result set object to override the statement object fetch size that was passed to it. This will affect any subsequent trips to the database to get more rows for the original query, as well as affecting any later refetching of rows.

If you want to know the number of trips to the database, call getFetchSize() on the statement, then divide the total number of rows in the result set by the fetch size and round up.
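The arithmetic above can be sketched as follows (the method name is illustrative):

```java
class TripCount {
    // Number of database round-trips needed to drain a result set:
    // ceiling division of the total row count by the fetch size.
    static long trips(long totalRows, int fetchSize) {
        return (totalRows + fetchSize - 1) / fetchSize;
    }

    public static void main(String[] args) {
        System.out.println(trips(1_000_000, 10));    // Oracle's default fetch size -> 100000 trips
        System.out.println(trips(1_000_000, 5000));  // larger fetch size -> 200 trips
    }
}
```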

Retrieve million records in batches from Apache Hive - Java 8 + Spring Boot + Hive 1.2.1 version

If you have some Primary Key candidate (can be a list of columns) which can be used in order by then you can use row_number():

select --column list here
from (
select t.*, row_number() OVER (ORDER by PK) as rn --use PK in order by
from table_name t
) s
where rn between 1000001 and 2000000

Just check that your PK candidate is unique and not null: if it is not unique or can be null, row_number() may behave non-deterministically and produce different results from run to run.

If you do not have a PK candidate, this approach cannot be implemented: Hive may return rows in a different order on each run due to parallel execution, which can duplicate rows across batches or lose some rows entirely.
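The batching above can be driven from Java by precomputing the rn windows; this is a sketch, with the batch size and total row count as placeholder values:

```java
import java.util.ArrayList;
import java.util.List;

class BatchRanges {
    // One [start, end] rn window per batch, 1-based inclusive,
    // matching the predicate: where rn between ? and ?
    static List<long[]> ranges(long totalRows, long batchSize) {
        List<long[]> out = new ArrayList<>();
        for (long start = 1; start <= totalRows; start += batchSize) {
            out.add(new long[] { start, Math.min(start + batchSize - 1, totalRows) });
        }
        return out;
    }

    public static void main(String[] args) {
        for (long[] r : ranges(2_500_000, 1_000_000)) {
            // Each window would be bound into the row_number() query, e.g.:
            //   select ... from (select t.*, row_number() over (order by pk) rn
            //                    from table_name t) s where rn between r[0] and r[1]
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```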

Conversion of resultset to Java model slow for a million records

Three suggestions.

  1. As @Luiggi Mendoza notes in a comment, initialize your List with a large initial capacity so it isn't reallocating and copying all day.
  2. Depending on your DB and JDBC driver, you may get a speedup with ps.setFetchSize(50000); (experiment with the constant). If your fetch size is small, you are making a lot of I-want-more round-trips to the server.
  3. Don't guess which step is the bottleneck; use a profiler.
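Suggestions 1 and 2 together, as a sketch (the row model, query, and connection are placeholders):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class BulkLoad {
    // Placeholder model class for one row.
    static final class Row {
        final long id;
        final String name;
        Row(long id, String name) { this.id = id; this.name = name; }
    }

    static List<Row> load(Connection conn, int expectedRows) throws SQLException {
        // Suggestion 1: size the list up front to avoid repeated reallocation.
        List<Row> rows = new ArrayList<>(expectedRows);
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT id, name FROM big_table")) {
            // Suggestion 2: fewer round-trips; tune the constant for your driver.
            ps.setFetchSize(50_000);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(new Row(rs.getLong(1), rs.getString(2)));
                }
            }
        }
        return rows;
    }
}
```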

