Database table with millions of rows
Millions of rows is not a problem; this is what SQL databases are designed to handle, provided you have a well-designed schema and good indexes.
Use proper types
Instead of storing your dates and times as separate strings, store them either as a single datetime or as separate date and time types. See the indexing section below for which to choose. This is more compact, allows indexing and faster sorting, and makes the date and time functions available without conversions.
Similarly, be sure to use an appropriate numeric type for latitude and longitude. You'll probably want numeric to ensure precision.
Since you're going to be storing billions of rows, be sure to use a bigint for your primary key. A regular int can only go up to about 2 billion.
Move repeated data into another table.
Instead of storing information about the device in every row, store that in a separate table, and only store the device's ID in your log. This will cut down on your storage size and eliminate mistakes due to data duplication. Be sure to declare the device ID as a foreign key; this provides referential integrity and an index.
Add indexes
Indexes are what allow a database to search through millions or billions of rows very, very efficiently. Be sure there are indexes on the columns you search on frequently, such as your timestamp.
A lack of indexes on date and deviceID is likely why your queries are so slow. Without an index, MySQL has to look at every row in the table, known as a full table scan.
You can discover whether your queries are using indexes with explain.
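For example, prefixing one of your slow queries with explain (assuming a gps_logs table with a created_at column, as sketched later in this answer) shows how MySQL plans to execute it:

```sql
explain select * from gps_logs
where created_at >= '2018-07-14 00:00:00'
  and created_at <  '2018-07-15 00:00:00';
```

In the output, the key column shows which index, if any, will be used; key = NULL combined with type = ALL means a full table scan.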
datetime or time + date?
Normally it's best to store your date and time in a single column, conventionally called created_at. Then you can use date to get just the date part, like so.
select *
from gps_logs
where date(created_at) = '2018-07-14'
There's a problem, and the problem is how indexes work... or don't. Because of the function call, where date(created_at) = '2018-07-14' will not use an index. MySQL will run date(created_at) on every single row, which means a performance-killing full table scan.
You can work around this by comparing against the datetime column directly. This will use an index and be efficient.
select *
from gps_logs
where '2018-07-14 00:00:00' <= created_at and created_at < '2018-07-15 00:00:00'
Or you can split your single datetime column into date and time columns, but this introduces new problems. Querying ranges that cross a day boundary becomes difficult, for example a day in a different time zone. It's easy with a single column.
select *
from gps_logs
where '2018-07-12 10:00:00' <= created_at and created_at < '2018-07-13 10:00:00'
But it's more involved with separate date and time columns.
select *
from gps_logs
where (created_date = '2018-07-12' and created_time >= '10:00:00')
or (created_date = '2018-07-13' and created_time < '10:00:00');
Or you can switch to a database with expression indexes, like PostgreSQL. An expression index allows you to index the result of a function or expression. And PostgreSQL does a lot of things better than MySQL. This is what I recommend.
Do as much work in SQL as possible.
For example, if you want to know how many log entries there are per device per day, rather than pulling all the rows out and calculating them yourself, you'd use group by to group them by device and day.
select gps_device_id, count(id) as num_entries, created_at::date as day
from gps_logs
group by gps_device_id, day;
 gps_device_id | num_entries |    day
---------------+-------------+------------
             1 |       29310 | 2018-07-12
             2 |       23923 | 2018-07-11
             2 |       23988 | 2018-07-12
With this much data, you will want to rely heavily on group by and the associated aggregate functions like sum, count, max, min and so on.
Avoid select *
If you must retrieve 86,400 rows, simply fetching all that data from the database can be costly. You can speed this up significantly by only fetching the columns you need. This means using select only, the, specific, columns, you, need rather than select *.
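For example, if all you need are the positions and their timestamps, fetch just those columns (column names here follow the schema sketched below):

```sql
-- fetch only the three columns needed instead of every column in the row
select latitude, longitude, created_at
from gps_logs
where gps_device_id = 42;
```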
Putting it all together.
In PostgreSQL
Your schema in PostgreSQL should look something like this.
create table gps_devices (
id serial primary key,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigserial primary key,
gps_device_id int references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
create index date_and_device on gps_logs((created_at::date), gps_device_id);
A query can generally only use one index per table. Since you'll frequently be searching on the timestamp and device ID together, timestamp_and_device indexes both of them. date_and_device is the same idea, but it's an expression index on just the date part of the timestamp. This makes where created_at::date = '2018-07-12' and gps_device_id = 42 very efficient.
In MySQL
create table gps_devices (
id int primary key auto_increment,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigint primary key auto_increment,
gps_device_id int,
foreign key (gps_device_id) references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
Very similar, but with no expression index. So you'll either need to always use a bare created_at in your where clauses, or switch to separate date and time columns.
Is having millions of tables and millions of rows within them a common practice in MySQL database design?
1 - Definitely not. Almost anyone you ask will tell you millions of tables is a terrible idea.
2 - Millions of ROWS is common, so just fine.
3 - Probably terribly, especially if the queries are written by someone who thinks it's OK to have millions of tables. That tells me this is someone who doesn't understand databases very well.
4 - See #3
5 - Impossible to tell. You will have a lot of extra overhead from the extra tables as they all need extra metadata. Space needed will depend on indexes and how wide the tables are, along with a lot of other factors.
In short, this is a very very very seriously bad idea and you should not do it.
Fastest method to fill a database table with 10 Million rows
Using SQL to load a lot of data into a database will usually result in poor performance. To do things quickly, you need to go around the SQL engine. Most databases (including Firebird, I think) have the ability to back up all the data into a text (or maybe XML) file and to restore the entire database from such a dump file. Since the restoration process doesn't need to be transaction-aware and the data isn't represented as SQL, it is usually very quick.
I would write a script that generates a dump file by hand, and then use the database's restore utility to load the data.
After a bit of searching I found FBExport, which seems to be able to do exactly that: you'll just need to generate a CSV file and then use the FBExport tool to import that data into your database.
Mysql database design: table with half a million rows
- It's not a good idea to create tables at runtime, so you should create a single table in this case.
- Reviews table: 500,000 rows is fine.
- Vote Up table: in practice the number of rows will not reach 100,000,000, because it's very unlikely that every post will be voted on by all the users, so you can create a single table here as well.
Efficient lookup in MySQL table with millions of rows
20 million rows is not a lot for MySQL. Just index the zip/postal code and it will be fast. Way under 200ms fast. No need to split between tables. MySQL does get slow when the result set is large, but it doesn't seem like you would encounter that issue. MySQL will do just fine with hundreds of millions of records for basic queries like yours.
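As a hedged sketch of what "just index the zip/postal code" means (the table and column names here are assumptions, not from the question):

```sql
-- a single-column index on the postal code
create index zip_code_idx on postal_codes (zip_code);

-- an equality lookup now reads a handful of index pages instead of
-- scanning the table, whether it holds 20 thousand or 20 million rows
select * from postal_codes where zip_code = '90210';
```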
You will need to adjust the MySQL settings so that it uses more memory. The default settings are pretty low.
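The setting with the biggest impact is usually InnoDB's buffer pool, which defaults to just 128MB. An illustrative my.cnf fragment (the 4G figure is an assumption; size it to your hardware):

```
[mysqld]
# cache as much of the data and indexes in memory as possible;
# on a dedicated database server this is often 50-75% of RAM
innodb_buffer_pool_size = 4G
```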
MySQL does support spatial indexes, so you could pull the longitude/latitude for the postal codes and use a spatial index to do proximity searches. It doesn't seem like you are looking for that, though.
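If you did want proximity searches, here is a rough sketch; the table and column names are assumptions, and note that MySQL's spatial index accelerates bounding-rectangle functions such as MBRContains rather than arbitrary distance sorts.

```sql
-- hypothetical table; InnoDB requires the indexed geometry column to be NOT NULL
create table zip_codes (
  zip char(5) primary key,
  location point not null,            -- longitude/latitude as a geometric point
  spatial index location_idx (location)
);

-- find candidates inside a bounding rectangle around the search point;
-- MBRContains can use the spatial index
select zip
from zip_codes
where mbrcontains(
  st_geomfromtext('polygon((-74.1 40.6, -73.9 40.6, -73.9 40.8, -74.1 40.8, -74.1 40.6))'),
  location
);
```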
If you want things really, really fast, go the route you were thinking of but use memcache or redis. You can use the zip/postal code as the lookup key. You would still need a persistent disk-based data store to load the data from. I don't think memcache/redis is necessary, but it's an option.