How to Load 100 Million Rows into Memory

How much memory do I need for 100 million records?

With 100,000,000 records, you need to allow for overhead. Exactly what that overhead is, and how much of it you get, depends on the language.

In C/C++, for example, fields in a structure or class are aligned on specific boundaries. Details vary by compiler and platform, but in general ints must begin at an address that is a multiple of 4, shorts at a multiple of 2, and chars can begin anywhere.

So assuming that your 4+2+1 means an int, a short, and a char, then if you arrange them in that order the data itself takes 7 bytes, but the next instance of the structure must begin on a 4-byte boundary, so you'll get 1 pad byte at the end and the structure will occupy 8 bytes. (A struct as a whole is aligned to its most strictly aligned member, which here is the 4-byte int.)

Every time you allocate memory there's some overhead for the allocation block: the allocator has to keep track of how much memory was allocated and, sometimes, where the next block is. If you allocate all 100,000,000 records as one big "new" or "malloc", this overhead is trivial. But if you allocate each record individually, every record carries that overhead. Exactly how much it is depends on the implementation; on one system I used it was 8 bytes per allocation. If that's the case, you'd need 16 bytes per record: 8 bytes for the block header, 7 for the data, and 1 for padding. So it could easily take double what you expect.

Other languages will have different overhead. The easiest thing to do is probably to find out empirically: look up the call that reports how much memory your process is using, check that value, allocate a million instances, check it again, and look at the difference.
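
In Java, for example, a minimal sketch of that empirical approach could look like the following. Treat the result as a rough estimate only: the JVM's heap accounting and garbage collection add noise, and the Record class here is just a placeholder carrying a 4+2+1-byte payload.

// Rough per-record memory estimate: measure used heap before and after
// allocating a million instances, then divide by the count.
public class MemoryEstimate {
    static class Record {
        int a;      // 4 bytes of data
        short b;    // 2 bytes
        byte c;     // 1 byte, plus object header and padding on top
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.gc();                                      // encourage a clean baseline (best effort)
        long before = rt.totalMemory() - rt.freeMemory();

        int n = 1_000_000;
        Record[] records = new Record[n];
        for (int i = 0; i < n; i++) {
            records[i] = new Record();                    // one allocation per record, like one malloc per struct
        }

        long after = rt.totalMemory() - rt.freeMemory();
        System.out.println("Approx. bytes per record: " + (after - before) / (double) n);
        System.out.println(records.length);               // keep the array reachable until after the measurement
    }
}

Multiply the per-record figure by 100,000,000 for a ballpark total; expect it to be noticeably more than the 7 bytes of raw data per record.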

How to import a 100-million-row table into a database?

100 million rows daily?

You have to be realistic. I doubt any single instance of any database out there can handle this kind of throughput efficiently. You should probably look at clustering options and other optimisation techniques, such as splitting the data across two or more databases (sharding).
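
As a minimal sketch of the sharding idea, assuming MySQL with the JDBC driver on the classpath, the snippet below routes each row to one of several database instances by hashing its key. The shard URLs, credentials, table, and columns are placeholders, not a recommendation for any particular schema.

// Hypothetical sharding sketch: pick a target database per row by hashing its key.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ShardedInsert {
    // One MySQL instance per shard (placeholder URLs).
    private static final String[] SHARD_URLS = {
        "jdbc:mysql://db0.example.com/mydb",
        "jdbc:mysql://db1.example.com/mydb"
    };

    static int shardFor(long key) {
        // Simple modulo sharding; Math.floorMod avoids negative indexes.
        return Math.floorMod(Long.hashCode(key), SHARD_URLS.length);
    }

    public static void main(String[] args) throws SQLException {
        long key = 123456789L;
        String url = SHARD_URLS[shardFor(key)];
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO events (id, payload) VALUES (?, ?)")) {
            ps.setLong(1, key);
            ps.setString(2, "example payload");
            ps.executeUpdate();
        }
    }
}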

MySQL Enterprise has a number of built-in features that can ease setting up and monitoring a cluster, but I think the MySQL Community edition supports clustering too.

Good luck!

Improving speed and memory consumption when handling ArrayList with 100 million elements

First of all, don't do Files.readAllLines(Paths.get("filename")) and then pass everything to a Set; that holds an unnecessarily huge amount of data in memory at once. Try to hold as few lines as possible at all times.

Read the files line-by-line and process as you go. This immediately cuts your memory usage by a lot.

Set<String> oldData = new HashSet<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("oldData"))) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        // process your line, maybe add to the Set for the old data?
        oldData.add(line);
    }
}

Set<String> newData = new HashSet<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("newData"))) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        // Is it enough just to remove from old data so that you'll end up with only the difference between old and new?
        boolean oldRemoved = oldData.remove(line);
        if (!oldRemoved) {
            newData.add(line);
        }
    }
}

You'll end up with two sets containing only the data that exists exclusively in the old dataset or exclusively in the new dataset, respectively.

Second of all, try to presize your containers if at all possible. Their capacity (usually) doubles when they fill up, and each resize rehashes the existing elements, which can create a lot of overhead when dealing with big collections.
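
For example, if you know roughly how many lines each file holds, you can size the sets up front so they never have to resize (expectedLines is a placeholder estimate; 0.75 is HashSet's default load factor):

// Presize a HashSet so it can hold expectedLines entries without resizing.
// A HashSet resizes once its size exceeds capacity * loadFactor (0.75 by default).
int expectedLines = 100_000_000;                         // placeholder estimate of the line count
int initialCapacity = (int) (expectedLines / 0.75f) + 1;
Set<String> oldData = new HashSet<>(initialCapacity);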

Also, if your data are numbers, you could hold a primitive long instead of instances of String. There are plenty of collection libraries that let you do this, e.g. Koloboke, HPPC, HPPC-RT, GS Collections, fastutil, and Trove. Even their collections for Objects might serve you very well, as a standard HashSet involves a lot of unnecessary object allocation.
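
As a sketch, assuming your lines parse as numbers and you have fastutil on the classpath, a primitive long set avoids creating a boxed Long object per element:

// fastutil's LongOpenHashSet (package it.unimi.dsi.fastutil.longs) stores primitive longs
// directly in its backing array, so there is no per-element object allocation.
LongOpenHashSet oldIds = new LongOpenHashSet(100_000_000);   // presized; the capacity is a placeholder
oldIds.add(Long.parseLong("1234567890"));                    // in practice, parse each line as you read it
boolean seen = oldIds.contains(1234567890L);

Each entry then costs roughly 8 bytes in the backing array, instead of a reference plus a boxed Long object in a regular HashSet<Long>.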

How do I prevent running out of memory when inserting a million rows into MySQL with PHP?

When scaffolding my application I installed the itsgoingd/clockwork package. It's a habit I've gotten into because it's such a useful tool. (I recommend that any Laravel devs reading this check it out!)

Of course, this package logs queries for debugging purposes and thus was the culprit eating up all of my memory.

Disabling the package by deleting the reference to its service provider in my configuration solved the issue.

Cheers to @dbushy727 for giving me a push in the right direction to figure this out.


