Splitting a String into Chunks of a Certain Size

Splitting a string into chunks of a certain size


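using System.Collections.Generic;
using System.Linq;

static IEnumerable<string> Split(string str, int chunkSize)
{
    // Integer division: a trailing partial chunk (if any) is not returned.
    return Enumerable.Range(0, str.Length / chunkSize)
        .Select(i => str.Substring(i * chunkSize, chunkSize));
}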

Please note that additional code might be required to gracefully handle edge cases (null or empty input string, chunkSize == 0, input string length not divisible by chunkSize, etc.). The original question doesn't specify any requirements for these edge cases and in real life the requirements might vary so they are out of scope of this answer.
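As an illustration of what such edge-case handling could look like, here is a minimal sketch in Python. The policy it applies is an assumption, not something the original question specifies: None is treated like an empty string, a non-positive chunk size is rejected, and a shorter final piece is kept. The name split_fixed is hypothetical.

```python
def split_fixed(text, chunk_size):
    """Return consecutive chunk_size-character pieces of text.

    Assumed policy (not specified in the original question):
    None is treated like an empty string, a non-positive chunk_size
    is an error, and a shorter final piece is kept.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not text:
        return []
    # range() steps through the start offsets 0, chunk_size, 2*chunk_size, ...
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


print(split_fixed("abcdefg", 3))  # ['abc', 'def', 'g']
print(split_fixed("", 3))         # []
```

Other policies (throwing on null input, dropping or padding the last piece) are just as defensible; which one is right depends on the caller.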

What's the best way to split a string into fixed length chunks and work with them in Python?

One solution would be to use this function:

def chunkstring(string, length):
    return (string[0+i:length+i] for i in range(0, len(string), length))

This function returns a generator, built with a generator expression. Each item is a slice of the string that starts at a multiple of the chunk length (0 + i) and ends one chunk length later (length + i).

You can iterate over the generator as you would over a list, tuple or string (for i in chunkstring(s, n):), or convert it into a list (for instance) with list(generator). Generators are more memory-efficient than lists because they generate their elements as they are needed rather than all at once; however, they lack certain features such as indexing.

The generator also yields any shorter chunk left over at the end:

>>> list(chunkstring("abcdefghijklmnopqrstuvwxyz", 5))
['abcde', 'fghij', 'klmno', 'pqrst', 'uvwxy', 'z']
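The trade-off mentioned above (lazy evaluation, but no indexing) can be seen in a short sketch. It redefines chunkstring so that it runs on its own; the variable names are only for illustration.

```python
def chunkstring(string, length):
    return (string[i:i + length] for i in range(0, len(string), length))


gen = chunkstring("abcdefgh", 3)
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing is left on a second pass
print(first_pass)         # ['abc', 'def', 'gh']
print(second_pass)        # []

try:
    chunkstring("abcdefgh", 3)[0]
except TypeError as exc:
    # Generators do not support subscripting
    print("cannot index a generator:", exc)
```

If you need random access or several passes, convert the result to a list once and reuse that list.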

Example usage:

text = """This is the first line.
This is the second line.
The line below is true.
The line above is false.
A short line.
A very very very very very very very very very long line.
A self-referential line.
The last line.
"""

lines = (i.strip() for i in text.splitlines())

for line in lines:
    for chunk in chunkstring(line, 16):
        print(chunk)

Splitting a string into evenly-sized chunks

You are confusing the number of chunks with the chunk size.

You must calculate the chunk size with:

int chunkSize = s.Length / chunks;

If the length of the string is not divisible by chunks, this will truncate the result because integer arithmetic is performed here. E.g., if the string length is 7 and chunks = 3, this yields a chunk size of 2 with a remainder of 1. If the string length were 8, the chunk size would still be 2, but the remainder would be 2. Now, you must distribute this remainder among the chunks.

You can get the remainder with the modulo operator %:

int remainder = s.Length % chunks;

Since you want the first chunks to be bigger, we now attribute this remainder to the first chunks:

int start = 0;
while (start < s.Length)
{
    int thisChunkSize = chunkSize;
    if (remainder > 0)
    {
        thisChunkSize++;
        remainder--;
    }
    yield return s.Substring(start, thisChunkSize);
    start += thisChunkSize;
}
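For readers who want to check the arithmetic, the same steps can be sketched in Python, where divmod returns the quotient and the remainder in one call. The function name split_into is hypothetical, and the sketch assumes chunks is a positive integer.

```python
def split_into(s, chunks):
    """Yield pieces of s; the first len(s) % chunks pieces get one extra character."""
    size, remainder = divmod(len(s), chunks)
    start = 0
    while start < len(s):
        this_size = size
        if remainder > 0:
            # Hand out one unit of the remainder to this piece
            this_size += 1
            remainder -= 1
        yield s[start:start + this_size]
        start += this_size


print(divmod(7, 3))                      # (2, 1)
print(divmod(8, 3))                      # (2, 2)
print(list(split_into("abcdefg", 3)))    # ['abc', 'de', 'fg']
print(list(split_into("abcdefgh", 3)))   # ['abc', 'def', 'gh']
```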

If you need an even better distribution, you can use floating point arithmetic and round. The MidpointRounding argument specifies what happens when rounding a value with a .5 fraction; the code below alternates between rounding up and rounding down.

public static IEnumerable<string> EvenIterator(string s, int chunks)
{
    int start = 0;
    var rounding = new[] { MidpointRounding.ToPositiveInfinity,
                           MidpointRounding.ToNegativeInfinity };
    int r = 0;
    while (start < s.Length) {
        int chunkSize = (int)Math.Round((double)(s.Length - start) / chunks, rounding[r]);
        r = 1 - r; // Swap the rounding
        yield return s.Substring(start, chunkSize);
        start += chunkSize;
        chunks--;
    }
}

A test with "abcdefghijklmno" and chunks = 6 gives:

[ "abc", "de", "fgh", "ij", "klm", "no" ]
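A similar spread can also be obtained with integer arithmetic alone, by computing the cut points directly as i * length // chunks. This is a different technique from the alternating-rounding iterator above, shown here as a Python sketch with a hypothetical name (split_evenly); it assumes chunks is a positive integer.

```python
def split_evenly(s, chunks):
    """Cut s at the offsets i * len(s) // chunks for i = 0..chunks."""
    n = len(s)
    bounds = [i * n // chunks for i in range(chunks + 1)]
    # Pair each boundary with the next one to get (start, end) slices
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]


print(split_evenly("abcdefghijklmno", 6))
# ['ab', 'cde', 'fg', 'hij', 'kl', 'mno']
```

The chunk sizes are the same as in the test above (three of length 3, three of length 2); only their order differs, because the integer floor puts the shorter piece first in each pair.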

Splitting a string into chunks of a certain nth sizes

What about this:

string t1 = str.Substring(0, 2);
string t2 = str.Substring(2, 2);
string t3 = str.Substring(4, 4);

How to split a string into chunks of a particular byte size?

Using Buffer seems indeed the right direction. Given that:

  • Buffer prototype has indexOf and lastIndexOf methods, and
  • 32 is the ASCII code of a space, and
  • 32 can never occur as part of a multi-byte character since all the bytes that make up a multi-byte sequence always have the most significant bit set.

... you can proceed as follows:

function chunk(s, maxBytes) {
    let buf = Buffer.from(s);
    const result = [];
    while (buf.length) {
        // If the rest fits, take it whole; otherwise look for the last space
        // at or before maxBytes, so the piece is at most maxBytes long
        let i = buf.length <= maxBytes ? buf.length : buf.lastIndexOf(32, maxBytes);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take the whole string
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        result.push(buf.slice(0, i).toString());
        buf = buf.slice(i+1); // Skip space (if any)
    }
    return result;
}

console.log(chunk("Hey there! € 100 to pay", 12));
// -> [ 'Hey there!', '€ 100 to', 'pay' ]

You can consider extending this to also look for TAB, LF, or CR as split-characters. If so, and your input text can have CRLF sequences, you would need to detect those as well to avoid getting orphaned CR or LF characters in the chunks.

You can turn the above function into a generator, so that you control when you want to start the processing for getting the next chunk:

function * chunk(s, maxBytes) {
    let buf = Buffer.from(s);
    while (buf.length) {
        // If the rest fits, take it whole; otherwise look for the last space
        // at or before maxBytes, so the piece is at most maxBytes long
        let i = buf.length <= maxBytes ? buf.length : buf.lastIndexOf(32, maxBytes);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take all
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        yield buf.slice(0, i).toString();
        buf = buf.slice(i+1); // Skip space (if any)
    }
}

for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);
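The same idea carries over to other languages that expose the UTF-8 bytes of a string. Below is a rough Python sketch with a hypothetical name (chunk_bytes). Its exact policy is an assumption: it takes the rest of the text whole once it fits in max_bytes, and it accepts a space at byte offset max_bytes as a cut point, since the piece before it is then exactly max_bytes long.

```python
def chunk_bytes(s, max_bytes):
    """Yield pieces of s cut at ASCII spaces, aiming for at most max_bytes UTF-8 bytes each."""
    buf = s.encode("utf-8")
    while buf:
        if len(buf) <= max_bytes:
            i = len(buf)                            # the rest fits: take it all
        else:
            i = buf.rfind(b" ", 0, max_bytes + 1)   # last space at index <= max_bytes
            if i < 0:
                i = buf.find(b" ", max_bytes)       # no space so far: search forward
            if i < 0:
                i = len(buf)                        # no space at all: take everything
        # Cutting at byte 0x20 can never split a multi-byte character
        yield buf[:i].decode("utf-8")
        buf = buf[i + 1:]                           # skip the space (if any)


# Every byte of a multi-byte UTF-8 sequence has its high bit set:
print(all(b >= 0x80 for b in "€".encode("utf-8")))       # True
print(list(chunk_bytes("Hey there! € 100 to pay", 12)))
# ['Hey there!', '€ 100 to', 'pay']
```

As in the JavaScript version, a single word longer than max_bytes is returned unsplit, so the limit is only a target when the text contains such words.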

Browsers

Buffer is specific to Node. Browsers, however, implement TextEncoder and TextDecoder, which leads to similar code:

function * chunk(s, maxBytes) {
    const decoder = new TextDecoder("utf-8");
    let buf = new TextEncoder("utf-8").encode(s);
    while (buf.length) {
        // If the rest fits, take it whole; otherwise look for the last space
        // at or before maxBytes, so the piece is at most maxBytes long
        let i = buf.length <= maxBytes ? buf.length : buf.lastIndexOf(32, maxBytes);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take all
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        yield decoder.decode(buf.slice(0, i));
        buf = buf.slice(i+1); // Skip space (if any)
    }
}


for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);

Splitting a string into fixed-size chunks

If you want speed, Rcpp is always a good choice:

library(Rcpp);
cppFunction('
    List strsplitN(std::vector<std::string> v, int N ) {
        if (N < 1) throw std::invalid_argument("N must be >= 1.");
        List res(v.size());
        for (int i = 0; i < v.size(); ++i) {
            int num = v[i].size()/N + (v[i].size()%N == 0 ? 0 : 1);
            std::vector<std::string> resCur(num,std::string(N,0));
            for (int j = 0; j < num; ++j) resCur[j].assign(v[i].substr(j*N,N));
            res[i] = resCur;
        }
        return res;
    }
');

ch <- paste(rep('a',1e6),collapse='');
system.time({ res <- strsplitN(ch,2L); });
## user system elapsed
## 0.109 0.015 0.121
head(res[[1L]]); tail(res[[1L]]);
## [1] "aa" "aa" "aa" "aa" "aa" "aa"
## [1] "aa" "aa" "aa" "aa" "aa" "aa"
length(res[[1L]]);
## [1] 500000

Useful reference: http://gallery.rcpp.org/articles/strings_with_rcpp/.


More demos:

strsplitN(c('abcd','efgh'),2L);
## [[1]]
## [1] "ab" "cd"
##
## [[2]]
## [1] "ef" "gh"
##
strsplitN(c('abcd','efgh'),3L);
## [[1]]
## [1] "abc" "d"
##
## [[2]]
## [1] "efg" "h"
##
strsplitN(c('abcd','efgh'),1L);
## [[1]]
## [1] "a" "b" "c" "d"
##
## [[2]]
## [1] "e" "f" "g" "h"
##
strsplitN(c('abcd','efgh'),5L);
## [[1]]
## [1] "abcd"
##
## [[2]]
## [1] "efgh"
##
strsplitN(character(),5L);
## list()
strsplitN(c('abcd','efgh'),0L);
## Error: N must be >= 1.

There are two important caveats with the above implementation:

1: It doesn't handle NAs correctly. Rcpp seems to stringify NA to 'NA' when it is forced to come up with a std::string. You can easily work around this on the R side with a wrapper that replaces the offending list components with a true NA.

x <- c('a',NA); strsplitN(x,1L);
## [[1]]
## [1] "a"
##
## [[2]]
## [1] "N" "A"
##
x <- c('a',NA); ifelse(is.na(x),NA,strsplitN(x,1L));
## [[1]]
## [1] "a"
##
## [[2]]
## [1] NA
##

2: It doesn't handle multibyte characters correctly. This is a tougher problem, and would require a rewrite of the core function implementation to use a Unicode-aware traversal. Fixing this problem would also incur a significant performance penalty, since you wouldn't be able to preallocate each vector in one shot prior to the assignment loop.

strsplitN('aΩ',1L);
## [[1]]
## [1] "a" "\xce" "\xa9"
##
strsplit('aΩ','');
## [[1]]
## [1] "a" "Ω"
##
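The difference between the two results above is the difference between splitting bytes and splitting characters. A small Python sketch (not part of the Rcpp solution) shows the same effect: Ω is encoded in UTF-8 as the two bytes 0xCE 0xA9, so a byte-wise split of width 1 tears it apart.

```python
s = "aΩ"
data = s.encode("utf-8")

print(len(s), len(data))                           # 2 3
# Byte-wise split: the two bytes of Ω end up in separate pieces
print([data[i:i + 1] for i in range(len(data))])   # [b'a', b'\xce', b'\xa9']
# Character-wise split: Ω stays whole
print(list(s))                                     # ['a', 'Ω']
```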

Split large string in n-size chunks in JavaScript

You can do something like this:

"1234567890".match(/.{1,2}/g);
// Results in:
["12", "34", "56", "78", "90"]

The method will still work with strings whose size is not an exact multiple of the chunk-size:

"123456789".match(/.{1,2}/g);
// Results in:
["12", "34", "56", "78", "9"]

In general, for any string out of which you want to extract at-most n-sized substrings, you would do:

str.match(/.{1,n}/g); // Replace n with the size of the substring

If your string can contain newlines or carriage returns, you would do:

str.match(/(.|[\r\n]){1,n}/g); // Replace n with the size of the substring

As for performance, I tried this out with approximately 10k characters and it took a little over a second on Chrome. YMMV.

This can also be used in a reusable function:

function chunkString(str, length) {
    return str.match(new RegExp('.{1,' + length + '}', 'g'));
}
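For comparison, the same regex idea can be sketched in Python with re.findall. The re.DOTALL flag makes the dot match newlines too, which plays the role of the (.|[\r\n]) variant above. The name chunk_string and the check on length are assumptions added for this sketch.

```python
import re


def chunk_string(s, length):
    """Return pieces of at most `length` characters, newlines included."""
    if length < 1:
        raise ValueError("length must be >= 1")
    # %d only accepts an integer, so nothing else can end up in the pattern
    return re.findall(".{1,%d}" % length, s, flags=re.DOTALL)


print(chunk_string("1234567890", 2))  # ['12', '34', '56', '78', '90']
print(chunk_string("123456789", 2))   # ['12', '34', '56', '78', '9']
print(chunk_string("ab\ncd", 2))      # ['ab', '\nc', 'd']
print(chunk_string("", 2))            # []
```

One difference worth knowing: re.findall returns an empty list for an empty string, whereas JavaScript's match returns null when nothing matches, so the JavaScript function above needs a null check if empty input is possible.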

Split string into strings by length?


>>> x = "qwertyui"
>>> chunks, chunk_size = len(x), len(x)//4
>>> [ x[i:i+chunk_size] for i in range(0, chunks, chunk_size) ]
['qw', 'er', 'ty', 'ui']

Split a string to even sized chunks

Use textwrap.wrap:

>>> import textwrap
>>> s = 'a' * 23
>>> textwrap.wrap(s, 4)
['aaaa', 'aaaa', 'aaaa', 'aaaa', 'aaaa', 'aaa']
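One caveat, illustrated with a small sketch: textwrap.wrap is designed for wrapping text, so it breaks lines at whitespace and drops the whitespace at the break. It only produces fixed-size chunks when the input contains no spaces, as in the example above. Plain slicing is shown alongside for contrast.

```python
import textwrap

s = "ab cd ef"

# wrap() breaks at the spaces and discards them
print(textwrap.wrap(s, 4))                          # ['ab', 'cd', 'ef']
# Slicing keeps every character and ignores word boundaries
print([s[i:i + 4] for i in range(0, len(s), 4)])    # ['ab c', 'd ef']
```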

Split String into smaller Strings by length variable

You need to use a loop:

public static IEnumerable<string> SplitByLength(this string str, int maxLength) {
    for (int index = 0; index < str.Length; index += maxLength) {
        yield return str.Substring(index, Math.Min(maxLength, str.Length - index));
    }
}

Alternative:

public static IEnumerable<string> SplitByLength(this string str, int maxLength) {
    int index = 0;
    while(true) {
        if (index + maxLength >= str.Length) {
            yield return str.Substring(index);
            yield break;
        }
        yield return str.Substring(index, maxLength);
        index += maxLength;
    }
}

2nd alternative: (For those who can't stand while(true))

public static IEnumerable<string> SplitByLength(this string str, int maxLength) {
    int index = 0;
    while(index + maxLength < str.Length) {
        yield return str.Substring(index, maxLength);
        index += maxLength;
    }

    yield return str.Substring(index);
}

