Splitting a string into chunks of a certain size
static IEnumerable<string> Split(string str, int chunkSize)
{
    return Enumerable.Range(0, str.Length / chunkSize)
        .Select(i => str.Substring(i * chunkSize, chunkSize));
}
Please note that additional code might be required to gracefully handle edge cases (null or empty input string, chunkSize == 0, input string length not divisible by chunkSize, etc.). The original question doesn't specify any requirements for these edge cases, and in real life the requirements might vary, so they are out of scope of this answer.
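To make the divisibility caveat concrete, here is a minimal sketch in Python rather than C#. The function names are illustrative and not part of the original answer. split_truncating mirrors the integer division used above and silently drops a trailing partial chunk; split_keep_tail shows one possible policy, assuming that empty input should give no chunks and that a non-positive chunk size is an error.

```python
def split_truncating(s, chunk_size):
    # Mirrors Enumerable.Range(0, str.Length / chunkSize): integer division
    # means a trailing partial chunk is never produced.
    return [s[i * chunk_size:(i + 1) * chunk_size]
            for i in range(len(s) // chunk_size)]


def split_keep_tail(s, chunk_size):
    # Assumed policy: reject non-positive sizes, treat None/empty input as
    # "no chunks", and keep the shorter final chunk.
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not s:
        return []
    return [s[i:i + chunk_size] for i in range(0, len(s), chunk_size)]


print(split_truncating("abcdefg", 3))  # ['abc', 'def']  (the 'g' is lost)
print(split_keep_tail("abcdefg", 3))   # ['abc', 'def', 'g']
```

Whether the tail should be kept, padded, or rejected depends on the requirements, which is why the answer above leaves it open.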
What's the best way to split a string into fixed length chunks and work with them in Python?
One solution would be to use this function:
def chunkstring(string, length):
    return (string[0+i:length+i] for i in range(0, len(string), length))
This function returns a generator, built with a generator expression. Each item is a slice of the string that starts at a multiple of the chunk length and ends one chunk length later.
You can iterate over the generator like a list, tuple or string (for i in chunkstring(s, n):), or convert it into a list (for instance) with list(generator). Generators are more memory efficient than lists because they generate their elements as they are needed, not all at once; however, they lack certain features like indexing.
This generator also contains any smaller chunk at the end:
>>> list(chunkstring("abcdefghijklmnopqrstuvwxyz", 5))
['abcde', 'fghij', 'klmno', 'pqrst', 'uvwxy', 'z']
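The difference between the generator and a list can be seen in a short sketch. The chunkstring function is repeated here (with the slice written in the equivalent form string[i:i + length]) only so that the snippet runs on its own; itertools.islice is used as one way to take a few chunks without producing the rest.

```python
from itertools import islice


def chunkstring(string, length):
    # Same generator as above, repeated so this snippet is self-contained.
    return (string[i:i + length] for i in range(0, len(string), length))


gen = chunkstring("abcdefghijklmnopqrstuvwxyz", 5)

# Chunks are produced on demand; islice takes the first two only.
first_two = list(islice(gen, 2))
print(first_two)  # ['abcde', 'fghij']

# The generator remembers its position, so the next chunk is the third one.
third = next(gen)
print(third)  # klmno

# Generators do not support indexing.
try:
    gen[0]
except TypeError:
    indexing_supported = False
else:
    indexing_supported = True
print(indexing_supported)  # False
```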
Example usage:
text = """This is the first line.
This is the second line.
The line below is true.
The line above is false.
A short line.
A very very very very very very very very very long line.
A self-referential line.
The last line.
"""
lines = (i.strip() for i in text.splitlines())
for line in lines:
    for chunk in chunkstring(line, 16):
        print(chunk)
Splitting a string into evenly-sized chunks
You are confusing the number of chunks with the chunk size.
You must calculate the chunk size with:
int chunkSize = s.Length / chunks;
If the length of the string is not divisible by chunks, this will truncate the result because integer arithmetic is performed here. E.g., if the string size is 7 and chunks = 3, then this will yield 2, and you have a remainder of 1. If the string size was 8, the chunk size would still be 2, but the remainder would be 2. Now, you must distribute this remainder among the chunks.
You can get the remainder with the modulo operator %:
int remainder = s.Length % chunks;
Since you want the first chunks to be bigger, we now attribute this remainder to the first chunks:
int start = 0;
while (start < s.Length)
{
    int thisChunkSize = chunkSize;
    if (remainder > 0)
    {
        thisChunkSize++;
        remainder--;
    }
    yield return s.Substring(start, thisChunkSize);
    start += thisChunkSize;
}
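The same remainder distribution can be sketched in Python. This is an illustrative port, not the original answer's code; the name split_into_chunks is made up, and it assumes chunks is at least 1.

```python
def split_into_chunks(s, chunks):
    # Base chunk size and the number of leftover characters to hand out.
    chunk_size, remainder = divmod(len(s), chunks)
    result = []
    start = 0
    while start < len(s):
        this_size = chunk_size
        if remainder > 0:
            # The first `remainder` chunks get one extra character.
            this_size += 1
            remainder -= 1
        result.append(s[start:start + this_size])
        start += this_size
    return result


print(split_into_chunks("abcdefg", 3))   # ['abc', 'de', 'fg']
print(split_into_chunks("abcdefgh", 3))  # ['abc', 'def', 'gh']
```

With a string of length 7 the base size is 2 and the remainder is 1, so only the first chunk grows; with length 8 the remainder is 2, so the first two chunks grow.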
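If you need an even better distribution, you can use floating point arithmetic and round. The MidpointRounding argument controls the rounding mode (in particular, what happens when rounding a value with a .5 fraction); the code alternates between ToPositiveInfinity and ToNegativeInfinity.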
public static IEnumerable<string> EvenIterator(string s, int chunks)
{
    int start = 0;
    var rounding = new[] { MidpointRounding.ToPositiveInfinity,
                           MidpointRounding.ToNegativeInfinity };
    int r = 0;
    while (start < s.Length) {
        int chunkSize = (int)Math.Round((double)(s.Length - start) / chunks, rounding[r]);
        r = 1 - r; // Swap the rounding
        yield return s.Substring(start, chunkSize);
        start += chunkSize;
        chunks--;
    }
}
A test with "abcdefghijklmno" and chunks = 6 gives:
[ "abc", "de", "fgh", "ij", "klm", "no" ]
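The alternating-rounding idea can be sketched in Python as follows. This is a hedged port: math.ceil and math.floor are used as assumed stand-ins for the two rounding modes, the name even_chunks is illustrative, and the sketch assumes 1 <= chunks <= len(s).

```python
import math


def even_chunks(s, chunks):
    result = []
    start = 0
    round_up = True  # alternate between rounding up and rounding down
    while start < len(s):
        # Ideal (fractional) size if the rest were split perfectly evenly.
        ideal = (len(s) - start) / chunks
        size = math.ceil(ideal) if round_up else math.floor(ideal)
        round_up = not round_up
        result.append(s[start:start + size])
        start += size
        chunks -= 1
    return result


print(even_chunks("abcdefghijklmno", 6))
# ['abc', 'de', 'fgh', 'ij', 'klm', 'no']
```

Because the ideal size is recomputed from the remaining length on every step, the longer and shorter chunks end up interleaved instead of the longer ones being grouped at the front.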
Splitting a string into chunks of a certain nth sizes
What about this:
string t1 = str.Substring(0, 2);
string t2 = str.Substring(2, 2);
string t3 = str.Substring(4, 4);
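If the list of sizes is not fixed at three entries, the same idea can be written as a loop. The following Python sketch is illustrative only; split_by_sizes is a made-up name, and unlike Substring, Python slicing returns a shorter (or empty) piece instead of raising an error when the string is too short.

```python
def split_by_sizes(s, sizes):
    # Walk through the string, taking one slice per requested size.
    parts = []
    start = 0
    for size in sizes:
        parts.append(s[start:start + size])
        start += size
    return parts


print(split_by_sizes("abcdefgh", [2, 2, 4]))  # ['ab', 'cd', 'efgh']
```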
How to split a string into chunks of a particular byte size?
Using Buffer seems indeed the right direction. Given that:
- the Buffer prototype has indexOf and lastIndexOf methods, and
- 32 is the ASCII code of a space, and
- 32 can never occur as part of a multi-byte character since all the bytes that make up a multi-byte sequence always have the most significant bit set.
... you can proceed as follows:
function chunk(s, maxBytes) {
    let buf = Buffer.from(s);
    const result = [];
    while (buf.length) {
        let i = buf.lastIndexOf(32, maxBytes+1);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take the whole string
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        result.push(buf.slice(0, i).toString());
        buf = buf.slice(i+1); // Skip space (if any)
    }
    return result;
}
console.log(chunk("Hey there! € 100 to pay", 12));
// -> [ 'Hey there!', '€ 100 to', 'pay' ]
You can consider extending this to also look for TAB, LF, or CR as split-characters. If so, and your input text can have CRLF sequences, you would need to detect those as well to avoid getting orphaned CR or LF characters in the chunks.
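For comparison, here is a sketch of the same byte-oriented idea in Python, using bytes.rfind and bytes.find in place of the Buffer methods. It is not a line-for-line port: as labeled assumptions, it keeps a remainder that already fits in max_bytes as one chunk, and it only accepts a space at an index up to max_bytes so that no chunk found by the backward search exceeds the limit.

```python
def chunk(s, max_bytes):
    buf = s.encode("utf-8")
    result = []
    while buf:
        if len(buf) <= max_bytes:
            # Assumption: a remainder that already fits is kept whole.
            i = len(buf)
        else:
            # Last space at an index <= max_bytes, so the chunk before it
            # is at most max_bytes bytes long.
            i = buf.rfind(b" ", 0, max_bytes + 1)
            if i < 0:
                # No space in range: search forward and accept a longer chunk.
                i = buf.find(b" ", max_bytes)
            if i < 0:
                # No space at all: take everything that is left.
                i = len(buf)
        # Byte 0x20 never occurs inside a multi-byte UTF-8 sequence,
        # so the slice always decodes cleanly.
        result.append(buf[:i].decode("utf-8"))
        buf = buf[i + 1:]  # skip the space (if any)
    return result


print(chunk("Hey there! € 100 to pay", 12))
# ['Hey there!', '€ 100 to', 'pay']
```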
You can turn the above function into a generator, so that you control when you want to start the processing for getting the next chunk:
function * chunk(s, maxBytes) {
    let buf = Buffer.from(s);
    while (buf.length) {
        let i = buf.lastIndexOf(32, maxBytes+1);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take all
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        yield buf.slice(0, i).toString();
        buf = buf.slice(i+1); // Skip space (if any)
    }
}
for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);
Browsers
Buffer is specific to Node. Browsers, however, implement TextEncoder and TextDecoder, which leads to similar code:
function * chunk(s, maxBytes) {
    const decoder = new TextDecoder("utf-8");
    let buf = new TextEncoder("utf-8").encode(s);
    while (buf.length) {
        let i = buf.lastIndexOf(32, maxBytes+1);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take all
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        yield decoder.decode(buf.slice(0, i));
        buf = buf.slice(i+1); // Skip space (if any)
    }
}
for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);
Splitting a string into fixed-size chunks
If you want speed, Rcpp is always a good choice:
library(Rcpp);
cppFunction('
    List strsplitN(std::vector<std::string> v, int N ) {
        if (N < 1) throw std::invalid_argument("N must be >= 1.");
        List res(v.size());
        for (int i = 0; i < v.size(); ++i) {
            int num = v[i].size()/N + (v[i].size()%N == 0 ? 0 : 1);
            std::vector<std::string> resCur(num,std::string(N,0));
            for (int j = 0; j < num; ++j) resCur[j].assign(v[i].substr(j*N,N));
            res[i] = resCur;
        }
        return res;
    }
');
ch <- paste(rep('a',1e6),collapse='');
system.time({ res <- strsplitN(ch,2L); });
## user system elapsed
## 0.109 0.015 0.121
head(res[[1L]]); tail(res[[1L]]);
## [1] "aa" "aa" "aa" "aa" "aa" "aa"
## [1] "aa" "aa" "aa" "aa" "aa" "aa"
length(res[[1L]]);
## [1] 500000
Useful reference: http://gallery.rcpp.org/articles/strings_with_rcpp/.
More demos:
strsplitN(c('abcd','efgh'),2L);
## [[1]]
## [1] "ab" "cd"
##
## [[2]]
## [1] "ef" "gh"
##
strsplitN(c('abcd','efgh'),3L);
## [[1]]
## [1] "abc" "d"
##
## [[2]]
## [1] "efg" "h"
##
strsplitN(c('abcd','efgh'),1L);
## [[1]]
## [1] "a" "b" "c" "d"
##
## [[2]]
## [1] "e" "f" "g" "h"
##
strsplitN(c('abcd','efgh'),5L);
## [[1]]
## [1] "abcd"
##
## [[2]]
## [1] "efgh"
##
strsplitN(character(),5L);
## list()
strsplitN(c('abcd','efgh'),0L);
## Error: N must be >= 1.
There are two important caveats with the above implementation:
1: It doesn't handle NAs correctly. Rcpp seems to stringify to 'NA' when it's forced to come up with a std::string. You can easily solve this in Rland with a wrapper that replaces the offending list components with a true NA.
x <- c('a',NA); strsplitN(x,1L);
## [[1]]
## [1] "a"
##
## [[2]]
## [1] "N" "A"
##
x <- c('a',NA); ifelse(is.na(x),NA,strsplitN(x,1L));
## [[1]]
## [1] "a"
##
## [[2]]
## [1] NA
##
2: It doesn't handle multibyte characters correctly. This is a tougher problem, and would require a rewrite of the core function implementation to use a Unicode-aware traversal. Fixing this problem would also incur a significant performance penalty, since you wouldn't be able to preallocate each vector in one shot prior to the assignment loop.
strsplitN('aΩ',1L);
## [[1]]
## [1] "a" "\xce" "\xa9"
##
strsplit('aΩ','');
## [[1]]
## [1] "a" "Ω"
##
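The byte-versus-character distinction behind the second caveat is not specific to R. A short Python sketch (illustrative only, outside the Rcpp code above) shows the same effect: slicing the UTF-8 bytes cuts "Ω" in half, while slicing the decoded string works on code points.

```python
s = "aΩ"

# Splitting the UTF-8 bytes one at a time cuts "Ω" (two bytes) in half.
raw = s.encode("utf-8")
byte_chunks = [raw[i:i + 1] for i in range(len(raw))]
print(byte_chunks)  # [b'a', b'\xce', b'\xa9']

# Splitting the str itself works on code points, so "Ω" stays intact.
char_chunks = [s[i:i + 1] for i in range(len(s))]
print(char_chunks)  # ['a', 'Ω']
```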
Split large string in n-size chunks in JavaScript
You can do something like this:
"1234567890".match(/.{1,2}/g);
// Results in:
["12", "34", "56", "78", "90"]
The method will still work with strings whose size is not an exact multiple of the chunk-size:
"123456789".match(/.{1,2}/g);
// Results in:
["12", "34", "56", "78", "9"]
In general, for any string out of which you want to extract at-most n-sized substrings, you would do:
str.match(/.{1,n}/g); // Replace n with the size of the substring
If your string can contain newlines or carriage returns, you would do:
str.match(/(.|[\r\n]){1,n}/g); // Replace n with the size of the substring
As far as performance, I tried this out with approximately 10k characters and it took a little over a second on Chrome. YMMV.
This can also be used in a reusable function:
function chunkString(str, length) {
    return str.match(new RegExp('.{1,' + length + '}', 'g'));
}
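The same regular-expression approach can be sketched in Python with the re module. The name chunk_string is illustrative; re.DOTALL plays the role of the (.|[\r\n]) alternation above by letting "." match newlines, and findall returns an empty list for an empty string.

```python
import re


def chunk_string(s, length):
    # ".{1,n}" greedily takes up to n characters per match;
    # re.DOTALL lets "." match newlines too, so they are not skipped.
    pattern = re.compile(".{1,%d}" % length, re.DOTALL)
    return pattern.findall(s)


print(chunk_string("1234567890", 2))  # ['12', '34', '56', '78', '90']
print(chunk_string("123456789", 2))   # ['12', '34', '56', '78', '9']
print(chunk_string("ab\ncd", 2))      # ['ab', '\nc', 'd']
```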
Split string into strings by length?
>>> x = "qwertyui"
>>> chunks, chunk_size = len(x), len(x)//4
>>> [ x[i:i+chunk_size] for i in range(0, chunks, chunk_size) ]
['qw', 'er', 'ty', 'ui']
Split a string to even sized chunks
Use textwrap.wrap:
>>> import textwrap
>>> s = 'aaaaaaaaaaaaaaaaaaaaaaa'
>>> textwrap.wrap(s, 4)
['aaaa', 'aaaa', 'aaaa', 'aaaa', 'aaaa', 'aaa']
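One thing to keep in mind is that textwrap.wrap is a text-wrapping tool: it prefers to break at whitespace and drops that whitespace at line edges. The sketch below (an illustration, not part of the original answer) compares it with plain slicing on a string that contains spaces.

```python
import textwrap

s = "ab cd ef"

# textwrap.wrap breaks at whitespace and drops it at the line edges.
wrapped = textwrap.wrap(s, 4)
print(wrapped)  # ['ab', 'cd', 'ef']

# Plain slicing keeps every character, spaces included.
sliced = [s[i:i + 4] for i in range(0, len(s), 4)]
print(sliced)  # ['ab c', 'd ef']
```

For input without whitespace, such as the string of a's above, both approaches give the same chunks.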
Split String into smaller Strings by length variable
You need to use a loop:
public static IEnumerable<string> SplitByLength(this string str, int maxLength) {
    for (int index = 0; index < str.Length; index += maxLength) {
        yield return str.Substring(index, Math.Min(maxLength, str.Length - index));
    }
}
Alternative:
public static IEnumerable<string> SplitByLength(this string str, int maxLength) {
    int index = 0;
    while(true) {
        if (index + maxLength >= str.Length) {
            yield return str.Substring(index);
            yield break;
        }
        yield return str.Substring(index, maxLength);
        index += maxLength;
    }
}
2nd alternative (for those who can't stand while(true)):
public static IEnumerable<string> SplitByLength(this string str, int maxLength) {
    int index = 0;
    while(index + maxLength < str.Length) {
        yield return str.Substring(index, maxLength);
        index += maxLength;
    }
    yield return str.Substring(index);
}