Split a String into Chunks of Specified Size Without Breaking Words

Split string into chunks (of different size) without breaking words

Doesn't this do the trick?

def get_chunks(str, n = 3)
  str.scan(/^.{1,25}\b|.{1,35}\b/).first(n).map(&:strip)
end

Split a string into chunks of specified size without breaking words

How about:

str = "split a string into chunks according to a specific size. Seems easy enough, but here is the catch: I cannot be breaking words between chunks, so I need to catch when adding the next word will go over chunk size and start the next one (its ok if a chunk is less than specified size)." 
str.scan(/.{1,25}\W/)
=> ["split a string into ", "chunks according to a ", "specific size. Seems easy ", "enough, but here is the ", "catch: I cannot be ", "breaking words between ", "chunks, so I need to ", "catch when adding the ", "next word will go over ", "chunk size and start the ", "next one (its ok if a ", "chunk is less than ", "specified size)."]

Update after @sawa's comment:

str.scan(/.{1,25}\b|.{1,25}/).map(&:strip)

This is better because it doesn't require the string to end with a \W character.

It also handles words longer than the specified length: the second alternative splits them into pieces of at most 25 characters, which I assume is the desired behaviour.
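To make the two alternatives of that regex easier to see, here is a minimal sketch that wraps the same pattern in a method. The name `chunk_words` and its `size` parameter are hypothetical and not part of the answer; the only change is that the chunk size is interpolated instead of hard-coded.

```ruby
# Hypothetical helper (not from the answer): same regex, with the
# chunk size interpolated instead of hard-coded as 25.
def chunk_words(text, size = 25)
  # First alternative: up to `size` characters ending at a word boundary.
  # Second alternative: fallback for a single word longer than `size`.
  text.scan(/.{1,#{size}}\b|.{1,#{size}}/).map(&:strip)
end

p chunk_words("split a string into chunks", 15)
# => ["split a string", "into chunks"]
p chunk_words("abcdefghijkl", 5)
# => ["abcde", "fghij", "kl"]
```

Note that `.` does not match a newline by default, so this sketch assumes single-line input.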

chunk/split a string in Javascript without breaking words

Something like this?

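var n = 80;

// Advance to the first space at or after index 80, stopping at the end of
// the input so the loop cannot run forever when no space remains.
while (n < input.length) {
  if (input[n++] == ' ') {
    break;
  }
}

output = input.substring(0, n).split(' ');
console.log(output);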

UPDATED

Now that I re-read the question, here's an updated solution:

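var len = 80;
var curr = len;
var prev = 0;

var output = [];

// Each chunk ends at the first space at or after len characters,
// so a chunk can run slightly longer than len.
while (input[curr]) {
  if (input[curr++] == ' ') {
    output.push(input.substring(prev, curr));
    prev = curr;
    curr += len;
  }
}
output.push(input.substring(prev));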
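The loop above cuts at the first space at or after the limit, so a chunk can exceed it. If the limit has to be a hard maximum, one alternative is to accumulate words greedily. This is a minimal sketch, written in Ruby to match the earlier answers rather than JavaScript; `wrap_words` and `limit` are hypothetical names, and it assumes a single word longer than the limit should be kept whole rather than split.

```ruby
# Hypothetical helper: greedy word wrap. A chunk never exceeds `limit`
# unless a single word is itself longer than `limit`.
def wrap_words(input, limit = 80)
  chunks = []
  line = []    # words collected for the current chunk
  length = 0   # length of the current chunk once joined with spaces

  input.split.each do |word|
    needed = line.empty? ? word.length : length + 1 + word.length
    if needed > limit && !line.empty?
      # Adding this word would overflow: close the chunk, start a new one.
      chunks << line.join(' ')
      line = [word]
      length = word.length
    else
      line << word
      length = needed
    end
  end

  chunks << line.join(' ') unless line.empty?
  chunks
end

p wrap_words("the quick brown fox jumps over the lazy dog", 10)
# => ["the quick", "brown fox", "jumps over", "the lazy", "dog"]
```

Because the input is split on whitespace and rejoined with single spaces, runs of spaces and line breaks are not preserved.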

Split string into chunks of maximum character count without breaking words

This is what worked for me (thanks to @StefanPochmann's comments):

text = "Some really long string\nwith some line breaks"

The following first collapses every run of whitespace (including line breaks) into a single space, then breaks the string up.

text.gsub(/\s+/, ' ').scan(/.{1,2000}(?: |$)/).map(&:strip)

The resulting chunks lose all the line breaks (\n) from the original string. If you need to keep them, replace each line break with a placeholder before applying the regex, and restore the line breaks afterwards. The placeholder can be anything that does not already occur in the text, for example (br). Like this:

text = "Some really long string\nwith some line breaks".gsub("\n", "(br)")

After we run the regex, we can restore the line breaks for the new chunks by replacing all occurrences of (br) with \n like this:

chunks = text.gsub(/\s+/, ' ').scan(/.{1,2000}(?: |$)/).map(&:strip)
chunks.each{|chunk| chunk.gsub!('(br)', "\n")}

Looks like a long process but it worked for me.
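The steps above can be folded into one method. This is only a sketch: `chunk_keeping_breaks`, `limit`, and the `placeholder:` keyword are hypothetical names, and it assumes the placeholder never appears in the original text.

```ruby
# Hypothetical helper combining the steps: mark line breaks, collapse
# whitespace, split at spaces, then restore the line breaks per chunk.
# Assumes `placeholder` does not occur in the original text.
def chunk_keeping_breaks(text, limit = 2000, placeholder: "(br)")
  text.gsub("\n", placeholder)
      .gsub(/\s+/, ' ')
      .scan(/.{1,#{limit}}(?: |$)/)
      .map { |chunk| chunk.strip.gsub(placeholder, "\n") }
end

text = "Some really long string\nwith some line breaks"
p chunk_keeping_breaks(text, 30)
# => ["Some really long", "string\nwith some line", "breaks"]
```

Two things to keep in mind: the placeholder counts toward the limit while the regex runs, and the words on either side of a line break are treated as one unbroken token, so a chunk boundary never falls on a line break.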

Splitting a string into chunks of a certain size

static IEnumerable<string> Split(string str, int chunkSize)
{
    return Enumerable.Range(0, str.Length / chunkSize)
        .Select(i => str.Substring(i * chunkSize, chunkSize));
}

Please note that additional code might be required to gracefully handle edge cases: a null or empty input string, chunkSize == 0, an input string whose length is not divisible by chunkSize (the integer division above then drops the trailing partial chunk), and so on. The original question doesn't specify any requirements for these edge cases, and in real life the requirements might vary, so they are out of scope of this answer.
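To illustrate one way those edge cases could be handled, here is a sketch in Ruby (to match the earlier answers; it is not the C# answer's own code). The name `fixed_chunks` is hypothetical, and the choices made here (return an empty array for nil or empty input, raise on a non-positive size, keep the trailing partial chunk) are assumptions, not requirements from the question.

```ruby
# Hypothetical helper: fixed-size chunks with explicit edge-case handling.
def fixed_chunks(str, chunk_size)
  raise ArgumentError, "chunk_size must be positive" unless chunk_size.positive?
  return [] if str.nil? || str.empty?

  # each_slice keeps the final, shorter group instead of dropping it.
  str.chars.each_slice(chunk_size).map(&:join)
end

p fixed_chunks("abcdefgh", 3)
# => ["abc", "def", "gh"]
```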

Splitting a string into chunks of a certain nth sizes

What about this:

string t1 = str.Substring(0, 2);
string t2 = str.Substring(2, 2);
string t3 = str.Substring(4, 4);
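The same idea generalises to any list of sizes by tracking a running offset. A minimal sketch in Ruby (to match the earlier answers); `split_by_sizes` is a hypothetical name, and returning an empty string for a piece that starts past the end of the input is an assumption.

```ruby
# Hypothetical helper: cut `str` into consecutive pieces whose lengths
# are given by `sizes`, e.g. [2, 2, 4].
def split_by_sizes(str, sizes)
  offset = 0
  sizes.map do |size|
    # str[offset, size] is nil once offset is past the end of the string.
    piece = str[offset, size] || ""
    offset += size
    piece
  end
end

p split_by_sizes("abcdefgh", [2, 2, 4])
# => ["ab", "cd", "efgh"]
```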

Split String into given length but do not split word

Assuming you have a table of addresses, I'd use a recursive CTE.

On each iteration, find the last possible space to break on, then start the next iteration for the character after the space.

  • take 31 characters
  • reverse them
  • find the position of the first space

Extra care must be taken with two corner cases:

  • The remaining string to be searched is less than 30 characters
  • The current string being searched has no space in the first 31 characters

Using the following test data...

CREATE TABLE test (
    address VARCHAR(MAX)
);

INSERT INTO
    test
VALUES
    ('216 Apartment123 AreaArea SampleWord1 Word2 MiddleTown Upper1Location Another5 NewYork'),
    ('216 Apartment123 AreaArea SampleWord1 Word2 MiddleTownxx Upper1LocationUpper1LocationUpper1Location Another5 NewYork'),
    ('216 Apartment123 AreaArea SampleWord1 Word2 MiddleTownxx Upper1LocationUpper1LocationUpper1Location Another5 NewYork x')
;

Using the following CTE...

DECLARE @chars BIGINT = 30;

WITH
    parts AS
(
    SELECT
        address,
        LEN(address) AS length,
        CAST(0 AS BIGINT) AS last_space,
        CAST(1 AS BIGINT) AS next,
        address AS fragment
    FROM
        test

    UNION ALL

    SELECT
        parts.address,
        parts.length,
        last_space.pos,
        parts.next + COALESCE(last_space.pos, @chars),
        SUBSTRING(parts.address, parts.next, COALESCE(last_space.pos - 1, @chars))
    FROM
        parts
    CROSS APPLY
    (
        SELECT
            @chars + 2
            -
            NULLIF(
                CHARINDEX(
                    ' ',
                    REVERSE(
                        SUBSTRING(
                            parts.address + ' ',
                            parts.next,
                            @chars + 1
                        )
                    )
                )
                , 0
            )
    )
        last_space(pos)
    WHERE
        parts.next <= parts.length
)

SELECT
    *, len(fragment) AS chars
FROM
    parts
WHERE
    next > 1
ORDER BY
    address,
    next

https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=acd11f2bc73e5036bd82498ecf14b08f


