How to Chop a String into Chunks of a Given Length in Ruby

What is the best way to chop a string into chunks of a given length in Ruby?

Use String#scan:

>> 'abcdefghijklmnopqrstuvwxyz'.scan(/.{4}/)
=> ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx"]
>> 'abcdefghijklmnopqrstuvwxyz'.scan(/.{1,4}/)
=> ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]
>> 'abcdefghijklmnopqrstuvwxyz'.scan(/.{1,3}/)
=> ["abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yz"]
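If the chunk length is only known at runtime, the same scan call can be wrapped in a small method. This is a minimal sketch: chunk_fixed is a hypothetical name, not part of Ruby, and it adds the m modifier on the assumption that newlines should count as ordinary characters.

```ruby
# Hypothetical helper: split str into pieces of at most size characters.
def chunk_fixed(str, size)
  raise ArgumentError, "size must be a positive Integer" unless size.is_a?(Integer) && size > 0

  # The m modifier lets . match "\n" too, so line breaks are not skipped.
  str.scan(/.{1,#{size}}/m)
end

chunk_fixed("abcdefghij", 4)  # => ["abcd", "efgh", "ij"]
chunk_fixed("ab\ncd", 3)      # => ["ab\n", "cd"]
chunk_fixed("", 4)            # => []
```

The lower bound of 1 keeps the final, shorter chunk without ever producing an empty string.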

How can I split a string into chunks?

The problem is that you're trying to call an Enumerable method on an object that is not enumerable (String does not include Enumerable). You can instead use scan on the string to collect groups of up to 5 characters:

arr = str.scan(/.{1,5}/)

If you want to go the enumerable route, you can first break the string up into an array of characters, take slices of 5, then join each slice back into a string:

arr = str.chars.each_slice(5).map(&:join)
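The two approaches differ when the string contains line breaks. A small comparison, using a made-up sample string:

```ruby
str = "abc\ndefgh"

# String does not include Enumerable, so each_slice is not available directly.
has_each_slice = str.respond_to?(:each_slice)   # => false

# Without the m modifier, . does not match "\n", so the line break is dropped.
by_scan = str.scan(/.{1,5}/)                    # => ["abc", "defgh"]

# chars yields every character, including "\n", so nothing is lost.
by_slice = str.chars.each_slice(5).map(&:join)  # => ["abc\nd", "efgh"]
```

If the scan version should also keep line breaks, adding the m modifier (/.{1,5}/m) makes . match them.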

Split string into chunks (of different size) without breaking words

Doesn't this do the trick?

def get_chunks(str, n = 3)
  str.scan(/^.{1,25}\b|.{1,35}\b/).first(n).map(&:strip)
end
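If the regex becomes hard to reason about, a plain loop over the words is an alternative. Note that this is a different technique from the scan call above, not a rewrite of it. In this sketch, chunks_by_limits and its limits parameter are hypothetical names; the last limit is reused for all later chunks, and a word longer than its limit is kept whole rather than split.

```ruby
# Hypothetical helper: pack whole words into chunks, where chunk i may hold
# at most limits[i] characters (the last limit applies to all later chunks).
def chunks_by_limits(str, limits)
  chunks = [""]
  str.split.each do |word|
    limit = limits.fetch(chunks.size - 1, limits.last)
    candidate = chunks.last.empty? ? word : "#{chunks.last} #{word}"
    if chunks.last.empty? || candidate.length <= limit
      chunks[-1] = candidate   # the word still fits in the current chunk
    else
      chunks << word           # start a new chunk with this word
    end
  end
  chunks.reject(&:empty?)
end

chunks_by_limits("the quick brown fox jumps over the lazy dog", [10, 15])
# => ["the quick", "brown fox jumps", "over the lazy", "dog"]
```

Because it splits on whitespace, this version normalizes every run of whitespace to a single space.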

Split string into equal slices/chunks

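What about this? (The lower bound of 1 matters: with .{,n} the pattern can also match zero characters, which leaves an empty string at the end of the result.)

string.scan(/.{1,#{L}}/)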
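The lower bound of the quantifier changes the result, because a pattern that can match zero characters also matches once more at the very end of the string. A quick comparison, where len is an assumed variable standing in for the chunk length:

```ruby
len = 3            # assumed chunk length for this example
string = "abcdefg"

# A zero lower bound allows an empty match at the end of the string.
with_zero = string.scan(/.{,#{len}}/)   # => ["abc", "def", "g", ""]

# A lower bound of 1 never matches the empty string.
with_one = string.scan(/.{1,#{len}}/)   # => ["abc", "def", "g"]
```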

Split a string into chunks of specified size without breaking words

How about:

str = "split a string into chunks according to a specific size. Seems easy enough, but here is the catch: I cannot be breaking words between chunks, so I need to catch when adding the next word will go over chunk size and start the next one (its ok if a chunk is less than specified size)." 
str.scan(/.{1,25}\W/)
=> ["split a string into ", "chunks according to a ", "specific size. Seems easy ", "enough, but here is the ", "catch: I cannot be ", "breaking words between ", "chunks, so I need to ", "catch when adding the ", "next word will go over ", "chunk size and start the ", "next one (its ok if a ", "chunk is less than ", "specified size)."]

Update after @sawa's comment:

str.scan(/.{1,25}\b|.{1,25}/).map(&:strip)

This is better, as it doesn't require the string to end with \W.

It also handles words longer than the specified length. It actually splits them, but I assume this is the desired behaviour.
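To see both behaviours with a smaller limit, the pattern can be wrapped in a method. This is only a sketch: word_chunks is a hypothetical name, and the final reject is an addition that drops chunks consisting only of whitespace.

```ruby
# Hypothetical helper: chunks of at most max characters, preferring to
# break at word boundaries and splitting only words longer than max.
def word_chunks(str, max)
  str.scan(/.{1,#{max}}\b|.{1,#{max}}/)
     .map(&:strip)
     .reject(&:empty?)
end

word_chunks("hello wonderful world", 10)
# => ["hello", "wonderful", "world"]

# A 20-letter word cannot fit, so the second alternative splits it.
word_chunks("supercalifragilistic go", 10)
# => ["supercalif", "ragilistic", "go"]
```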

Split string into chunks of maximum character count without breaking words

This is what worked for me (thanks to @StefanPochmann's comments):

text = "Some really long string\nwith some line breaks"

The following first collapses every run of whitespace (including line breaks) into a single space, then breaks the string up:

text.gsub(/\s+/, ' ').scan(/.{1,2000}(?: |$)/).map(&:strip)

The resulting chunks lose all the line breaks (\n) from the original string. If you need to keep them, replace each one with a placeholder that does not otherwise occur in the text, for example (br), before applying the regex, and use it to restore the line breaks later. Like this:

text = "Some really long string\nwith some line breaks".gsub("\n", "(br)")

After running the regex, restore the line breaks in the new chunks by replacing every occurrence of (br) with \n:

chunks = text.gsub(/\s+/, ' ').scan(/.{1,2000}(?: |$)/).map(&:strip)
chunks.each { |chunk| chunk.gsub!('(br)', "\n") }

It looks like a long process, but it worked for me.
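The steps can be folded into one method. This is a sketch under a few assumptions: chunks_keeping_breaks is a hypothetical name, the placeholder must not occur in the input, and each placeholder counts as several characters toward the limit, so chunks may come out slightly shorter than max.

```ruby
# Hypothetical helper: word-aware chunks that keep the original line breaks.
# placeholder is an assumed token that must not appear in text.
def chunks_keeping_breaks(text, max, placeholder = "(br)")
  text.gsub("\n", placeholder)       # protect line breaks
      .gsub(/\s+/, " ")              # collapse remaining whitespace runs
      .scan(/.{1,#{max}}(?: |$)/)    # break at a space or at the end
      .map { |chunk| chunk.strip.gsub(placeholder, "\n") }
end

chunks_keeping_breaks("Some really long string\nwith some line breaks", 30)
# => ["Some really long", "string\nwith some line", "breaks"]
```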

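Chop a string in Ruby into fixed-length strings, ignoring newline and space characters

Use the m modifier so that . also matches newline characters; spaces and line breaks are then counted like any other character: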


"This is some\nText\nThis is some text".scan(/.{1,17}/m)
# => ["This is some\nText", "\nThis is some tex", "t"]

Ruby: Split a string into substring of maximum 40 characters

Your first attempt:

sentence[0..40].gsub(/\s\w+$/,'')

almost works, but it has one fatal flaw: you take a fixed number of characters first, and only then cut off the last word. This means you have no way of knowing whether the bit being trimmed off was a whole word or a partial word. (Note also that the inclusive range 0..40 selects up to 41 characters, not 40.)

Because of this, your code will always cut off the last word.

I would solve the problem as follows:

sentence[/\A.{0,39}[a-z]\b/mi]
  • \A is an anchor to fix the regex to the start of the string.
  • .{0,39}[a-z] matches 1 to 40 characters, where the last character must be a letter. This is to prevent the last selected character from being punctuation or a space. (Is that desired behaviour? Your question didn't really specify. Feel free to tweak/remove that [a-z] part, e.g. [a-z.] to match a full stop, if desired.)
  • \b is a word boundary: a zero-width assertion that matches at the beginning or end of a word.
  • The i modifier makes the match case insensitive (so [a-z] also covers A-Z), and the m modifier makes . match newline characters as well.

One very minor note: because this regex must match at least 1 character (rather than zero), it is possible to get a nil result. (This is seemingly very unlikely, since you'd need, for example, a string that begins with a single word of 41+ letters!) To account for this edge case, call .to_s on the result if needed.
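A short sketch of the pattern in use, including the nil edge case. Here first_chunk is a hypothetical wrapper name, and the sample sentences are made up:

```ruby
# Hypothetical helper: the leading part of sentence, at most 40 characters,
# ending on a letter at a word boundary. Returns "" when nothing matches.
def first_chunk(sentence)
  sentence[/\A.{0,39}[a-z]\b/mi].to_s
end

first_chunk("The quick brown fox jumps over the lazy dog and keeps running")
# => "The quick brown fox jumps over the lazy"

first_chunk("Hello, world!")  # => "Hello, world"

# A single 41-letter word has no word boundary within the first 40 characters.
first_chunk("a" * 41)         # => ""
```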


Update: Thank you for the improved edit to your question, providing a concrete example of an input/result. This makes it much clearer what you are asking for, as the original post was somewhat ambiguous.

You could solve this with something like the following:

sentence.scan(/.{0,39}[a-z.!?,;](?:\b|$)/mi)
  • String#scan returns an array of strings that match the pattern - so you can then re-join these strings to reconstruct the original.
  • Again, I have added a few more characters (!?,;) to the list of "final characters in the substring". Feel free to tweak this as desired.
  • (?:\b|$) means "either a word boundary, or the end of the line". This fixes the issue of the result not including the final . in the substrings. Note that I have used a non-capturing group (?:...) because, when the pattern contains a capturing group, String#scan returns the captured groups instead of the whole match.
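The effect of the non-capturing group, and the pattern applied to a made-up sentence, can be checked like this:

```ruby
# With a capturing group, scan returns the group contents, not the matches.
captured = "ab cd".scan(/\w+(\b|$)/)    # => [[""], [""]]
plain    = "ab cd".scan(/\w+(?:\b|$)/)  # => ["ab", "cd"]

sentence = "The quick brown fox jumps over the lazy dog and keeps running."
chunks = sentence.scan(/.{0,39}[a-z.!?,;](?:\b|$)/mi)
# => ["The quick brown fox jumps over the lazy", " dog and keeps running."]
```

Later chunks keep their leading space, so for this input chunks.join gives back the original sentence; add .map(&:strip) if the whitespace is not wanted.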

