Converting White Space into Line Break

Replace all whitespace with a line break/paragraph mark to make a word list

For reasonably modern versions of sed, edit the standard input to yield the standard output with

$ echo 'τέχνη βιβλίο γη κήπος' | sed -E -e 's/[[:blank:]]+/\n/g'
τέχνη
βιβλίο
γη
κήπος

If your vocabulary words are in files named lesson1 and lesson2, redirect sed’s standard output to the file all-vocab with

sed -E -e 's/[[:blank:]]+/\n/g' lesson1 lesson2 > all-vocab

What it means:

  • The character class [[:blank:]] matches either a single space character or
    a single tab character.

    • Use [[:space:]] instead to match any single whitespace character (commonly space, tab, newline, carriage return, form-feed, and vertical tab).
    • The + quantifier means match one or more of the previous pattern.
    • So [[:blank:]]+ is a sequence of one or more characters that are all space or tab.
  • The \n in the replacement is the newline that you want.
  • The /g modifier on the end means perform the substitution as many times as possible rather than just once.
  • The -E option tells sed to use POSIX extended regex syntax and in particular for this case the + quantifier. Without -E, your sed command becomes sed -e 's/[[:blank:]]\+/\n/g'. (Note the use of \+ rather than simple +.)
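
For readers more at home in Python, the difference between the two character classes can be sketched with the standard re module. This is an illustration rather than part of the sed recipe: the function names are hypothetical, and because Python's re has no POSIX classes, [ \t] stands in for [[:blank:]] and \s roughly stands in for [[:space:]].

```python
import re

def split_on_blanks(text: str) -> str:
    # [ \t]+ plays the role of [[:blank:]]+ : runs of spaces or tabs only.
    return re.sub(r"[ \t]+", "\n", text)

def split_on_whitespace(text: str) -> str:
    # \s+ plays the role of [[:space:]]+ : runs of any whitespace, newlines included.
    return re.sub(r"\s+", "\n", text)

sample = "τέχνη  βιβλίο\n\nγη\tκήπος"
print(split_on_blanks(sample))      # keeps the empty line between βιβλίο and γη
print(split_on_whitespace(sample))  # one word per line, no empty lines
```

The blank-only version leaves existing line breaks alone, which matters when the input already contains empty lines you want to keep.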

Perl Compatible Regexes

For those familiar with Perl-style regexes, a sed that supports the \s shorthand as an extension (GNU sed, for example) lets you use \s+ to match runs of at least one whitespace character, as in

sed -E -e 's/\s+/\n/g' old > new

or

sed -e 's/\s\+/\n/g' old > new

These commands read input from the file old and write the result to a file named new in the current directory.

Maximum portability, maximum cruftiness

Going back to almost any version of sed since Version 7 Unix, the command invocation is a bit more baroque.

$ echo 'τέχνη βιβλίο γη κήπος' | sed -e 's/[ \t][ \t]*/\
/g'
τέχνη
βιβλίο
γη
κήπος

Notes:

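  • Here we do not even assume the existence of the humble + quantifier and simulate it with a single space-or-tab ([ \t]) followed by zero or more of them ([ \t]*). Be aware that some sed implementations do not treat \t inside a bracket expression as a tab; with those, type a literal tab character there instead.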
  • Similarly, assuming sed does not understand \n for newline, we have to include it on the command line verbatim.

    • The \ at the end of the first line of the command is a continuation marker that escapes the immediately following newline, and the remainder of the command is on the next line.

      • Note: There must be no whitespace preceding the escaped newline. That is, the end of the first line must be exactly backslash followed by end-of-line.
    • This error-prone process helps one appreciate why the world moved to visible characters, and you will want to exercise some care in trying out the command with copy-and-paste.
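
The claim that "one, then zero or more" is the same as "one or more" can be checked with a small Python sketch. This uses Python's re module, where \t inside brackets does mean tab, purely as an illustration; the sample strings are made up.

```python
import re

plus_form = re.compile(r"[ \t]+")
star_form = re.compile(r"[ \t][ \t]*")  # one blank, then zero or more blanks

# Both patterns should produce identical substitutions on every sample.
for sample in ["a b", "a \t  b", "a\tb c", "ab"]:
    assert plus_form.sub("\n", sample) == star_form.sub("\n", sample)

print(repr(star_form.sub("\n", "a \t  b")))
```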

Note on backslashes and quoting

The commands above all used single quotes ('') rather than double quotes (""). Consider:

$ echo '\\\\' "\\\\"
\\\\ \\

That is, the shell applies different escaping rules to single-quoted strings as compared with double-quoted strings. You typically want to protect all the backslashes common in regexes with single quotes.

White-Space Preservation, Line Break Ignoring

I'm sorry, but I'm not sure if there's an easy CSS solution to this. It might just be easier to use JavaScript:

HTML:
<p id = "code">[stuff]</p>
<!-- Set the CSS white-space property to pre-wrap. -->

JavaScript:
var codeElem = document.getElementById("code");
codeElem.innerHTML = codeElem.innerHTML.split("\n").join(" ↵ ");

How to remove white space, line breaks etc from a string in python

You can use the expression in re.sub:

(?:[;\n']|\s{2,})
  • (?: Non capturing group

    • [;\n'] Characters ; , \n and '.
    • | Or
    • \s{2,} Whitespace, two or more.
  • ) Close non capturing group.

Python code:

import re
mystr = "\n', ' var [3:0] apple [1:0];\n', ' int mango;\n', ' float banana [5:0];\n', ' int lichi;\n', ' "

print(re.sub(r"(?:[;\n']|\s{2,})",r'',mystr)[2:])

Prints the desired output:

var [3:0] apple [1:0], int mango, float banana [5:0], int lichi, 
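
Spaces that are separated only by a quote become adjacent once the quote is deleted, so a single pass can leave doubled spaces in the result. A minimal two-pass sketch, offered as an alternative rather than the method above (the helper name clean is hypothetical), deletes the unwanted characters first and collapses whitespace runs afterwards:

```python
import re

mystr = "\n', ' var [3:0] apple [1:0];\n', ' int mango;\n', ' float banana [5:0];\n', ' int lichi;\n', ' "

def clean(text: str) -> str:
    # Pass 1: delete semicolons, newlines and single quotes.
    text = re.sub(r"[;\n']", "", text)
    # Pass 2: collapse runs of two or more whitespace characters to one space.
    text = re.sub(r"\s{2,}", " ", text)
    # Drop the leading ", " left over from the first list separator.
    return text[2:]

print(clean(mystr))
```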

Render a string in HTML and preserve spaces and linebreaks

Just style the content with white-space: pre-wrap;.

div {
    white-space: pre-wrap;
}

<div>This is some text   with some extra spacing    and a
few newlines along with some trailing spaces        
     and five leading spaces thrown in
for                                              good
measure                                              </div>

Regex Python - Replace any combination of line breaks, tabs, spaces, by single space

Try using \s, which matches all whitespace characters. Write the pattern as a raw string so that Python does not try to interpret the backslash itself.

>>> import re
>>> s = 'Copyright ©\n\t\t\t\n\t\t\t2019\n\t\t\tApple Inc. All rights reserved.'
>>> s = re.sub(r"\s+", " ", s)
>>> s
'Copyright © 2019 Apple Inc. All rights reserved.'
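
When the goal is only to collapse whitespace, str.split with no arguments followed by " ".join is a regex-free alternative. This is a sketch of a different technique, not the method above; unlike the re.sub call, it also drops any leading and trailing whitespace.

```python
s = 'Copyright ©\n\t\t\t\n\t\t\t2019\n\t\t\tApple Inc. All rights reserved.'

# split() with no arguments splits on runs of any whitespace and
# discards empty strings, so join() puts exactly one space between words.
collapsed = " ".join(s.split())
print(collapsed)
```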

How can I replace newlines/line breaks with spaces in javascript?

You can use the .replace() function:

words = words.replace(/\n/g, " ");

Note that you need the g flag on the regular expression to get replace to replace all the newlines with a space rather than just the first one.

Also, note that you have to assign the result of .replace() to a variable because it returns a new string; it does not modify the existing string. Strings in JavaScript are immutable, so any operation on a string, such as .slice(), .concat(), or .replace(), returns a new string.

let words = "a\nb\nc\nd\ne";
console.log("Before:");
console.log(words);
words = words.replace(/\n/g, " ");

console.log("After:");
console.log(words);
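
For comparison, Python strings are immutable in the same way, so the same "assign the result" rule applies there. A minimal sketch (the variable name joined is only for illustration):

```python
words = "a\nb\nc\nd\ne"

# str.replace replaces every occurrence and returns a new string;
# the original string object is left untouched.
joined = words.replace("\n", " ")

print(repr(words))
print(repr(joined))
```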

Prevent/workaround browser converting '\n' between lines into space (for Chinese characters)

Browsers treat newlines as spaces because the specifications have said so ever since HTML 2.0. In fact, HTML 2.0 was milder than later specifications; it said: “An HTML user agent should treat end of line in any of its variations as a word space in all contexts except preformatted text.” (Conventional Representation of Newlines), whereas newer specifications state this more strongly (describing it as what happens in HTML).

The background is that HTML and the Web were developed with mainly Western European languages in mind; this is reflected in many features of the original specifications and early implementations. Only slowly have they been internationalized.

It is unlikely that the parsing rules will be changed. More likely, rendering might become sensitive to language or character properties. This would mean that a line break still gets taken as a space (and the DOM string will contain an ASCII space character), but a string like 这是 一句话。 would be rendered as if the space were not there. This is what the HTML 4.01 specification seems to refer to (White space). The text is somewhat confused, but I think it tries to say that the behavior would depend on the content language, either inferred by the browser or as declared in markup.

But browsers don’t do such things yet. Declaring the language of content, e.g. <html lang=zh>, is a good principle but has little practical impact—in rendering, it may affect the browser’s choice of a default font (but how many authors let browsers use their default fonts?). It may even result in added spacing, if the space character happens to be wider in the browser’s default font for the language specified.

According to the CSS3 Text draft, you could use the text-spacing property. The value none “Turns off all text-spacing features. All fullwidth characters are set with full-width glyphs.” Unfortunately, no browser seems to support this yet.

Replace tabs and spaces with a single space as well as carriage returns and newlines with a single newline

First, I'd like to point out that new lines can be either \r, \n, or \r\n depending on the operating system.

My solution:

echo preg_replace('/[ \t]+/', ' ', preg_replace('/[\r\n]+/', "\n", $string));

Which could be separated into 2 lines if necessary:

$string = preg_replace('/[\r\n]+/', "\n", $string);
echo preg_replace('/[ \t]+/', ' ', $string);

Update:

An even better solution would be this one:

echo preg_replace('/[ \t]+/', ' ', preg_replace('/\s*$^\s*/m', "\n", $string));

Or:

$string = preg_replace('/\s*$^\s*/m', "\n", $string);
echo preg_replace('/[ \t]+/', ' ', $string);

I've changed the regular expression that collapses multiple line breaks into a single one. It uses the "m" modifier (which makes ^ and $ match at the start and end of each line) and removes any \s characters (space, tab, newline, carriage return) that are at the end of one line and the beginning of the next. This solves the problem of empty lines that contain nothing but spaces. With my previous example, a line filled with spaces would have been left behind as an extra line.
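
The two-step idea, and the problem with whitespace-only lines, can be sketched in Python. The first function mirrors the original pair of substitutions; the second uses a different, simpler pattern than the PHP one above and is only an assumption about what "better" should mean here. Both function names are hypothetical.

```python
import re

def collapse_simple(text: str) -> str:
    """First approach: line-break runs, then space/tab runs."""
    text = re.sub(r"[\r\n]+", "\n", text)
    return re.sub(r"[ \t]+", " ", text)

def collapse_blank_lines_too(text: str) -> str:
    """Variant: also absorb whitespace-only lines between line breaks."""
    # Optional spaces/tabs, at least one line-break character, then any
    # whitespace. This also strips indentation at the start of the next line.
    text = re.sub(r"[ \t]*[\r\n]+\s*", "\n", text)
    return re.sub(r"[ \t]+", " ", text)

sample = "one\t two\r\n   \r\nthree"
print(repr(collapse_simple(sample)))           # a line holding one space survives
print(repr(collapse_blank_lines_too(sample)))  # the whitespace-only line is gone
```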


