Why is variable1 += variable2 much faster than variable1 = variable1 + variable2?

This isn't about in-place += versus the binary + operator. You didn't tell us the whole story. Your original version concatenated 3 strings, not just two:

sTable = sTable + '\n' + sRow  # simplified, sRow is a function call

Python tries to help out and optimises string concatenation, both for strobj += otherstrobj and strobj = strobj + otherstrobj, but it cannot apply this optimisation when more than two strings are involved.

Python strings are normally immutable, but if there are no other references to the left-hand string object and it is being rebound anyway, then Python cheats and mutates the string. This avoids having to create a new string each time you concatenate, and that can lead to a big speed improvement.
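To make the effect concrete, here is a hedged sketch (the function names are mine, and absolute timings vary by CPython build) contrasting the optimisable two-operand form with a three-operand expression:

```python
def append_inplace(n):
    s = ''
    for _ in range(n):
        s += 'x'              # simple store target: CPython can resize s in place
    return s

def append_three_operand(n):
    s = ''
    for _ in range(n):
        s = s + 'x' + ''      # the first + feeds another +, so s must be copied
    return s

# Both build the same string; on CPython the first stays near-linear in n,
# while the second degrades toward quadratic as the string grows.
```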

This is implemented in the bytecode evaluation loop. Both when using BINARY_ADD on two strings and when using INPLACE_ADD on two strings, Python delegates the concatenation to a special helper function, string_concatenate(). To be able to optimize the concatenation by mutating the string, it first needs to make sure that the string has no other references to it; if only the stack and the original variable reference it, then this can be done, since the next operation is going to replace the original variable reference anyway.

So if there are just 2 references to the string, and the next opcode is one of STORE_FAST (set a local variable), STORE_DEREF (set a variable referenced by closed-over functions) or STORE_NAME (set a name in the current namespace, e.g. a global), and the affected variable currently references the same string, then that target variable is cleared, reducing the number of references to just 1: the stack.
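You can see that add-then-store pairing by disassembling a bare s += t (a quick sketch; opcode names vary across CPython versions, and on 3.11+ the in-place add shows up as BINARY_OP):

```python
import dis
import io

# Capture the disassembly of a bare augmented assignment. The in-place add
# opcode (INPLACE_ADD on older CPython, BINARY_OP on 3.11+) is immediately
# followed by the STORE_NAME that rebinds s -- exactly the pattern the
# interpreter checks for before mutating the string in place.
buf = io.StringIO()
dis.dis(compile("s += t", "<stdin>", "exec"), file=buf)
bytecode = buf.getvalue()
print(bytecode)
```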

And this is why your original code could not use this optimization fully. The first part of your expression is sTable + '\n' and the next operation is another BINARY_ADD:

>>> import dis
>>> dis.dis(compile(r"sTable = sTable + '\n' + sRow", '<stdin>', 'exec'))
  1           0 LOAD_NAME                0 (sTable)
              3 LOAD_CONST               0 ('\n')
              6 BINARY_ADD
              7 LOAD_NAME                1 (sRow)
             10 BINARY_ADD
             11 STORE_NAME               0 (sTable)
             14 LOAD_CONST               1 (None)
             17 RETURN_VALUE

The first BINARY_ADD is followed by a LOAD_NAME to access the sRow variable, not a store operation. This first BINARY_ADD must always produce a new string object, one that grows ever larger as sTable grows, so it takes more and more time to create.

You changed this code to:

sTable += '\n%s' % sRow

which removed the second concatenation. Now the bytecode is:

>>> dis.dis(compile(r"sTable += '\n%s' % sRow", '<stdin>', 'exec'))
  1           0 LOAD_NAME                0 (sTable)
              3 LOAD_CONST               0 ('\n%s')
              6 LOAD_NAME                1 (sRow)
              9 BINARY_MODULO
             10 INPLACE_ADD
             11 STORE_NAME               0 (sTable)
             14 LOAD_CONST               1 (None)
             17 RETURN_VALUE

and all we have left is an INPLACE_ADD followed by a store. Now sTable can be altered in place, without producing an ever-larger new string object each time.

You'd have gotten the same speed improvement with:

sTable = sTable + ('\n%s' % sRow)

A time trial shows the difference:

>>> import random
>>> from timeit import timeit
>>> testlist = [''.join([chr(random.randint(48, 127)) for _ in range(random.randrange(10, 30))]) for _ in range(1000)]
>>> def str_threevalue_concat(lst):
...     res = ''
...     for elem in lst:
...         res = res + '\n' + elem
...
>>> def str_twovalue_concat(lst):
...     res = ''
...     for elem in lst:
...         res = res + ('\n%s' % elem)
...
>>> timeit('f(l)', 'from __main__ import testlist as l, str_threevalue_concat as f', number=10000)
6.196403980255127
>>> timeit('f(l)', 'from __main__ import testlist as l, str_twovalue_concat as f', number=10000)
2.3599119186401367

The moral of this story is that you should not be using string concatenation in the first place. The proper way to build a new string from loads of other strings is to use a list, then use str.join():

table_rows = []
for something in something_else:
    table_rows += ['\n', GetRow()]
sTable = ''.join(table_rows)

This is faster still:

>>> def str_join_concat(lst):
...     res = ''.join(['\n%s' % elem for elem in lst])
...
>>> timeit('f(l)', 'from __main__ import testlist as l, str_join_concat as f', number=10000)
1.7978830337524414

but you cannot beat using just '\n'.join(lst):

>>> def nl_join_concat(lst):
...     res = '\n'.join(lst)
...
>>> timeit('f(l)', 'from __main__ import testlist as l, nl_join_concat as f', number=10000)
0.23735499382019043

Why is if (variable1 % variable2 == 0) inefficient?

You are measuring the OSR (on-stack replacement) stub.

An OSR stub is a special version of a compiled method, intended specifically for transferring execution from interpreted mode to compiled code while the method is running.

OSR stubs are not as optimized as regular methods, because they need a frame layout compatible with the interpreted frame. I have shown this already in the following answers: 1, 2, 3.

A similar thing happens here, too. While the "inefficient code" is running its long loop, the method is compiled specially for on-stack replacement right inside the loop. The state is transferred from the interpreted frame to the OSR-compiled method, and this state includes the progressCheck local variable. At this point the JIT cannot replace the variable with a constant, and thus cannot apply certain optimizations like strength reduction.

In particular this means JIT does not replace integer division with multiplication. (See Why does GCC use multiplication by a strange number in implementing integer division? for the asm trick from an ahead-of-time compiler, when the value is a compile-time constant after inlining / constant-propagation, if those optimizations are enabled. An integer literal right in the % expression also gets optimized by gcc -O0, similar to here where it's optimized by the JITer even in an OSR stub.)
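As a rough Python sketch of what that strength reduction buys (the 48-bit shift and the MAGIC constant are illustrative choices of mine, not what HotSpot literally emits), division by the constant 50000 can be replaced with a multiply and a shift:

```python
D = 50000
SHIFT = 48
MAGIC = (1 << SHIFT) // D + 1   # ceil(2**SHIFT / D): a fixed-point reciprocal

def div_by_const(n):
    # Multiply by the reciprocal and shift instead of dividing.
    # The rounding error stays below one unit for 0 <= n < 2**32,
    # so this matches n // 50000 over the whole 32-bit range.
    return (n * MAGIC) >> SHIFT
```

Multiplication and a shift are far cheaper than an integer divide on most CPUs, which is exactly why the JIT wants the divisor to be a known constant.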

However, if you run the same method several times, the second and the subsequent runs will execute the regular (non-OSR) code, which is fully optimized. Here is a benchmark to prove the theory (benchmarked using JMH):

@State(Scope.Benchmark)
public class Div {

    @Benchmark
    public void divConst(Blackhole blackhole) {
        long startNum = 0;
        long stopNum = 100000000L;

        for (long i = startNum; i <= stopNum; i++) {
            if (i % 50000 == 0) {
                blackhole.consume(i);
            }
        }
    }

    @Benchmark
    public void divVar(Blackhole blackhole) {
        long startNum = 0;
        long stopNum = 100000000L;
        long progressCheck = 50000;

        for (long i = startNum; i <= stopNum; i++) {
            if (i % progressCheck == 0) {
                blackhole.consume(i);
            }
        }
    }
}

And the results:

# Benchmark: bench.Div.divConst

# Run progress: 0,00% complete, ETA 00:00:16
# Fork: 1 of 1
# Warmup Iteration 1: 126,967 ms/op
# Warmup Iteration 2: 105,660 ms/op
# Warmup Iteration 3: 106,205 ms/op
Iteration 1: 105,620 ms/op
Iteration 2: 105,789 ms/op
Iteration 3: 105,915 ms/op
Iteration 4: 105,629 ms/op
Iteration 5: 105,632 ms/op

# Benchmark: bench.Div.divVar

# Run progress: 50,00% complete, ETA 00:00:09
# Fork: 1 of 1
# Warmup Iteration 1: 844,708 ms/op <-- much slower!
# Warmup Iteration 2: 105,893 ms/op <-- as fast as divConst
# Warmup Iteration 3: 105,601 ms/op
Iteration 1: 105,570 ms/op
Iteration 2: 105,475 ms/op
Iteration 3: 105,702 ms/op
Iteration 4: 105,535 ms/op
Iteration 5: 105,766 ms/op

The very first iteration of divVar is indeed much slower, because of the inefficiently compiled OSR stub. But as soon as the method reruns from the beginning, the new, unconstrained version is executed, which leverages all the available compiler optimizations.

Why is 'new_file += line + string' so much faster than 'new_file = new_file + line + string'?

CPython (the reference interpreter) has an optimization for in-place string concatenation (when the string being appended to has no other references). It can't apply this optimization as reliably when doing +, only += (+ involves two live references, the assignment target and the operand, and the former isn't involved in the + operation, so it's harder to optimize it).

You shouldn't rely on this though, per PEP 8:

Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).

For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn't present at all in implementations that don't use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.

Update based on question edits: Yeah, you broke the optimization. You concatenated many strings, not just one, and Python evaluates left-to-right, so it must do the left-most concatenation first. Thus:

new_file_content += line.strip() + "~" + row_region + "\n"

is not at all the same as:

new_file_content = new_file_content + line.strip() + "~" + row_region + "\n"

because the former concatenates all the new pieces together, then appends them to the accumulator string all at once, while the latter must evaluate each addition from left to right with temporaries that don't involve new_file_content itself. Adding parens for clarity, it's like you did:

new_file_content = (((new_file_content + line.strip()) + "~") + row_region) + "\n"

Because the result of the leftmost + feeds another + rather than a store, the interpreter can't drop the extra reference to new_file_content, so the in-place optimization doesn't kick in.

If you changed the second bit of code to:

new_file_content = new_file_content + (line.strip() + "~" + row_region + "\n")

or slightly slower, but still many times faster than your slow code because it keeps the CPython optimization:

new_file_content = new_file_content + line.strip()
new_file_content = new_file_content + "~"
new_file_content = new_file_content + row_region
new_file_content = new_file_content + "\n"

so that the accumulation is obvious to CPython, you'd fix the performance problem. But frankly, you should just use += any time you're performing a logical append operation like this; += exists for a reason, and it provides useful information to both the maintainer and the interpreter. Beyond that, it's good DRY practice; why name the variable twice when you don't need to?

Of course, per the PEP 8 guidelines, even using += here is bad form. In most languages with immutable strings (including most non-CPython Python interpreters), repeated string concatenation is a form of Schlemiel the Painter's algorithm, which causes serious performance problems. The correct solution is to build a list of strings, then join them all at once, e.g.:

new_file_content = []
for i, line in enumerate(content):
    if i == 0:
        # In local tests, += anonymous tuple runs faster than
        # concatenating short strings and then calling append;
        # Python caches small tuples, so creating them is cheap,
        # and syntax is optimized more heavily than function calls
        new_file_content += (line.strip(), "~region\n")
    else:
        country = line.split("~")[13]
        try:
            row_region = regions[country]
        except KeyError:
            row_region = "Undetermined"
        new_file_content += (line.strip(), "~", row_region, "\n")

# Finished accumulating, make final string all at once
new_file_content = "".join(new_file_content)

which is usually faster even when the CPython string concatenation options are available, and will be reliably fast on non-CPython Python interpreters as well because it uses a mutable list to accumulate results efficiently, then allows ''.join to precompute the total length of the string, allocate the final string all at once (instead of incremental resizes along the way), and populate it exactly once.

Side-note: For your specific case, you shouldn't be accumulating or concatenating at all. You've got an input file and an output file, and can process line by line. Every time you would append or accumulate file contents, just write them out instead (I've cleaned up the code a bit for PEP8 compliance and other minor style improvements while I was at it):

start_time = time.monotonic()  # You're on Py3, monotonic is more reliable for timing

# Use with statements for both input and output files
with open(fname) as f, open("CMDB_STAGE060.csv", "w") as new_file:
    # Iterate the input file directly; readlines just means higher peak memory use
    # Maintaining your own counter is silly when enumerate exists
    for i, line in enumerate(f):
        if not i:
            # Write to file directly, don't store
            new_file.write(line.strip() + "~region\n")
        else:
            country = line.split("~")[13]
            # .get exists to avoid try/except when you have a simple, constant default
            row_region = regions.get(country, "Undetermined")
            # Write to file directly, don't store
            new_file.write(line.strip() + "~" + row_region + "\n")

end_time = time.monotonic()
# print will stringify arguments and separate them by spaces for you
print("total time:", end_time - start_time)

Implementation details deep dive

For those curious on implementation details, the CPython string concat optimization is implemented in the byte code interpreter, not on the str type itself (technically, PyUnicode_Append does the mutation optimization, but it requires help from the interpreter to fix up reference counts so it knows it can use the optimization safely; without interpreter help, only C extension modules would ever benefit from that optimization).

When the interpreter detects that both operands are the Python level str type (at C layer, in Python 3, it's still referred to as PyUnicode, a legacy of 2.x days that wasn't worth changing), it calls a special unicode_concatenate function, which checks whether the very next instruction is one of three basic STORE_* instructions. If it is, and the target is the same as the left operand, it clears the target reference so PyUnicode_Append will see only a single reference to the operand, allowing it to invoke the optimized code for a str with a single reference.

This means that not only can you break the optimization by doing

a = a + b + c

you can also break it any time the variable in question is not a top-level (global, nested or local) name. If you're operating on an object attribute, a list index, a dict value, etc., even += won't help you: the interpreter won't see a "simple STORE", so it doesn't clear the target reference, and all of these get the ultra-slow, not-in-place behavior:

foo.x += mystr
foo[0] += mystr
foo['x'] += mystr

It's also specific to the str type; in Python 2, the optimization doesn't help with unicode objects, and in Python 3, it doesn't help with bytes objects, and in neither version will it optimize for subclasses of str; those always take the slow path.
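A hedged sketch of the difference (the function names are mine; both build the same string, but only the local-variable version is eligible for the fast path):

```python
class Holder:
    def __init__(self):
        self.s = ''

def build_via_attribute(n):
    h = Holder()
    for _ in range(n):
        h.s += 'x'        # compiles to STORE_ATTR: no refcount fix-up, slow path
    return h.s

def build_via_local(n):
    s = ''
    for _ in range(n):
        s += 'x'          # compiles to STORE_FAST: eligible for in-place resize
    return s

# Identical results; on CPython, build_via_local scales near-linearly with n
# while build_via_attribute pays a full copy on every append.
```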

Basically, the optimization is there to be as nice as possible in the simplest common cases for people new to Python, but it's not going to go to serious trouble for even moderately more complex cases. This just reinforces the PEP8 recommendation: Depending on implementation details of your interpreter is a bad idea when you could run faster on every interpreter, for any store target, by doing the right thing and using str.join.

what does {$variable1}{$variable2} mean?

The simple answer is yes, it is the same: a variable wrapped in those braces is interpolated into the string.
Note that, sadly, you can't do anything like this:

$a = "hello: {$a+1}";

or

$b = "world: {someFunction($a)}";

It's just a quick way of outputting variables so things don't get all complicated.
Try looking at:

echo "hello mr {$name}!";

vs

echo "hello mr ".$name."!";

The first is much clearer.

It is much better to put variables in braces because it looks nicer, is clearer to read, and also lets you have letters directly after the variable name, e.g.:

echo "Today is the {$date}st";

what is the meaning of the following expression in c++

It concatenates the two variables into one wider integer.

Suppose you have two chars, a and b. In a|b<<8, the << shifts b 8 bits to the left (<< binds tighter than |), and | then sets every bit that is set in either operand.

So in this example, viewed as two bytes in memory on a little-endian machine, the 16-bit result reads "ab".

'a' is 97, 'b' is 98, so bitwise the following happens:

a:               01100001
b:               01100010
b<<8:    0110001000000000
a|b<<8:  0110001001100001
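The same packing can be reproduced in Python (an illustrative sketch, using int.to_bytes just to inspect the byte order):

```python
a = ord('a')              # 97  -> 0b01100001
b = ord('b')              # 98  -> 0b01100010
packed = a | b << 8       # << binds tighter than |, so this is a | (b << 8)
# packed == 25185 == 0b0110001001100001: b in the high byte, a in the low byte
low_first = packed.to_bytes(2, 'little')   # memory order on a little-endian CPU
print(low_first)          # the two bytes come out as a then b
```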

Is it faster (or more efficient) to create a variable rather than use an operation in the output?

In general, you don't need to worry about such operations. The code you have written will be automatically optimized by the compiler, so this won't make your program slower.

In this case, the compiler will generate a temporary variable and store the value of (1024 * kiloByte) in it before the print operation, so the operation is not done multiple times. This is a basic compiler optimization called common subexpression elimination (http://en.wikipedia.org/wiki/Common_subexpression_elimination).
And the Java compiler is usually very good at optimization compared to other compilers.

If you want to know more about General compiler optimization techniques, you can read : http://en.wikipedia.org/wiki/Optimizing_compiler

Before passing to println, the compiler will even generate a temporary variable for the combined string which you are printing.

Use variables when looping. (ex: first loop Variable1=value1 and in second loop variable1=value2)

Edited in some code speed improvements after the OP shared an example file.

You can have your BOI sub accept varying strings as parameters and have it called by a main sub looping through all of them, as follows:

Option Explicit

Sub main() '<~~ main sub calling BOI inside a loop
    Dim VARIABLE1 As Variant, VARIABLE2 As Variant, VARIABLE3 As Variant
    Dim i As Long

    VARIABLE1 = Array("CMGLT", "CMCLT", "VARIABLE3", "VARIABLE4") '<~~ "main" array containing all VARIABLE1 needed values
    VARIABLE2 = Array("114486744", "104074162", "VARIABLE2-3", "VARIABLE2-4") '<~~ VARIABLE2 array, with elements corresponding by position to VARIABLE1 ones -> total element count must match VARIABLE1's
    VARIABLE3 = Array("VARIABLE3-1", "VARIABLE3-2", "VARIABLE3-3", "VARIABLE3-4") '<~~ VARIABLE3 array, with elements corresponding by position to VARIABLE1 ones -> total element count must match VARIABLE1's

    For i = 0 To UBound(VARIABLE1) '<~~ loop over your VARIABLE1 array
        Call BOI(VARIABLE1(i), VARIABLE2(i), VARIABLE3(i)) '<~~ and pass VARIABLE2 and VARIABLE3 corresponding elements, too
    Next i
End Sub

Sub BOI(VARIABLE1 As Variant, VARIABLE2 As Variant, VARIABLE3 As Variant)
    Dim LR As Long
    Dim found As Range

    If Not WorksheetExists(CStr(VARIABLE1)) Then '---------------VARIABLE1
        Sheets.Add.Name = CStr(VARIABLE1)
    Else

        'START GEN CODE
        With Worksheets(CStr(VARIABLE1)) '---------------VARIABLE1 '<~~ instead of selecting/activating the wanted sheet, tell VBA to treat it as the implicit object for any subsequent method or property calls

            'Checking company code
            If InStr(1, .Range("G8"), CStr(VARIABLE2)) Then '---------------VARIABLE2

                'unmerge entire sheet
                .UsedRange.UnMerge '<~~ VBA reads this statement as "Worksheets(CStr(VARIABLE1)).UsedRange.UnMerge"
                'unwrap entire sheet
                .UsedRange.WrapText = False '<~~ act on UsedRange only, to be faster

                LR = .Range("A" & .Rows.Count).End(xlUp).Row '<~~ store last non-empty row index

                'clearing all rows below "Total"
                Set found = .Range("A2:A" & LR).SpecialCells(xlCellTypeConstants, xlTextValues).Find(what:="Total", LookIn:=xlValues, lookat:=xlWhole) '<~~ search relevant cells only
                If Not found Is Nothing Then .Rows(found.Row & ":" & LR).Clear '<~~ Clear() is faster than Delete()

                'set short date format for up to 3000 rows
                LR = .Range("A" & .Rows.Count).End(xlUp).Row '<~~ update last non-empty row index (you possibly deleted some rows before)
                ' .Range("A2", "A3000").NumberFormat = "dd/mm/yyyy"
                .Range("A2:A" & LR).SpecialCells(xlCellTypeConstants, xlNumbers).NumberFormat = "dd/mm/yyyy" '<~~ act on relevant cells only
                'delete blank rows in column A
                .Range("A1:A" & LR).SpecialCells(xlCellTypeBlanks).EntireRow.Delete '<~~ avoid deleting blank rows after the last non-empty one
                'delete rows from 1 to 6
                .Rows("1:6").EntireRow.Delete

                LR = .Range("A" & .Rows.Count).End(xlUp).Row '<~~ update last non-empty row index (you possibly deleted some rows before)
                'changing column width of B column
                .Range("B1").ColumnWidth = 12
                'changing column width of A column
                .Range("A1").ColumnWidth = 12
                'changing formatting of B column to General
                .Range("B1:B" & LR).NumberFormat = "General" '<~~ act on relevant cells only
                'CHANGE THIS AS APPROPRIATE!!!!
                .Range("B1").Value = VARIABLE3 '------------------------------------'VARIABLE3

                'getting date as value
                .Range("C1").FormulaR1C1 = "=TEXT(RC[-2],""DD.MM.YYYY"")" '<~~ instead of selecting and then acting on the selection, just act directly on the range object

                'copying company code and date until last row of data
                .Range("B1").AutoFill Destination:=.Range("B1:B" & LR) '<~~ note the dot: stay on the With block's sheet
                .Range("C1").AutoFill Destination:=.Range("C1:C" & LR)

                'pasting date as value
                ' .Columns("C:C").Select
                ' Selection.Copy
                ' Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
                '     :=False, Transpose:=False
                ' Application.CutCopyMode = False
                With .Columns("C:C").SpecialCells(xlCellTypeFormulas) '<~~ this is equivalent to the above, but much faster
                    .Value = .Value
                End With

                'deleting blank rows in amount column
                On Error Resume Next
                .Range("W1:W" & LR).SpecialCells(xlCellTypeBlanks).EntireRow.Delete '<~~ act on relevant cells only
                On Error GoTo 0 '<~~ always remember to restore standard error trapping as soon as you no longer need to skip errors

                'copying data to "UP" sheet
                LR = .Cells(.Rows.Count, 1).End(xlUp).Row '<~~ update last non-empty row index (you possibly deleted some rows before)
                CopyValues .Range("C1:C" & LR), Worksheets("Up"), "A" '<~~ take advantage of a sub to avoid repeating the same code
                CopyValues .Range("B1:B" & LR), Worksheets("Up"), "C"
                CopyValues .Range("C1:C" & LR), Worksheets("Up"), "D"
                CopyValues .Range("Q1:Q" & LR), Worksheets("Up"), "F"
                CopyValues .Range("Q1:Q" & LR), Worksheets("Up"), "I"
                CopyValues .Range("W1:W" & LR), Worksheets("Up"), "O"

                'END GEN CODE

            Else
                MsgBox VARIABLE1 & " Validation Mismatch. Exiting..." '---------------VARIABLE1
                Exit Sub
            End If
        End With
    End If

End Sub

Sub CopyValues(sourceRng As Range, targetSht As Worksheet, targetCol As String)
    With targetSht
        .Range(targetCol & .Rows.Count).End(xlUp).Offset(1).Resize(sourceRng.Rows.Count).Value = sourceRng.Value
    End With
End Sub

Function WorksheetExists(WSName As String) As Boolean
    On Error Resume Next
    WorksheetExists = Worksheets(WSName).Name = WSName
    On Error GoTo 0
End Function

where I also made a few more of the many possible little code optimizations.

Which one is better in calling a function: two times or storing the result in a variable?

In this example, the 2nd form is better (in most plausible situations [1]) from a performance perspective, and (IMO) more readable.

In general, there are a couple of trade-offs to consider:

  • Is readability more important than efficiency?
  • How much of a "performance hit" is it to call the method twice versus once?

You also need to consider the cases where the method may have side effects, or may give a different answer on two successive calls. In those cases, calling the method twice is semantically different from calling it once and storing the result in a temporary variable.
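A minimal Python sketch of that semantic difference (the counter and function names are invented for illustration):

```python
calls = {'count': 0}

def get_value():
    calls['count'] += 1        # observable side effect on every call
    return calls['count']

# Calling twice: two side effects, and the two results differ.
a, b = get_value(), get_value()

# Storing the result once: one side effect, one consistent value reused.
calls['count'] = 0
v = get_value()
c, d = v, v
```

Here a and b come out as 1 and 2, while c and d are both 1, so the two styles are not interchangeable when the callee is stateful.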


[1] The index variable makes the stack frame one word larger. Normally this doesn't matter, but if the code is in a recursive method that gets called in a deeply recursive fashion, that one extra word multiplied by the number of nested calls could contribute to a StackOverflowError.


