Threading in Python

What is the use of join() in threading?

Some admittedly clumsy ASCII art to demonstrate the mechanism: join() is presumably called by the main thread. It could also be called by another thread, but that would needlessly complicate the diagram.

The join() call should strictly be drawn in the main thread's track, but to express the thread relation and keep the diagram as simple as possible, I chose to place it in the child thread instead.

without join:
+---+---+------------------                    main-thread
    |   |
    |   +...........                           child-thread(short)
    +..................................        child-thread(long)

with join:
+---+---+------------------***********+###    main-thread
    |   |                             |
    |   +...........join()            |        child-thread(short)
    +......................join()......        child-thread(long)

with join and daemon thread:
+-+--+---+------------------***********+###   parent-thread
  |  |   |                             |
  |  |   +...........join()            |       child-thread(short)
  |  +......................join()......       child-thread(long)
  +,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,   child-thread(long + daemonized)

'-' main-thread/parent-thread/main-program execution
'.' child-thread execution
'#' optional parent-thread execution after the join()-blocked parent-thread
    could continue
'*' main-thread 'sleeping' in the join-method, waiting for the child-thread
    to finish
',' daemonized thread - 'ignores' the lifetime of other threads;
    terminates when the main program exits; normally meant for
    join-independent tasks
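
To see the daemon behaviour from the last diagram in running code, here is a minimal sketch (the worker function and sleep durations are made up for the example): the joined thread is waited for, while the daemonized one is killed the moment the main program exits.

from threading import Thread
from time import sleep

def worker(label, seconds):
    sleep(seconds)
    print(label, "finished")

joined = Thread(target=worker, args=("joined", 1))
daemon = Thread(target=worker, args=("daemon", 60), daemon=True)

joined.start()
daemon.start()

joined.join()  # main thread 'sleeps' here until 'joined' finishes
# main exits now; the daemon is terminated mid-sleep and never prints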

So the reason you don't see any changes is that your main thread does nothing after your join().
You could say join() is (only) relevant for the execution flow of the main thread.

If, for example, you want to concurrently download a bunch of pages to concatenate them into a single large page, you may start concurrent downloads using threads, but need to wait until the last page/thread is finished before you start assembling a single page out of many. That's when you use join().
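
A minimal sketch of that pattern, using urllib.request from the standard library (the URLs are placeholders and error handling is omitted):

from threading import Thread
from urllib.request import urlopen

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholders
pages = [None] * len(urls)

def download(i, url):
    with urlopen(url) as response:
        pages[i] = response.read()

threads = [Thread(target=download, args=(i, url)) for i, url in enumerate(urls)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until the last page/thread is finished

# only now is it safe to assemble a single page out of many
combined = b"".join(pages)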

Creating Threads in Python

You don't need to use a subclass of Thread to make this work - take a look at the simple example I'm posting below to see how:

from threading import Thread
from time import sleep

def threaded_function(arg):
    for i in range(arg):
        print("running")
        sleep(1)

if __name__ == "__main__":
    thread = Thread(target=threaded_function, args=(10,))
    thread.start()
    thread.join()
    print("thread finished...exiting")

Here I show how to use the threading module to create a thread which invokes a normal function as its target. You can see how I can pass whatever arguments I need to it in the thread constructor.
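
Keyword arguments work the same way through the kwargs parameter; a quick sketch (the worker function here is made up for the illustration):

from threading import Thread

def report(task, retries=1, verbose=False):  # hypothetical worker function
    if verbose:
        print(f"running {task} with {retries} retries")

t = Thread(target=report, args=("cleanup", 3), kwargs={"verbose": True})
t.start()
t.join()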

Multi-Threading in Python 3.7

To configure your routers simultaneously, you first need to make sure it is safe to do so with the module/library you are using. If only one instance should be running at a time, you will just have to wait for each router to be configured one by one. For example, if the process you are using reads/writes to the same file/database while running and you multi-thread that process, you can end up with a corrupted file/database.
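
If only part of the operation touches shared state, serializing just that part with a threading.Lock is often enough to make it safe; a minimal sketch (the file name is a placeholder):

import threading

log_lock = threading.Lock()

def append_result(line):
    # only one thread at a time may write, so entries can't interleave
    with log_lock:
        with open("results.log", "a") as f:  # placeholder file name
            f.write(line + "\n")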

If you determine that your operation is safe to multi-thread, try something like the following code:

import threading

# Create a class for multi-threading the configuration of a router
class RouterConfigure(threading.Thread):
    def __init__(self, ip: str, hostname: str):
        threading.Thread.__init__(self)
        self.ip = ip
        self.hostname = hostname

    def run(self):
        console_config.config_vrdc(self.hostname, self.ip)

threads = []
hosts = create_vrdc_dic()

# Create a thread for each IP/hostname in the dict
for i in hosts.keys():
    hostname = hosts[i].get('hostname')
    ip = hosts[i].get('mgmt_ip')

    thread = RouterConfigure(ip=ip, hostname=hostname)
    thread.start()
    threads.append(thread)

# Wait for threads to finish before exiting main program
for t in threads:
    t.join()

If there are more hosts to configure than your system can handle concurrently, split the work into chunks so that only x threads are running at the same time, as sketched below.
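
One way to do that, a sketch reusing the same console_config.config_vrdc and create_vrdc_dic helpers from the code above (which come from the question, not the standard library), is to let concurrent.futures.ThreadPoolExecutor cap the number of simultaneous threads:

from concurrent.futures import ThreadPoolExecutor

hosts = create_vrdc_dic()  # same helper as above

def configure(host):
    console_config.config_vrdc(host.get('hostname'), host.get('mgmt_ip'))

# max_workers caps how many threads run at the same time
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(configure, hosts.values())
# leaving the with-block waits for all workers to finish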

How do I stop a thread in Python which itself is being called inside a loop?

As an update, I have managed to resolve this issue. The problem with my earlier answer (shown below) is that .cancel() by itself only seemed to work for one timer thread. But as can be seen in the problem, childFunction() itself calls childFunction() and can also be called by parentFunction(), meaning that there may be multiple timer threads.

What worked for my specific case was naming my threads as below:

t1 = threading.Timer(10, childFunction, args=(var1,var2,number))
t1.name = t1.name + "_timer" + str(number)
t1.start()

Thereafter, I could cancel all timer threads that were created from this process by:

for timerthread in threading.enumerate():
    if timerthread.name.endswith('timer' + str(number)):
        timerthread.cancel()

Below is the ORIGINAL METHOD I USED WHICH CAUSED MANY ISSUES:

I'm not certain whether this is bad practice (in fact I suspect it is, based on the answers linked in the question saying that we should never 'kill a thread'). I'm sure there are reasons why this is not good, and I'd appreciate anyone telling me why. However, the solution that ultimately worked for me was to use .cancel().

So the first change would be to assign your Timer to a variable instead of starting it in a single expression. So instead of threading.Timer(10, childFunction, args=(var1,var2)).start(), it should be:

t = threading.Timer(10, childFunction, args=(var1,var2))
t.start()

Following that, instead of end_process_and_exit_here(), you should use t.cancel(). This seems to work and stops all threads mid-process. However, the bad thing is that the program doesn't seem to carry on with its other parts afterwards.
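
One caveat worth knowing (documented behaviour of threading.Timer): cancel() only prevents the action if the timer is still in its waiting stage; once childFunction has started running, cancelling has no effect. A tiny sketch of that window:

import threading

def action():
    print("timer fired")

t = threading.Timer(10, action)
t.start()
t.cancel()  # called within the 10-second wait, so action() never runs
t.join()    # returns promptly; the timer thread ends without firing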

How to use threading to improve performance in Python

This answer will address improving performance without using concurrency.


The way you structured your search, you are looking for 13 million unique things in each sentence. You said it takes 3-5 minutes for each sentence and that the word lengths in concepts range from one to ten.

I think you can improve the search time by making a set of concepts (either initially when constructed, or from your list), then splitting each sentence into strings of one to ten (consecutive) words and testing for membership in that set.

Example of a sentence split into 4-word strings:

'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems'
# becomes
[('data', 'mining', 'is', 'the'),
('mining', 'is', 'the', 'process'),
('is', 'the', 'process', 'of'),
('the', 'process', 'of', 'discovering'),
('process', 'of', 'discovering', 'patterns'),
('of', 'discovering', 'patterns', 'in'),
('discovering', 'patterns', 'in', 'large'),
('patterns', 'in', 'large', 'data'),
('in', 'large', 'data', 'sets'),
('large', 'data', 'sets', 'involving'),
('data', 'sets', 'involving', 'methods'),
('sets', 'involving', 'methods', 'at'),
('involving', 'methods', 'at', 'the'),
('methods', 'at', 'the', 'intersection'),
('at', 'the', 'intersection', 'of'),
('the', 'intersection', 'of', 'machine'),
('intersection', 'of', 'machine', 'learning'),
('of', 'machine', 'learning', 'statistics'),
('machine', 'learning', 'statistics', 'and'),
('learning', 'statistics', 'and', 'database'),
('statistics', 'and', 'database', 'systems')]

Process:

concepts = set(concepts)
sentence = sentence.split()

# one word
for meme in sentence:
    if meme in concepts:
        pass  # keep it

# two words
for meme in zip(sentence, sentence[1:]):
    if ' '.join(meme) in concepts:
        pass  # keep it

# three words
for meme in zip(sentence, sentence[1:], sentence[2:]):
    if ' '.join(meme) in concepts:
        pass  # keep it

Adapting an itertools recipe (pairwise), you can automate the process of making n-word strings from a sentence:

from itertools import tee

def nwise(iterable, n=2):
    "s -> (s0,s1), (s1,s2), (s2, s3), ... for n=2"
    iterables = tee(iterable, n)
    # advance each iterable to the appropriate starting point
    for i, thing in enumerate(iterables[1:], 1):
        for _ in range(i):
            next(thing, None)
    return zip(*iterables)
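
A quick sanity check of the helper:

>>> list(nwise(['a', 'b', 'c', 'd'], 3))
[('a', 'b', 'c'), ('b', 'c', 'd')]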

Testing each sentence looks like this:

sentence = sentence.strip().split()
for n in [1,2,3,4,5,6,7,8,9,10]:
    for meme in nwise(sentence, n):
        if ' '.join(meme) in concepts:
            pass  # keep meme

I made a set of 13e6 random strings with 20 characters each to approximate your set of concepts:

import random, string
data = set(''.join(random.choice(string.printable) for _ in range(20)) for _ in range(13000000))

Testing a four or forty character string for membership in data consistently takes about 60 nanoseconds. A one-hundred-word sentence has 955 one-to-ten-word strings, so searching that sentence should take ~60 microseconds.
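
That kind of number can be reproduced with something like this sketch (absolute figures will vary by machine):

import timeit

probe = 'x' * 20  # almost certainly not in data; a miss still costs one hash + lookup
n = 1_000_000
seconds = timeit.timeit(lambda: probe in data, number=n)
print(f"{seconds / n * 1e9:.0f} ns per membership test")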

The first sentence from your example, 'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', has 195 possible concepts (one-to-ten-word strings). Timing for the following two functions is about the same: roughly 140 microseconds for f and 150 microseconds for g:

def f(sentence, data=data, nwise=nwise):
    '''iterate over memes in sentence and see if they are in data'''
    sentence = sentence.strip().split()
    found = []
    for n in [1,2,3,4,5,6,7,8,9,10]:
        for meme in nwise(sentence, n):
            meme = ' '.join(meme)
            if meme in data:
                found.append(meme)
    return found

def g(sentence, data=data, nwise=nwise):
    '''make a set of the memes in sentence then find its intersection with data'''
    sentence = sentence.strip().split()
    test_strings = set(' '.join(meme) for n in range(1, 11) for meme in nwise(sentence, n))
    found = test_strings.intersection(data)
    return found

So these are just approximations since I'm not using your actual data, but it should speed things up quite a bit.

After testing with your example data I found that g won't work if a concept appears twice in a sentence - the intermediate set deduplicates repeated memes and loses their order.

So here it is all together, with the concepts listed in the order they are found in each sentence. The new version of f will take longer, but the added time should be relatively small. If possible, would you post a comment letting me know how much longer it is than the original? (I'm curious.)

from itertools import tee

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

concepts = set(concepts)

def nwise(iterable, n=2):
    "s -> (s0,s1), (s1,s2), (s2, s3), ... for n=2"
    iterables = tee(iterable, n)
    # advance each iterable to the appropriate starting point
    for i, thing in enumerate(iterables[1:], 1):
        for _ in range(i):
            next(thing, None)
    return zip(*iterables)

def f(sentence, concepts=concepts, nwise=nwise):
    '''iterate over memes in sentence and see if they are in concepts'''
    indices = set()
    #print(sentence)
    words = sentence.strip().split()
    for n in [1,2,3,4,5,6,7,8,9,10]:
        for meme in nwise(words, n):
            meme = ' '.join(meme)
            if meme in concepts:
                start = sentence.find(meme)
                end = len(meme) + start
                while (start, end) in indices:
                    #print(f'{meme} already found at character:{start} - looking for another one...')
                    start = sentence.find(meme, end)
                    end = len(meme) + start
                indices.add((start, end))
    return [sentence[start:end] for (start, end) in sorted(indices)]

###########
results = []
for sentence in sentences:
    results.append(f(sentence))
    #print(f'{sentence}\n\t{results[-1]})')

In [20]: results
Out[20]:
[['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems'],
 ['data mining', 'interdisciplinary subfield', 'information', 'information'],
 ['data mining', 'knowledge discovery', 'databases process', 'process']]

