Java - I/O-Bound Threads - 1:1 Threading Model

java.io.InputStream - read()


For a single-threaded Java program, does blocking in read() cause the kernel to move the Java process to the blocked state?

There is no such thing as a truly single-threaded Java program (the JVM always runs background threads of its own, such as the garbage collector), but if there were, yes.

For a multi-threaded Java program, does blocking in read() on one thread allow other threads to use the CPU time slice?

Of course. Otherwise threads would be pointless.

Does the Java process continue in the Running state until its CPU time slice expires?

Yes.
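
To make this concrete, here is a minimal sketch (reading from System.in purely for illustration) in which one thread parks inside a blocking read() while the main thread keeps receiving CPU time:

    import java.io.IOException;
    import java.io.InputStream;

    public class BlockingReadDemo {
        public static void main(String[] args) {
            // Reader thread: parks inside read() until input arrives.
            // While it is blocked, the kernel does not schedule it, so
            // it consumes no CPU time.
            Thread reader = new Thread(() -> {
                try {
                    InputStream in = System.in;
                    int b = in.read();               // blocking call
                    System.out.println("read returned: " + b);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            reader.start();

            // Meanwhile the main thread keeps getting CPU time slices.
            for (int i = 0; i < 5; i++) {
                System.out.println("main thread still running: " + i);
                try { Thread.sleep(500); } catch (InterruptedException ignored) {}
            }
        }
    }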

Why does OS require/maintain kernel-land threads?

I think the use of the term kernel thread is a bit misleading in these figures. I know the figures from a book about operating system design, and if I remember correctly, they refer to the way work is scheduled by the operating system.
In the figures, each process has at least one kernel thread assigned that is scheduled by the kernel.

The N:1 model shows multiple user-land threads that are not known to the kernel at all, because the kernel schedules only the process (or, as the figure calls it, a single kernel thread). So for the kernel, each process is one kernel thread. When the process is assigned a slice of processor time, it runs its multiple threads itself, scheduling them at its own discretion.

In the 1:1 model, the kernel is aware of the user-land threads, and each thread is considered for processor time by the scheduler. So instead of scheduling a whole process, the kernel switches between threads inside processes.

The hybrid model combines both principles: lightweight processes are threads known to the kernel, which schedules them for execution; in addition, such a system implements threads the kernel is not aware of and assigns them processor time in user-land.

And now, to confuse things completely: Linux actually has real kernel threads. But as far as I understand the concept, these are used for kernel-space operations only, e.g. when kernel modules need to do things in parallel.
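
As a small illustration of the 1:1 model in Java: on mainstream HotSpot JVMs (Linux, Windows, macOS), each java.lang.Thread is backed by exactly one kernel thread, which you can verify externally with a tool such as jstack (it prints the native thread id, nid, next to each Java thread). A minimal sketch:

    public class OneToOneDemo {
        public static void main(String[] args) throws InterruptedException {
            // Each of these java.lang.Thread objects is backed by one
            // kernel thread that the OS scheduler sees and schedules
            // independently (the 1:1 model).
            for (int i = 0; i < 4; i++) {
                final int id = i;
                new Thread(() ->
                        System.out.println("thread " + id + " running as "
                                + Thread.currentThread())).start();
            }
            Thread.sleep(1000); // keep the process alive long enough to inspect with jstack
        }
    }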

How to get an ideal number of threads in parallel programs in Java?

The most important consideration is whether your application/calculation is CPU-bound or IO-bound.

  • If it's IO-bound (a single thread spends most of its time waiting for external resources such as database connections, file systems, or other external sources of data), then you can assign (many) more threads than the number of available processors. Of course, how many also depends on how well the external resource scales: local file systems, probably not that much.
  • If it's (mostly) CPU-bound, then slightly over the number of available processors is probably best.
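
A rough sizing sketch along these lines, using the well-known heuristic from Java Concurrency in Practice (threads ≈ cores × utilization × (1 + wait/compute)); the wait/compute ratio below is an assumed number for illustration, not something you can know without measuring:

    public class PoolSizing {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();

            // CPU-bound: roughly one thread per core (sometimes cores + 1).
            int cpuBoundThreads = cores;

            // IO-bound: threads = cores * utilization * (1 + wait/compute).
            // 9.0 (i.e. ~90% of the time spent waiting) is an assumed
            // ratio for illustration; measure your own workload.
            double waitComputeRatio = 9.0;
            int ioBoundThreads = (int) (cores * 1.0 * (1 + waitComputeRatio));

            System.out.println("cores               = " + cores);
            System.out.println("CPU-bound pool size ~ " + cpuBoundThreads);
            System.out.println("IO-bound pool size  ~ " + ioBoundThreads);
        }
    }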

How many threads should I use in my Java program?

Threads are fine, but as others have noted, you have to be highly aware of your bottlenecks. Your algorithm sounds like it would be susceptible to cache contention between multiple CPUs - this is particularly nasty because it has the potential to hit the performance of all of your threads (normally you think of using multiple threads to continue processing while waiting for slow or high latency IO operations).

Cache contention is a very important aspect of using multiple CPUs to process a highly parallelized algorithm: make sure you take your memory utilization into account. If you can construct your data objects so that each thread has its own memory to work on, you can greatly reduce cache contention between the CPUs. For example, it may be easier to have one big array of ints and have different threads working on different parts of that array - but threads writing to neighboring elements will touch the same cache lines (false sharing), which can force a given CPU to reload data from L2 or L3 cache.

Split the data into its own data structures, and configure those data structures so they are thread-local (it might even be more optimal to use ThreadLocal; that actually uses constructs in the OS that provide guarantees the CPU can use to optimize its cache).
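
Here is a minimal sketch of the "give each thread its own memory" idea: each worker accumulates into a thread-private local variable over its own slice of the array, and the partial results are merged only at the end:

    import java.util.*;
    import java.util.concurrent.*;

    public class PartitionedSum {
        public static void main(String[] args) throws Exception {
            int[] data = new int[8_000_000];
            Arrays.fill(data, 1);

            int workers = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            int chunk = data.length / workers;

            List<Future<Long>> parts = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int from = w * chunk;
                final int to = (w == workers - 1) ? data.length : from + chunk;
                // Each task accumulates into its own local variable, so
                // the hot write goes to a thread-private location instead
                // of a shared counter bouncing between CPU caches.
                parts.add(pool.submit(() -> {
                    long local = 0;
                    for (int i = from; i < to; i++) local += data[i];
                    return local;
                }));
            }

            long total = 0;
            for (Future<Long> f : parts) total += f.get();
            pool.shutdown();
            System.out.println("total = " + total);
        }
    }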

The best piece of advice I can give you is test, test, test. Don't make assumptions about how CPUs will perform - there is a huge amount of magic going on in CPUs these days, often with counterintuitive results. Note also that the JIT runtime optimization will add an additional layer of complexity here (maybe good, maybe not).

How does threading save time?


Consider a scenario where only a single-core processor exists. Splitting your task into multiple threads uses the same process context (shared resources), and the threads run concurrently. Since the threads are just sharing time, how come their run time (turnaround time) is less than that of a single-threaded process?

You are entirely correct to be skeptical of any claimed speedup here.

First off, as Servy and others point out in their answers, if the jobs are not processor bound then clearly there can be some speedups here because while the processor is idle waiting for the disk or the network to come back, it could be doing the work of another thread.

But let's suppose you have two processor-bound tasks, a single processor, and either two threads or one thread. In the one-thread scenario it goes like this:

  • Do 100% of the work of job 1. Suppose this takes 1000 ms.
  • Do 100% of the work of job 2. Suppose this takes 1000 ms.

Total time: two seconds. Total jobs done: two. But here's the important bit: the client that was waiting for job 1 got their job done in only one second. The client that was waiting for job 2 had to wait two seconds.

Now if we have two threads and one CPU it goes like this:

  • Do 10% of the work of job 1, for 100 ms.
  • Do 10% of the work of job 2, for 100 ms.
  • Do 10% of the work of job 1
  • Do 10% of the work of job 2
    ...

Again, total time two seconds, but this time the client that was waiting for job 1 got their job done in 1.9 seconds (its tenth 100 ms slice does not end until the 1900 ms mark), nearly 100% slower than the one-thread scenario!

So that's the moral of the story here, that you are entirely correct to point out. If the following conditions are met:

  • The jobs are CPU-bound
  • There are more threads than CPUs
  • The work is useful only for its end result

Then adding more threads only slows you down.

Libraries such as the Task Parallel Library are designed for this scenario; they try to figure out when adding more threads will make things worse, and try to only schedule as many threads as there are CPUs to serve them.
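
The Task Parallel Library is a .NET facility, but Java has an analogue of the same idea: the common ForkJoinPool (used by parallel streams, among other things) sizes its parallelism to the number of available processors minus one by default, precisely to avoid oversubscribing CPU-bound work. A minimal sketch to see this on your machine:

    import java.util.concurrent.ForkJoinPool;

    public class CpuSizedPool {
        public static void main(String[] args) {
            // The common pool's default parallelism is the number of
            // available processors minus one (the submitting thread
            // itself makes up the difference for CPU-bound work).
            System.out.println("cores       = " + Runtime.getRuntime().availableProcessors());
            System.out.println("parallelism = " + ForkJoinPool.commonPool().getParallelism());
        }
    }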

Now, if any of those conditions are not met then adding more threads is a good idea.

  • If the jobs are not CPU bound then adding more threads allows the CPU to do work when it would otherwise be idle, waiting for network or disk.

  • If there are idle CPUs then adding more threads allows those CPUs to be scheduled.

  • If partially-computed results are useful then adding more threads improves the situation because there are more opportunities for clients to consume partially-computed results. In our second scenario, for instance, the clients of both jobs are getting partial results every 200 milliseconds, which is fair.

Java threads and number of cores


Processes vs Threads

In days of old, each process had precisely one thread of execution, so processes were scheduled onto cores directly (and in those old days, there was almost always only one core to schedule onto). However, in operating systems that support threading (which is almost all modern OSes), it is threads, not processes, that are scheduled. So for the rest of this discussion we will talk exclusively about threads, and you should understand that each running process has one or more threads of execution.

Parallelism vs Concurrency

When two threads are running in parallel, they are both running at the same time. For example, if we have two threads, A and B, then their parallel execution would look like this:

CPU 1: A ------------------------->

CPU 2: B ------------------------->

When two threads are running concurrently, their execution overlaps. Overlapping can happen in one of two ways: either the threads are executing at the same time (i.e. in parallel, as above), or their executions are being interleaved on the processor, like so:

CPU 1: A -----------> B ----------> A -----------> B ---------->

So, for our purposes, parallelism can be thought of as a special case of concurrency.*

Scheduling

But we are able to create a thread pool (let's say of 30 threads) that is larger than the number of cores we possess (let's say 4) and have them run concurrently. How is this possible if we only have 4 cores?

In this case, they can run concurrently because the CPU scheduler is giving each one of those 30 threads some share of CPU time. Some threads will be running in parallel (if you have 4 cores, then 4 threads will be running in parallel at any one time), but all 30 threads will be running concurrently. The reason you can then go play games or browse the web is that these new threads are added to the thread pool/queue and also given a share of CPU time.
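
A minimal sketch of exactly this situation: a pool of 30 threads on a machine with (say) 4 cores. At most 4 tasks execute in parallel at any instant, but the scheduler time-slices among all 30, so the interleaved output shows them all making progress concurrently:

    import java.util.concurrent.*;

    public class ThirtyThreadsDemo {
        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(30);
            for (int i = 0; i < 30; i++) {
                final int id = i;
                pool.submit(() -> {
                    for (int step = 0; step < 3; step++) {
                        System.out.println("task " + id + " step " + step);
                        try { Thread.sleep(10); } catch (InterruptedException ignored) {}
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }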

Logical vs Physical Cores

According to my current understanding, a core can only perform 1 process at a time

This is not quite true. Due to very clever hardware design and pipelining that would take much too long to go into here (plus I don't understand it), it is possible for one physical core to be executing two completely different threads of execution at the same time. Chew over that sentence a bit if you need to - it still blows my mind.

This amazing feat is called simultaneous multi-threading (or popularly Hyper-Threading, although that is a proprietary name for a specific instance of such technology). Thus, we have physical cores, which are the actual hardware CPU cores, and logical cores, which is the number of cores the operating system tells software is available for use. Logical cores are essentially an abstraction. In typical modern Intel CPUs, each physical core acts as two logical cores.
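
You can see the logical count from Java: availableProcessors() reports logical cores, so a 4-physical-core machine with Hyper-Threading typically reports 8. The standard Java API does not expose the physical core count; you would have to query the OS for that:

    public class LogicalCores {
        public static void main(String[] args) {
            // Reports *logical* cores: e.g. 8 on a 4-physical-core CPU
            // with two-way simultaneous multi-threading enabled.
            System.out.println("logical cores = " + Runtime.getRuntime().availableProcessors());
        }
    }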

can someone explain how this works and also recommend some good reading on this?

I would recommend Operating System Concepts if you really want to understand how processes, threads, and scheduling all work together.

  * The precise meanings of the terms parallel and concurrent are hotly debated, even here on our very own Stack Overflow. What one means by these terms depends a lot on the application domain.

How to determine optimal number of threads for high latency network requests?

It may shock you, but you do not need any threads for I/O (quantitatively, this means 0 threads). It is good that you have learned that multithreading does not multiply your network bandwidth. Now it is time to learn that threads do computation; they do not do the (high-latency) communication. The communication is performed by the network adapter, which runs genuinely in parallel with the CPU. It is wasteful to allocate a thread (see the resources a thread ties up, listed by the gentleman who claims that you need 1 thread) just to sleep until the network adapter finishes its job. You need no threads for I/O: you need 0 threads.

It makes sense to allocate threads for computation to run in parallel with the I/O request(s). The number of threads depends on the computation-to-communication ratio and is limited by the number of cores in your CPU.

Sorry, I had to say that: even though you have clearly implied a commitment to blocking I/O, so many people do not understand this basic thing. Take the advice: use asynchronous I/O and you'll see that the issue does not exist.
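
As a sketch of that advice, here is asynchronous I/O with the JDK 11 java.net.http.HttpClient: several requests are in flight at once without dedicating a thread per request (the waiting happens in the network layer, not in sleeping threads). The URLs below are placeholders for illustration:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.stream.Collectors;

    public class AsyncFetch {
        public static void main(String[] args) {
            HttpClient client = HttpClient.newHttpClient();
            List<URI> uris = List.of(                     // placeholder URLs
                    URI.create("https://example.com/a"),
                    URI.create("https://example.com/b"),
                    URI.create("https://example.com/c"));

            // sendAsync puts all requests in flight at once; no thread
            // sits blocked per request while the network does its work.
            List<CompletableFuture<Void>> inFlight = uris.stream()
                    .map(uri -> client.sendAsync(
                                    HttpRequest.newBuilder(uri).build(),
                                    HttpResponse.BodyHandlers.ofString())
                            .thenAccept(resp -> System.out.println(
                                    resp.uri() + " -> " + resp.statusCode())))
                    .collect(Collectors.toList());

            // Join only so the demo does not exit before responses arrive.
            inFlight.forEach(CompletableFuture::join);
        }
    }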


