Loading and Displaying Large Text Files

Loading and displaying large text files

I would separate the problem into two parts.

The first is the model: the speed of building the Document.

The second is the Document rendering: building the tree of views that represents the Document.

One question is whether you need font effects such as keyword colorizing.

I would start with the Document-building part. IMHO reading the file via EditorKit.read() should be fast even for big files. I would use PlainDocument for this purpose and check whether the pure model is built fast enough for your application. If yes, fine: just use the Document as the model. If not, implement your own Document interface, because AbstractDocument has plenty of machinery for update processing (e.g. writeLock) that adds overhead you may not need.
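As a rough sketch of that measurement (my own illustration, not code from the original answer), you can build a PlainDocument via EditorKit.read() and simply time it:

import java.io.FileReader;
import javax.swing.text.DefaultEditorKit;
import javax.swing.text.PlainDocument;

public class DocumentLoadTimer {
    public static void main(String[] args) throws Exception {
        String path = args[0]; // path to the big text file

        DefaultEditorKit kit = new DefaultEditorKit();
        PlainDocument doc = new PlainDocument();

        long start = System.currentTimeMillis();
        try (FileReader reader = new FileReader(path)) {
            // Builds the pure model; no views are involved yet.
            kit.read(reader, doc, 0);
        }
        long elapsed = System.currentTimeMillis() - start;

        System.out.println("Characters: " + doc.getLength());
        System.out.println("Line elements: " + doc.getDefaultRootElement().getElementCount());
        System.out.println("Model built in " + elapsed + " ms");
    }
}

If this step alone is already too slow or too memory-hungry for your files, no view optimization will help, so it is worth measuring before touching the rendering side.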

Once the Document loads fast enough, we have to solve the Document rendering. By default, the views used in javax.swing.text are really flexible. They are designed as base classes to be extended, and thus contain a lot of code we don't need, e.g. measuring.

For this feature I would use a monospaced font. We don't need wrapping, so measuring the view width is fast: longest row's character count * character width.

The height is likewise character height * number of lines.

So our PlainTextView replacement is really fast. Also, we don't have to render the whole view, only the fragment visible in our scroll pane, so rendering can be much, much faster.
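To make that concrete (a sketch of my own; FixedWidthPlainView is a made-up name, and a real implementation would still have to handle tabs and the case where the view is not yet installed), a PlainView subclass could reduce measurement to exactly that arithmetic:

import java.awt.FontMetrics;
import javax.swing.text.Element;
import javax.swing.text.PlainView;

public class FixedWidthPlainView extends PlainView {

    public FixedWidthPlainView(Element elem) {
        super(elem);
    }

    @Override
    public float getPreferredSpan(int axis) {
        // Assumes a monospaced font, no wrapping, and a view already installed
        // in a text component (getContainer() != null).
        FontMetrics metrics = getContainer().getFontMetrics(getContainer().getFont());
        Element root = getElement();

        if (axis == X_AXIS) {
            // width = longest row's character count * character width
            int longestLine = 0;
            for (int i = 0; i < root.getElementCount(); i++) {
                Element line = root.getElement(i);
                longestLine = Math.max(longestLine, line.getEndOffset() - line.getStartOffset());
            }
            return longestLine * metrics.charWidth('m');
        }

        // height = character height * number of lines
        return root.getElementCount() * metrics.getHeight();
    }
}

The standard PlainView already takes a similar shortcut internally; the point is only that with a fixed-width font and no wrapping, measurement collapses to two multiplications, and painting can be limited to the lines that intersect the scroll pane's visible rectangle.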

Of course, there is still a lot of work to provide correct caret navigation, selection, etc.

Text editor to open big (giant, huge, large) text files

Free read-only viewers:

  • Large Text File Viewer (Windows) – Fully customizable theming (colors, fonts, word wrap, tab size). Supports horizontal and vertical split view. Also supports file following and regex search. Very fast, simple, and has a small executable size.
  • klogg (Windows, macOS, Linux) – A maintained fork of glogg. Its main feature is regular expression search. It supports monitoring file changes (like tail), bookmarks, highlighting patterns using different colors, and has serious optimizations built in. But from a UI standpoint, it's rather minimal.
  • LogExpert (Windows) – "A GUI replacement for tail." It's really a log file analyzer, not a large file viewer, and in one test it required 10 seconds and 700 MB of RAM to load a 250 MB file. But its killer features are the columnizer (parse logs that are in CSV, JSONL, etc. and display in a spreadsheet format) and the highlighter (show lines with certain words in certain colors). Also supports file following, tabs, multiple files, bookmarks, search, plugins, and external tools.
  • Lister (Windows) – Very small and minimalist. It's one executable, barely 500 KB, but it still supports searching (with regexes), printing, a hex editor mode, and settings.

Free editors:

  • Your regular editor or IDE. Modern editors can handle surprisingly large files. In particular, Vim (Windows, macOS, Linux), Emacs (Windows, macOS, Linux), Notepad++ (Windows), Sublime Text (Windows, macOS, Linux), and VS Code (Windows, macOS, Linux) support large (~4 GB) files, assuming you have the RAM.
  • Large File Editor (Windows) – Opens and edits TB+ files, supports Unicode, uses little memory, has XML-specific features, and includes a binary mode.
  • GigaEdit (Windows) – Supports searching, character statistics, and font customization. But it's buggy – with large files, it only allows overwriting characters, not inserting them; it doesn't respect LF as a line terminator, only CRLF; and it's slow.

Builtin programs (no installation required):

  • less (macOS, Linux) – The traditional Unix command-line pager tool. Lets you view text files of practically any size. Can be installed on Windows, too.
  • Notepad (Windows) – Decent with large files, especially with word wrap turned off.
  • MORE (Windows) – This refers to the Windows MORE, not the Unix more. A console program that allows you to view a file, one screen at a time.

Web viewers:

  • readfileonline.com – An HTML5 large file viewer. Supports search.

Paid editors/viewers:

  • 010 Editor (Windows, macOS, Linux) – Opens giant (as large as 50 GB) files.
  • SlickEdit (Windows, macOS, Linux) – Opens large files.
  • UltraEdit (Windows, macOS, Linux) – Opens files of more than 6 GB, but the configuration must be changed for this to be practical: Menu » Advanced » Configuration » File Handling » Temporary Files » Open file without temp file...
  • EmEditor (Windows) – Handles very large text files nicely (officially up to 248 GB, but as much as 900 GB according to one report).
  • BssEditor (Windows) – Handles large files and very long lines. Does not require installation. Free for non-commercial use.
  • loxx (Windows) – Supports file following, highlighting, line numbers, huge files, regex, multiple files and views, and much more. The free version cannot process regexes, filter files, synchronize timestamps, or save changed files.

How do I read a text file of about 2 GB?

Try glogg, "the fast, smart log explorer".

I have opened a log file of around 2 GB with it, and the search is also very fast.

What's the quickest method to display large text files in C#?

DataTable is bulky and slow.

Use WPF and GridView.

Create a class, then a List of that class, and bind to it.

If you want to dynamically update the UI, use an ObservableCollection and retrieve the data on another thread.

How to read a large text file on Windows?

Try this...

Large Text File Viewer

By the way, it is free :)

But, I think you should ask this on serverfault.com instead

Read a large text file on the server and display it on a web page

The best way to resolve your issue is to implement pagination, so you can fetch pages of book pages (pun not intended), if the book is stored on the backend.

If it is stored on the client computer, are you sure you need this? There might be other ways to resolve the problem you are facing, but loading a 200-400 MB file into the browser is really a bad idea.

Best practice for loading a large text file using jQuery get()

"Is there a way to load a chunk of the target text file, parse the chunk, request another chunk, etc.?"

To retrieve a portion (or chunk) of a resource, use the HTTP Range header. Most web servers honor this header correctly.

Example HTTP request:

GET /resource/url/path HTTP/1.1\r\n
User-agent: whatever\r\n
Host: www.example.com\r\n
Range: bytes=0-3200\r\n\r\n

Update:

With jQuery, v1.5 and later, you can tell jQuery to send additional HTTP headers for an individual ajax call, like this:

    $.ajax({type: "GET",
url: url,
headers : { "Range" : 'bytes=0-3200' },
....

If you don't want to modify each $.ajax() call, Stofke says you can use a "beforeSend" function. Something like this:

$.ajaxSetup({
    beforeSend: function(xhr) {
        xhr.setRequestHeader("Range", "bytes=0-3200");
    }
});

Even using $.ajaxSetup(), you'd still need to modify that Range header for each additional call.

I don't know of a way to tell jQuery to load a resource chunkwise, automatically. I guess you'd have to do that yourself with setTimeout(), or something, in the success() handler of the ajax call. Each time you get a response, call setTimeout() and make the next ajax call. Your browser-resident JavaScript will need to keep track of which portions of the file you have already retrieved, and which part (byte index) you want to get next.

Another way to do it would be to make the ajax calls only when your app is ready to do so. Rather than waiting for success() from the ajax call, just make an ajax call after you've finished processing the results of the first range retrieval.

Also, to support the range arithmetic: before doing the first GET, you can use a HEAD request to learn the length of the resource, which tells you the maximum index you can use in the Range header.

loading chunks of large text files from DB

If you go for a relational database, I would store the files line by line in a table. That way it is easy to fetch lines:

SELECT line FROM documents
WHERE docname = 'mydoc'
AND line_nr > 100
ORDER BY line_nr
FETCH FIRST 50 ROWS ONLY;

A b-tree index on (docname, line_nr) would make the query very efficient.

If you want to keep the table from getting too large, use range partitioning on docname.
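The answer above is database-side only; purely as an illustration (assuming a Java client over JDBC and the documents(docname, line_nr, line) table described above), fetching one 50-line chunk could look like this:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class DocumentChunkReader {

    // Returns the next 50 lines of the document after line number `afterLine`.
    public static List<String> fetchChunk(Connection connection, String docname, long afterLine)
            throws SQLException {
        String sql = "SELECT line FROM documents"
                + " WHERE docname = ? AND line_nr > ?"
                + " ORDER BY line_nr"
                + " FETCH FIRST 50 ROWS ONLY";

        List<String> lines = new ArrayList<>();
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, docname);
            statement.setLong(2, afterLine);
            try (ResultSet resultSet = statement.executeQuery()) {
                while (resultSet.next()) {
                    lines.add(resultSet.getString(1));
                }
            }
        }
        return lines;
    }
}

Because each request seeks on (docname, line_nr) and limits the row count, every chunk is answered directly from the b-tree index instead of skipping rows with OFFSET.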

Displaying large text files with WPF C#

I have this file reading algorithm from a proof-of-concept application (which was also a log file viewer/diff viewer). The implementation requires C# 8.0 (.NET Core 3.x or .NET 5). I removed some indexing, cancellation, etc. to reduce noise and show the core of the algorithm.

It performs quite fast and compares very well with editors like Visual Studio Code. It can't get much faster. To keep the UI responsive, I highly recommend using UI virtualization. If you implement UI virtualization, the bottleneck will be the file reading operation. You can tweak the algorithm's performance by using different partition sizes (you could implement some smart partitioning to calculate them dynamically).

The key parts of the algorithm are

  • asynchronous implementation of Producer-Consumer pattern using Channel
  • partitioning of the source file into blocks of n bytes
  • parallel processing of file partitions (concurrent file reading)
  • merging the result document blocks and overlapping lines

DocumentBlock.cs

The result struct that holds the lines of a processed file partition.

using System.Collections.Generic;

public readonly struct DocumentBlock
{
    public DocumentBlock(long rank, IList<string> content, bool hasOverflow)
    {
        this.Rank = rank;
        this.Content = content;
        this.HasOverflow = hasOverflow;
    }

    public long Rank { get; }
    public IList<string> Content { get; }
    public bool HasOverflow { get; }
}

ViewModel.cs

The entry point is the public ViewModel.ReadFileAsync member.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.ComponentModel;
using System.IO;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Text;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

class ViewModel : INotifyPropertyChanged
{
    public ViewModel() => this.DocumentBlocks = new ConcurrentBag<DocumentBlock>();

    // TODO::Make reentrant
    // (for example cancel running operations and
    // lock/synchronize the method using a SemaphoreSlim)
    public async Task ReadFileAsync(string filePath)
    {
        using var cancellationTokenSource = new CancellationTokenSource();

        this.DocumentBlocks.Clear();
        this.EndOfFileReached = false;

        // Create the channel (Producer-Consumer implementation)
        BoundedChannelOptions channelOptions = new BoundedChannelOptions(Environment.ProcessorCount)
        {
            FullMode = BoundedChannelFullMode.Wait,
            AllowSynchronousContinuations = false,
            SingleWriter = true
        };

        var channel = Channel.CreateBounded<(long PartitionLowerBound, long PartitionUpperBound)>(channelOptions);

        // Create consumer threads
        var tasks = new List<Task>();
        for (int threadIndex = 0; threadIndex < Environment.ProcessorCount; threadIndex++)
        {
            Task task = Task.Run(async () => await ConsumeFilePartitionsAsync(channel.Reader, filePath, cancellationTokenSource));
            tasks.Add(task);
        }

        // Produce document byte blocks
        await ProduceFilePartitionsAsync(channel.Writer, cancellationTokenSource.Token);
        await Task.WhenAll(tasks);
        CreateFileContent();
        this.DocumentBlocks.Clear();
    }

    private void CreateFileContent()
    {
        var document = new List<string>();
        string overflowingLineContent = string.Empty;
        bool isOverflowMergePending = false;

        var orderedDocumentBlocks = this.DocumentBlocks.OrderBy(documentBlock => documentBlock.Rank);
        foreach (var documentBlock in orderedDocumentBlocks)
        {
            if (isOverflowMergePending)
            {
                // The previous block ended mid-line; prepend its fragment to this block's first line.
                documentBlock.Content[0] = overflowingLineContent + documentBlock.Content[0];
                isOverflowMergePending = false;
            }

            if (documentBlock.HasOverflow)
            {
                overflowingLineContent = documentBlock.Content.Last();
                documentBlock.Content.RemoveAt(documentBlock.Content.Count - 1);
                isOverflowMergePending = true;
            }

            document.AddRange(documentBlock.Content);
        }

        this.FileContent = new ObservableCollection<string>(document);
    }

    private async Task ProduceFilePartitionsAsync(
        ChannelWriter<(long PartitionLowerBound, long PartitionUpperBound)> channelWriter,
        CancellationToken cancellationToken)
    {
        var iterationCount = 0;
        while (!this.EndOfFileReached)
        {
            try
            {
                var partition = (iterationCount++ * ViewModel.PartitionSizeInBytes,
                    iterationCount * ViewModel.PartitionSizeInBytes);
                await channelWriter.WriteAsync(partition, cancellationToken);
            }
            catch (OperationCanceledException)
            { }
        }
        channelWriter.Complete();
    }

    private async Task ConsumeFilePartitionsAsync(
        ChannelReader<(long PartitionLowerBound, long PartitionUpperBound)> channelReader,
        string filePath,
        CancellationTokenSource waitingChannelWriterCancellationTokenSource)
    {
        await using var file = File.OpenRead(filePath);
        using var reader = new StreamReader(file);

        await foreach ((long PartitionLowerBound, long PartitionUpperBound) filePartitionInfo
            in channelReader.ReadAllAsync())
        {
            if (filePartitionInfo.PartitionLowerBound >= file.Length)
            {
                this.EndOfFileReached = true;
                waitingChannelWriterCancellationTokenSource.Cancel();
                return;
            }

            var documentBlockLines = new List<string>();
            file.Seek(filePartitionInfo.PartitionLowerBound, SeekOrigin.Begin);
            var filePartition = new byte[filePartitionInfo.PartitionUpperBound - filePartitionInfo.PartitionLowerBound];
            int bytesRead = await file.ReadAsync(filePartition, 0, filePartition.Length);
            if (bytesRead < filePartition.Length)
            {
                // The last partition reaches past the end of the file; drop the unread bytes.
                Array.Resize(ref filePartition, bytesRead);
            }

            // Extract lines
            bool isLastLineComplete = ExtractLinesFromFilePartition(filePartition, documentBlockLines);

            bool documentBlockHasOverflow = !isLastLineComplete && file.Position != file.Length;
            var documentBlock = new DocumentBlock(filePartitionInfo.PartitionLowerBound, documentBlockLines, documentBlockHasOverflow);
            this.DocumentBlocks.Add(documentBlock);
        }
    }

    private bool ExtractLinesFromFilePartition(byte[] filePartition, List<string> resultDocumentBlockLines)
    {
        bool isLineFound = false;
        for (int bufferIndex = 0; bufferIndex < filePartition.Length; bufferIndex++)
        {
            isLineFound = false;
            int lineBeginIndex = bufferIndex;
            while (bufferIndex < filePartition.Length
                && !(isLineFound = ((char)filePartition[bufferIndex]).Equals('\n')))
            {
                bufferIndex++;
            }

            int lineByteCount = bufferIndex - lineBeginIndex;
            if (lineByteCount.Equals(0))
            {
                resultDocumentBlockLines.Add(string.Empty);
            }
            else
            {
                var lineBytes = new byte[lineByteCount];
                Array.Copy(filePartition, lineBeginIndex, lineBytes, 0, lineBytes.Length);
                string lineContent = Encoding.UTF8.GetString(lineBytes).Trim('\r');
                resultDocumentBlockLines.Add(lineContent);
            }
        }

        return isLineFound;
    }

    protected virtual void OnPropertyChanged([CallerMemberName] string propertyName = "")
        => this.PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));

    public event PropertyChangedEventHandler PropertyChanged;
    private const long PartitionSizeInBytes = 100000;
    private bool EndOfFileReached { get; set; }
    private ConcurrentBag<DocumentBlock> DocumentBlocks { get; }

    private ObservableCollection<string> fileContent;
    public ObservableCollection<string> FileContent
    {
        get => this.fileContent;
        set
        {
            this.fileContent = value;
            OnPropertyChanged();
        }
    }
}

To implement a very simple form of UI virtualization, this example uses a plain ListBox, where all mouse effects are removed from the ListBoxItem elements in order to get rid of the ListBox look and feel (an indeterminate progress indicator is highly recommended). You can enhance the example to allow multi-line text selection (e.g., to allow copying text to the clipboard).

MainWindow.xaml

<Window>
  <Window.DataContext>
    <ViewModel />
  </Window.DataContext>

  <ListBox ScrollViewer.VerticalScrollBarVisibility="Visible"
           ItemsSource="{Binding FileContent}"
           Height="400">
    <ListBox.ItemContainerStyle>
      <Style TargetType="ListBoxItem">
        <Setter Property="Template">
          <Setter.Value>
            <ControlTemplate TargetType="ListBoxItem">
              <ContentPresenter />
            </ControlTemplate>
          </Setter.Value>
        </Setter>
      </Style>
    </ListBox.ItemContainerStyle>
  </ListBox>
</Window>

If you are more advanced, you can implement your own powerful document viewer, e.g. by extending VirtualizingPanel and using low-level text rendering. This lets you increase performance further if you are interested in text search and highlighting (in this context, stay far away from RichTextBox or FlowDocument, as they are too slow).

In any case, you now have a well-performing text file reading algorithm that you can use to generate the data source for your UI implementation.

If this viewer is not your main product, but just a development tool to help you process log files, I don't recommend implementing your own log file viewer. There are plenty of free and paid applications out there.


