Read from Word Document Line by Line

Is there a way to read a Word document line by line?

I would suggest following the code on the linked page.

The crux of it is that you read the document with a Word.ApplicationClass (Microsoft.Office.Interop.Word) object, although where he's getting the "Doc" object is beyond me. I would assume you create it with the ApplicationClass.

EDIT: Document is retrieved by calling this:

Word.Document doc = wordApp.Documents.Open(ref file, ref nullobj, ref nullobj,
    ref nullobj, ref nullobj, ref nullobj,
    ref nullobj, ref nullobj, ref nullobj,
    ref nullobj, ref nullobj, ref nullobj);

Sadly, the code on the page I linked isn't formatted in a way that makes it easy to read.

EDIT2: From there you can loop through the document's paragraphs (doc.Paragraphs). However, as far as I can see, there is no way to loop through lines directly, so I would suggest using some pattern matching to find line breaks.

To extract the text from a paragraph, use Word.Paragraph.Range.Text, which returns all the text inside the paragraph. Then search it for line-break characters; I'd use string.IndexOf().

Alternatively, if by "lines" you mean one sentence at a time, you can simply iterate through Range.Sentences.
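As a rough illustration of the line-break search described above, here is a minimal sketch that splits the text of one paragraph into lines. It assumes that manual line breaks (Shift+Enter) show up in Range.Text as the vertical-tab character '\v' and that the paragraph ends with '\r'; please verify this against your own documents. ParagraphLines and SplitLines are hypothetical names, not part of the Interop API. Lines that Word only wraps on screen are not marked in the text at all, so they cannot be recovered this way.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphLines
{
    // Splits the text of a single paragraph (e.g. Paragraph.Range.Text)
    // at manual line breaks. Assumption: a manual break is '\v'.
    public static List<string> SplitLines(string paragraphText)
    {
        var lines = new List<string>();
        if (string.IsNullOrEmpty(paragraphText))
        {
            return lines;
        }

        // Drop the trailing paragraph mark before splitting.
        string body = paragraphText.TrimEnd('\r', '\n');
        if (body.Length == 0)
        {
            return lines;
        }

        foreach (string line in body.Split('\v'))
        {
            lines.Add(line);
        }
        return lines;
    }
}
```

You would call it once per paragraph, for example ParagraphLines.SplitLines(paragraph.Range.Text).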

Read from word document line by line

Ok. I found the solution here.


The final code is as follows:

// using System.Collections.Generic;
// using Microsoft.Office.Interop.Word;
Application word = new Application();
Document doc = null;

object fileName = path;
// Define an object to pass to the API for missing parameters
object missing = System.Type.Missing;
doc = word.Documents.Open(ref fileName,
    ref missing, ref missing, ref missing, ref missing,
    ref missing, ref missing, ref missing, ref missing,
    ref missing, ref missing, ref missing, ref missing,
    ref missing, ref missing, ref missing);

List<string> data = new List<string>();
// Word collections are 1-based, so count from 1 to Paragraphs.Count
for (int i = 1; i <= doc.Paragraphs.Count; i++)
{
    // Trim() also removes the trailing paragraph mark
    string temp = doc.Paragraphs[i].Range.Text.Trim();
    if (temp != string.Empty)
        data.Add(temp);
}
((_Document)doc).Close();
((_Application)word).Quit();

GridView1.DataSource = data;
GridView1.DataBind();

How to check the paragraphs' content to read a .docx file line by line in C#

OK, Plan B.

The database: a table for each chapter is bad design. I therefore used one table for all of it.

This one is a bit quick and dirty. Normally you would have one table for the chapters, one for the subchapters with a column for chapter ID. I recommend improving this once it works.

This is SQLite, but you can easily adapt it to InnoDB:

CREATE TABLE Chapters (
    sID integer PRIMARY KEY AUTOINCREMENT,
    chapter text NOT NULL,
    subheading1 text NOT NULL,
    contents1 text NULL,
    subheading2 text NOT NULL,
    contents2 text NULL
);

Since we are basically handling plain text, let us reduce Interop to a bare minimum and do the rest with regex:

var wdApp = new Word.Application();
var doc = wdApp.Documents.Open(@"D:\00_Projekte_temp\Lorem ipsum.docx");

var txt = doc.Content.Text;

doc.Close(false);
wdApp.Quit();

var rex = new Regex(@"(Chapter[\s\t])(.+?)([\r\n]+?)(\s?\-\s?)(.+?[\r\n]+?)(.+?)([\r\n]+?)(\-\s)(.+?[\r\n]+?)(.+?[\r\n])");
var mCol = rex.Matches(txt);

foreach (Match m in mCol)
{
    var chap = m.Groups[2].Value;
    var subh1 = m.Groups[5].Value;
    var cont1 = m.Groups[6].Value;
    var subh2 = m.Groups[9].Value;
    var cont2 = m.Groups[10].Value;

    //write to db
    var strSql = @"INSERT INTO Chapters (chapter, subheading1, contents1, subheading2, contents2) VALUES ($chap, $sub1, $con1, $sub2, $con2)";
    using (var con = new SQLiteConnection("Data Source =\"D:\\00_Projekte_temp\\wordtest.db\";Version=3"))
    {
        con.Open();
        using (var cmd = new SQLiteCommand(strSql, con))
        {
            cmd.Parameters.AddWithValue("$chap", chap);
            cmd.Parameters.AddWithValue("$sub1", subh1);
            cmd.Parameters.AddWithValue("$con1", cont1);
            cmd.Parameters.AddWithValue("$sub2", subh2);
            cmd.Parameters.AddWithValue("$con2", cont2);
            cmd.ExecuteNonQuery();
        }
        con.Close();
    }
}

I also recommend for the future that either your authors send you plain text directly or that you move from Interop to OpenXml, since that makes you independent of Word and thus lets the code run on a server as well.

VB.Net: Searching Word Document By Line

As already mentioned in the comments, you can't search a Word document the way you are currently doing it. You need to create a Word.Application object and then load the document so you can search it.

Here is a short example I wrote for you. Please note that you need to add a reference to Microsoft.Office.Interop.Word and then add the import statement to your class, for example Imports Microsoft.Office.Interop. The code grabs each paragraph and uses its range to look for the word you are searching for; if the word is found, the paragraph text is added to the list.

Note: Tried and tested. I had this in a button event, but put it where you need it.

Try
    Dim objWordApp As Word.Application = Nothing
    Dim objDoc As Word.Document = Nothing
    Dim TextToFind As String = "YOUR TEXT TO FIND"
    Dim TextRange As Word.Range = Nothing
    Dim StringLines As New List(Of String)

    objWordApp = CreateObject("Word.Application")

    If objWordApp IsNot Nothing Then
        objWordApp.Visible = False
        'FileName is the full path of the document to search
        objDoc = objWordApp.Documents.Open(FileName)
    End If

    If objDoc IsNot Nothing Then

        'loop through each paragraph in the document and get the range
        For Each p As Word.Paragraph In objDoc.Paragraphs
            TextRange = p.Range
            TextRange.Find.ClearFormatting()

            'Execute returns True if the text occurs in this paragraph
            If TextRange.Find.Execute(TextToFind) Then
                StringLines.Add(p.Range.Text)
            End If
        Next

        If StringLines.Count > 0 Then
            MessageBox.Show(String.Join(Environment.NewLine, StringLines.ToArray()))
        End If

        objDoc.Close()
        objWordApp.Quit()

    End If

Catch ex As Exception
    'publish your exception?
End Try

Update to use Sentences: this goes through each paragraph and grabs each sentence, so we can check whether the word exists in it. The benefit of this is that it's quicker, because we get each paragraph and then search only its sentences. We have to get the paragraph first in order to get its sentences.

Try
    Dim objWordApp As Word.Application = Nothing
    Dim objDoc As Word.Document = Nothing
    Dim TextToFind As String = "YOUR TEXT TO FIND"
    Dim TextRange As Word.Range = Nothing
    Dim StringLines As New List(Of String)
    Dim SentenceCount As Integer = 0

    objWordApp = CreateObject("Word.Application")

    If objWordApp IsNot Nothing Then
        objWordApp.Visible = False
        'FileName is the full path of the document to search
        objDoc = objWordApp.Documents.Open(FileName)
    End If

    If objDoc IsNot Nothing Then

        For Each p As Word.Paragraph In objDoc.Paragraphs
            TextRange = p.Range
            TextRange.Find.ClearFormatting()
            SentenceCount = TextRange.Sentences.Count
            If SentenceCount > 0 Then
                'Sentences is 1-based; this walks from the last sentence to the first
                Do Until SentenceCount = 0
                    Dim sentence As String = TextRange.Sentences.Item(SentenceCount).Text
                    If sentence.Contains(TextToFind) Then
                        StringLines.Add(sentence.Trim())
                    End If

                    SentenceCount -= 1
                Loop
            End If
        Next

        If StringLines.Count > 0 Then
            MessageBox.Show(String.Join(Environment.NewLine, StringLines.ToArray()))
        End If

        objDoc.Close()
        objWordApp.Quit()

    End If

Catch ex As Exception
    'publish your exception?
End Try

How to read a single line out of a word document

You are trying to read a docx or docm file, which is a zip archive. Word files are not plain text files, so you won't get anything meaningful treating them as such. You need to open the file with Word or another app that can read such files.
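To see this for yourself, a small sketch like the following checks whether a file starts with the zip local-file-header signature (the bytes 'P', 'K', 0x03, 0x04), which is what a .docx or .docm normally begins with. ZipCheck and LooksLikeZip are hypothetical names chosen for illustration, and the check only looks at the first four bytes; it does not prove that the file is a valid Word document.

```csharp
using System;
using System.IO;

public static class ZipCheck
{
    // True if the buffer starts with the zip local-file-header signature "PK\x03\x04".
    public static bool LooksLikeZip(byte[] header)
    {
        return header != null && header.Length >= 4
            && header[0] == 0x50 && header[1] == 0x4B
            && header[2] == 0x03 && header[3] == 0x04;
    }

    // Reads the first four bytes of the file and applies the same check.
    public static bool LooksLikeZip(string path)
    {
        var header = new byte[4];
        using (var fs = File.OpenRead(path))
        {
            int read = fs.Read(header, 0, header.Length);
            return read == 4 && LooksLikeZip(header);
        }
    }
}
```

If the check returns true, reading the file with File.ReadAllLines or a StreamReader will only give you compressed binary data, which is why the text looks meaningless.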

Fastest way to read word files

For text-only extraction, you can search for <w:t> elements in the Word file (a .docx is a zip archive of XML files). Please check these assumptions (that the document data is in word/document.xml) with 7-Zip before you use it.

// using System.Collections.Generic;
// using System.IO.Compression;
// using System.Xml;

/// <summary>
/// Returns the content of every w:t text element in a Word document.
/// A single paragraph may consist of several such elements.
/// </summary>
public IEnumerable<string> ExtractText(string filename)
{
    // Open zip compressed xml files.
    using var zip = ZipFile.OpenRead(filename);
    // Search for document content.
    using var stream = zip.GetEntry("word/document.xml")?.Open();
    if (stream == null) { yield break; }
    using var reader = XmlReader.Create(stream);
    reader.Read();
    while (!reader.EOF)
    {
        // Search for <w:t> values in document.xml
        if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "t")
        {
            // ReadElementContentAsString already moves past </w:t>,
            // so skip the Read() below to avoid missing the next node.
            yield return reader.ReadElementContentAsString();
            continue;
        }
        reader.Read();
    }
}

Usage:

foreach (var text in ExtractText("test.docx"))
{
    // Each item is the content of one <w:t> element, not necessarily a whole paragraph
    Console.WriteLine("READ A TEXT RUN");
    Console.WriteLine(text);
}
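The ExtractText method above yields one string per <w:t> element, so a paragraph that Word has split into several runs comes back in pieces. If you need whole paragraphs, one possible variation is to collect the text until the closing </w:p> tag. The sketch below is only that, a sketch: DocxParagraphs and ExtractParagraphs are hypothetical names, it takes a Stream instead of a file name, it skips empty self-closing paragraphs, and it does not treat paragraphs nested in text boxes specially.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

public static class DocxParagraphs
{
    // Namespace of the WordprocessingML elements (w:p, w:t, ...).
    private const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Yields the concatenated <w:t> text of each <w:p> in word/document.xml.
    public static IEnumerable<string> ExtractParagraphs(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            var entry = zip.GetEntry("word/document.xml");
            if (entry == null) { yield break; }

            using (var stream = entry.Open())
            using (var reader = XmlReader.Create(stream))
            {
                var current = new StringBuilder();
                reader.Read();
                while (!reader.EOF)
                {
                    bool isW = reader.NamespaceURI == W;
                    if (isW && reader.NodeType == XmlNodeType.Element && reader.LocalName == "t")
                    {
                        // This call moves the reader past </w:t>, so do not Read() again.
                        current.Append(reader.ReadElementContentAsString());
                        continue;
                    }
                    if (isW && reader.NodeType == XmlNodeType.EndElement && reader.LocalName == "p")
                    {
                        // End of a paragraph: emit what was collected so far.
                        yield return current.ToString();
                        current.Clear();
                    }
                    reader.Read();
                }
            }
        }
    }
}
```

A caller would open the file with File.OpenRead("test.docx") and loop over DocxParagraphs.ExtractParagraphs(stream) in the same way as in the usage example above.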

