Is there a way to read a Word document line by line?
I would suggest following the code on this page here
The crux of it is that you read the document through a Word.ApplicationClass object (from Microsoft.Office.Interop.Word), although where he's getting the "Doc" object is beyond me. I would assume you create it with the ApplicationClass.
EDIT: Document is retrieved by calling this:
Word.Document doc = wordApp.Documents.Open(ref file, ref nullobj, ref nullobj,
ref nullobj, ref nullobj, ref nullobj,
ref nullobj, ref nullobj, ref nullobj,
ref nullobj, ref nullobj, ref nullobj);
Sadly, the formatting of the code on the page I linked did not make it easy to read.
EDIT2: From there you can loop through the document's paragraphs; however, as far as I can see, there is no way of looping through lines. I would suggest using some pattern matching to find line breaks.
To extract the text from a paragraph, use Word.Paragraph.Range.Text, which returns all the text inside that paragraph. Then search it for line-break characters; I'd use string.IndexOf().
Alternatively, if by lines you mean one sentence at a time, you can simply iterate through Range.Sentences.
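For the line-break approach described above, here is a minimal sketch of the splitting step on its own. It assumes the paragraph text has already been read from Range.Text, and that manual line breaks (Shift+Enter) arrive as the vertical-tab character '\v' while the paragraph mark arrives as '\r'. ParagraphLines and SplitLines are hypothetical names, and IndexOfAny is used in place of IndexOf so several break characters can be searched at once.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphLines
{
    // Assumption: characters that end a line inside Range.Text.
    // '\v' = manual line break, '\r' = paragraph mark, '\n' = plain newline.
    private static readonly char[] LineBreaks = { '\v', '\r', '\n' };

    // Splits one paragraph's text into its non-empty, trimmed lines.
    public static List<string> SplitLines(string paragraphText)
    {
        var lines = new List<string>();
        int start = 0;
        while (start < paragraphText.Length)
        {
            // Find the next break character at or after 'start'.
            int end = paragraphText.IndexOfAny(LineBreaks, start);
            if (end < 0) end = paragraphText.Length;

            string line = paragraphText.Substring(start, end - start).Trim();
            if (line.Length > 0) lines.Add(line);

            start = end + 1; // continue after the break character
        }
        return lines;
    }
}
```

Note that this only finds line breaks that are stored in the text. Lines produced by automatic word wrap depend on the page layout and do not appear as characters in Range.Text.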
Read from a Word document line by line
Ok. I found the solution here.
The final code is as follows:
// Requires a reference to Microsoft.Office.Interop.Word
// using Microsoft.Office.Interop.Word;
// using System.Collections.Generic;
Application word = new Application();
object fileName = path;
// Define an object to pass to the API for missing parameters
object missing = System.Type.Missing;
// Open returns the Document, so there is no need to create one first
Document doc = word.Documents.Open(ref fileName,
    ref missing, ref missing, ref missing, ref missing,
    ref missing, ref missing, ref missing, ref missing,
    ref missing, ref missing, ref missing, ref missing,
    ref missing, ref missing, ref missing);
List<string> data = new List<string>();
// Word collections are 1-based, so iterate from 1 to Count
for (int i = 1; i <= doc.Paragraphs.Count; i++)
{
    string temp = doc.Paragraphs[i].Range.Text.Trim();
    if (temp != string.Empty)
        data.Add(temp);
}
((_Document)doc).Close();
((_Application)word).Quit();
GridView1.DataSource = data;
GridView1.DataBind();
How to check the paragraphs' content to read a .docx file line by line in C#
OK, Plan B.
The database: a table for each chapter is bad design. I therefore used one table for all of it.
This one is a bit quick and dirty. Normally you would have one table for the chapters, one for the subchapters with a column for chapter ID. I recommend improving this once it works.
This is SQLite, but you can easily adapt it to MySQL (InnoDB):
CREATE TABLE Chapters (
    sID integer PRIMARY KEY AUTOINCREMENT,
    chapter text NOT NULL,
    subheading1 text NOT NULL,
    contents1 text NULL,
    subheading2 text NOT NULL,
    contents2 text NULL
);
Since we are basically handling plain text, let us reduce Interop to a bare minimum and do the rest with regex:
// using System.Data.SQLite;
// using System.Text.RegularExpressions;
// using Word = Microsoft.Office.Interop.Word;
var wdApp = new Word.Application();
var doc = wdApp.Documents.Open(@"D:\00_Projekte_temp\Lorem ipsum.docx");
var txt = doc.Content.Text;
doc.Close(false);
wdApp.Quit();
var rex = new Regex(@"(Chapter[\s\t])(.+?)([\r\n]+?)(\s?\-\s?)(.+?[\r\n]+?)(.+?)([\r\n]+?)(\-\s)(.+?[\r\n]+?)(.+?[\r\n])");
var mCol = rex.Matches(txt);
var strSql = @"INSERT INTO Chapters (chapter, subheading1, contents1, subheading2, contents2) VALUES ($chap, $sub1, $con1, $sub2, $con2)";
// Open the connection once and reuse it for every match
using (var con = new SQLiteConnection("Data Source =\"D:\\00_Projekte_temp\\wordtest.db\";Version=3"))
{
    con.Open();
    foreach (Match m in mCol)
    {
        // Groups 5, 9 and 10 include the trailing line break, so trim it off
        var chap = m.Groups[2].Value;
        var subh1 = m.Groups[5].Value.Trim();
        var cont1 = m.Groups[6].Value;
        var subh2 = m.Groups[9].Value.Trim();
        var cont2 = m.Groups[10].Value.Trim();
        // write to db
        using (var cmd = new SQLiteCommand(strSql, con))
        {
            cmd.Parameters.AddWithValue("$chap", chap);
            cmd.Parameters.AddWithValue("$sub1", subh1);
            cmd.Parameters.AddWithValue("$con1", cont1);
            cmd.Parameters.AddWithValue("$sub2", subh2);
            cmd.Parameters.AddWithValue("$con2", cont2);
            cmd.ExecuteNonQuery();
        }
    }
}
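The numbered groups above are easy to mix up, so here is a hedged sketch of the same parsing step with named groups, run against a plain string instead of a Word document. The pattern is a simplified variant rather than the original regex: it assumes each chapter is laid out as "Chapter <title>", then "- <subheading>", a contents line, "- <subheading>", and a contents line, with every line ended by '\r' or '\n'. ChapterParser, Parse, and the group names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ChapterParser
{
    // Simplified pattern (an assumption about the layout, not the original regex).
    // [^\r\n]+ matches one line without its line break, so no trimming is needed.
    private static readonly Regex ChapterPattern = new Regex(
        @"Chapter\s(?<chapter>[^\r\n]+)[\r\n]+" +
        @"\s?-\s?(?<sub1>[^\r\n]+)[\r\n]+" +
        @"(?<con1>[^\r\n]+)[\r\n]+" +
        @"-\s(?<sub2>[^\r\n]+)[\r\n]+" +
        @"(?<con2>[^\r\n]+)");

    // Returns one row per chapter: chapter, subheading1, contents1, subheading2, contents2.
    public static List<string[]> Parse(string text)
    {
        var rows = new List<string[]>();
        foreach (Match m in ChapterPattern.Matches(text))
        {
            rows.Add(new[]
            {
                m.Groups["chapter"].Value,
                m.Groups["sub1"].Value,
                m.Groups["con1"].Value,
                m.Groups["sub2"].Value,
                m.Groups["con2"].Value
            });
        }
        return rows;
    }
}
```

Each returned row lines up with the five parameters of the INSERT statement, so it can be bound to $chap, $sub1, $con1, $sub2 and $con2 in that order.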
I also recommend for the future that either your authors send you plain text directly or that you move from Interop to OpenXml, since that makes you independent from Word and thus also runnable on a server.
VB.Net: Searching Word Document By Line
As already mentioned in the comments, you can't search a Word document the way you are currently doing. You need to create a Word.Application object as mentioned and then load the document so you can search it.
Here is a short example I wrote for you. Please note that you need to add a reference to Microsoft.Office.Interop.Word and then add the import statement to your class, for example Imports Microsoft.Office.Interop. This grabs each paragraph and then uses its range to look for the text you are searching for; if it is found, the paragraph is added to the list.
Note: Tried and tested - I had this in a button event, but put where you need it.
Try
    Dim objWordApp As Word.Application = Nothing
    Dim objDoc As Word.Document = Nothing
    'replace with the text you are searching for
    Dim TextToFind As String = "YOUR TEXT TO FIND"
    Dim TextRange As Word.Range = Nothing
    Dim StringLines As New List(Of String)
    objWordApp = CreateObject("Word.Application")
    If objWordApp IsNot Nothing Then
        objWordApp.Visible = False
        'FileName is the full path of the document to open
        objDoc = objWordApp.Documents.Open(FileName)
    End If
    If objDoc IsNot Nothing Then
        'loop through each paragraph in the document and get the range
        For Each p As Word.Paragraph In objDoc.Paragraphs
            TextRange = p.Range
            TextRange.Find.ClearFormatting()
            If TextRange.Find.Execute(TextToFind) Then
                StringLines.Add(p.Range.Text)
            End If
        Next
        If StringLines.Count > 0 Then
            MessageBox.Show(String.Join(Environment.NewLine, StringLines.ToArray()))
        End If
        objDoc.Close()
        objWordApp.Quit()
    End If
Catch ex As Exception
    'publish your exception?
End Try
Update to use Sentences: this goes through each paragraph, grabs each sentence, and then checks whether the word exists in it. The benefit is that it's quicker, because we get each paragraph and then search only its sentences. We have to get the paragraph first in order to get the sentences.
Try
    Dim objWordApp As Word.Application = Nothing
    Dim objDoc As Word.Document = Nothing
    Dim TextToFind As String = "YOUR TEXT TO FIND"
    Dim TextRange As Word.Range = Nothing
    Dim StringLines As New List(Of String)
    Dim SentenceCount As Integer = 0
    objWordApp = CreateObject("Word.Application")
    If objWordApp IsNot Nothing Then
        objWordApp.Visible = False
        objDoc = objWordApp.Documents.Open(FileName)
    End If
    If objDoc IsNot Nothing Then
        For Each p As Word.Paragraph In objDoc.Paragraphs
            TextRange = p.Range
            TextRange.Find.ClearFormatting()
            SentenceCount = TextRange.Sentences.Count
            If SentenceCount > 0 Then
                Do Until SentenceCount = 0
                    Dim sentence As String = TextRange.Sentences.Item(SentenceCount).Text
                    If sentence.Contains(TextToFind) Then
                        StringLines.Add(sentence.Trim())
                    End If
                    SentenceCount -= 1
                Loop
            End If
        Next
        If StringLines.Count > 0 Then
            MessageBox.Show(String.Join(Environment.NewLine, StringLines.ToArray()))
        End If
        objDoc.Close()
        objWordApp.Quit()
    End If
Catch ex As Exception
    'publish your exception?
End Try
How to read a single line out of a Word document
You are trying to read a docx or docm file, which is a zip archive. Word files are not plain text files, so you won't get anything meaningful treating them as such. You need to open the file with Word or another app that can read such files.
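To see the zip structure for yourself, the following sketch lists the parts inside a .docx using System.IO.Compression. It takes a Stream so it works on a file or on in-memory data. DocxInspector and ListParts are hypothetical names. On .NET Framework this is assumed to need a reference to the System.IO.Compression assembly.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class DocxInspector
{
    // Returns the names of all parts (zip entries) inside a .docx stream.
    public static List<string> ListParts(Stream docx)
    {
        var names = new List<string>();
        // leaveOpen: true so the caller stays responsible for the stream.
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in zip.Entries)
            {
                names.Add(entry.FullName);
            }
        }
        return names;
    }
}
```

For a file on disk, pass File.OpenRead(path) inside a using block. In a typical document you would expect to see an entry such as word/document.xml among the results, which is where the main text usually lives.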
Fastest way to read Word files
For text-only extraction you can search for <w:t> elements in the Word file (a docx is a zip archive of XML files). Please check this assumption (that the document data is in word/document.xml) with 7zip before you use it.
// using System.Collections.Generic;
// using System.IO.Compression;
// using System.Xml;
/// <summary>
/// Returns the text of every <w:t> element (text run) in a Word document.
/// A paragraph that is split into several runs is returned in several pieces.
/// </summary>
public IEnumerable<string> ExtractText(string filename)
{
    // Open zip compressed xml files.
    using var zip = ZipFile.OpenRead(filename);
    // Search for document content.
    using var stream = zip.GetEntry("word/document.xml")?.Open();
    if (stream == null) { yield break; }
    using var reader = XmlReader.Create(stream);
    while (reader.Read())
    {
        // Search for <w:t> values in document.xml
        if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "t")
        {
            yield return reader.ReadElementContentAsString();
        }
    }
}
Usage:
foreach (var text in ExtractText("test.docx"))
{
    Console.WriteLine("READ A TEXT RUN");
    Console.WriteLine(text);
}
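ExtractText yields one string per <w:t> element, so a paragraph that Word has split into several runs arrives in pieces. If you need one string per paragraph, a possible sketch is to group the <w:t> values by their enclosing <w:p> element with LINQ to XML. This takes the already opened word/document.xml stream as input. DocxParagraphs and ReadParagraphs are hypothetical names, and nested paragraphs (for example inside text boxes) are not handled specially.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class DocxParagraphs
{
    // WordprocessingML main namespace, bound to the "w" prefix in document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one string per <w:p>, concatenating all <w:t> runs inside it.
    public static List<string> ReadParagraphs(Stream documentXml)
    {
        XDocument xml = XDocument.Load(documentXml);
        return xml.Descendants(W + "p")
                  .Select(p => string.Concat(
                      p.Descendants(W + "t").Select(t => t.Value)))
                  .ToList();
    }
}
```

To use it on a real file, open the archive with ZipFile.OpenRead(filename), get the word/document.xml entry as in ExtractText, and pass the result of Open() to ReadParagraphs. Unlike the XmlReader version, this loads the whole document into memory, so it trades some speed for simpler grouping.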