How to couple xargs with the pdftotext converter to search inside multiple PDF files
xargs is the wrong tool for this job: find does everything you need built in.
find ~/.personal/tips \
-type f \
-iname "*.pdf" \
-exec pdftotext '{}' - ';' \
| grep hot
That said, if you did want to use xargs for some reason, correct usage would look something like this:
find ~/.personal/tips \
-type f \
-iname "*.pdf" \
-print0 \
| xargs -0 -J % -n 1 pdftotext % - \
| grep hot
Note that:
- The find command uses -print0 to NUL-delimit its output.
- The xargs command uses -0 to NUL-delimit its input (which also turns off some behavior that would lead to incorrect handling of filenames containing whitespace, literal quote characters, etc.).
- The xargs command uses -n 1 to call pdftotext once per file.
- The xargs command uses -J % to specify a sigil for where the replacement should happen, and uses that % in the pdftotext command line appropriately. (-J is a BSD xargs option; GNU xargs does not have it and offers -I % instead.)
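The same search can also be sketched in Python, which sidesteps the filename-quoting issues entirely because each path is passed to pdftotext as a single argument. This is only a rough sketch: the helper names (find_pdfs, matching_lines, pdf_to_text, search_pdfs) are made up for illustration, and it assumes pdftotext is on the PATH and writes UTF-8 to standard output.

```python
import subprocess
from pathlib import Path


def find_pdfs(root):
    """Return every regular file under root whose name ends in .pdf (any case)."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def matching_lines(text, needle):
    """Return the lines of text that contain needle, like a plain grep."""
    return [line for line in text.splitlines() if needle in line]


def pdf_to_text(path):
    """Run `pdftotext FILE -` and return what it writes to standard output."""
    result = subprocess.run(
        ["pdftotext", str(path), "-"],  # argument list, so no shell quoting
        capture_output=True,
        encoding="utf-8",   # assumption: pdftotext emits UTF-8
        errors="replace",
        check=True,
    )
    return result.stdout


def search_pdfs(root, needle):
    """Yield (path, line) for every matching line in every PDF under root."""
    for pdf in find_pdfs(root):
        for line in matching_lines(pdf_to_text(pdf), needle):
            yield pdf, line
```

Iterating over search_pdfs(Path.home() / ".personal" / "tips", "hot") would roughly mirror the find command above, with the added benefit that each match can be reported together with the file it came from.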
PDF text extraction and storing the results as key-value pairs
When working with PDF files, I prefer the PyMuPDF library: https://pypi.org/project/PyMuPDF/
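import fitz  # PyMuPDF

doc = fitz.open("Sample-Cert_rev-7-1.pdf")  # some existing PDF
page = doc[0]  # first page
text = page.get_text("text")  # spelled getText() in older PyMuPDF releases
print(text)
lines = text.split('\n')  # one list entry per extracted line
print(lines)
ix = lines.index('MPC Control #:')  # position of the label
print(ix)
print(lines[ix + 18])  # in this layout the value sits 18 entries after its label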
Pay attention to how you install the library: the PyPI package is named PyMuPDF, although the module is imported as fitz.
Here is the output:
"C:\Program Files\Python38\python.exe" C:/Python/stackoverflow extract_pdf_text1.py
MICRO PRECISION CALIBRATION, INC.
22835 INDUSTRIAL PLACE
GRASS VALLEY CA 95949
530-268-1860
Cert No.
551220083746791
Date: Aug 3, 2020
Certificate of Calibration
AC-1969.00
N/A
July 01, 2021
N/A
Customer:
MPC Control #:
Asset ID:
Gage Type:
Manufacturer:
Model Number:
Size:
Temp/RH:
Serial Number:
Department:
Performed By:
Received Condition:
Returned Condition:
Cal. Date:
Cal. Interval:
Cal. Due Date:
Work Order #:
DIGITAL MULTIMETER
DANNY BOY B. BUTIAL
0258964
0258964
NONE
AGILENT
34401A
10MHZ
SAMPLE
N/A
IN TOLERANCE
IN TOLERANCE
July 01, 2020
N/A
12 MONTHS
Calibration Notes:
SAMPLE COMPANY
23.0°C / 40.0%
Location:
Calibration performed at MPC facility
Standards Used to Calibrate Equipment
I.D.
Description.
Model
Serial
Manufacturer
Cal. Due Date
Traceability #
PH1405
MULTI-PRODUCT CALIBRATOR
5520A
7575006
FLUKE
Sep 10, 2020
551220083204793
AL4394
DIGITAL MULTIMETER
3458A
2823A09832
AGILENT
Aug 1, 2020
551220083719099
Procedures Used in this Event
Procedure Name
Description
MPC Automated Procedure
MPCCAL Rev. 00
STATEMENTS OF PASS OR FAIL CONFORMANCE: The uncertainty of measurement has been taken into account when determining compliance with specification. All measurements and test results guard banded to ensure the
probability of false-accept does not exceed 2% in compliance with ANSI/NCSL Z540.3-2006 and in case without guard banded the probability of false-accept depending on test uncertainty ratio.
THE CALIBRATION REPORT STATUS:
PASS- Term used when compliance statement is given, and the measurement result is PASS.
PASSz- Term used when compliance statement is given, and the measurement result is conditional passed or PASSz.
FAIL- Term used when compliance statement is given, and the measurement result is FAIL.
FAILz- Term used when compliance statement is given, and the measurement result is conditional failed or FAILz.
REPORT OF VALUE - Term used when reported measurement is not requiring compliance statement in report.
ADJUSTED- When adjustments are made to an instrument which changes the value of measurement from what was measured as found to new value as left.
LIMITED - When an instrument fails calibration but is still functional in a limited manner.
The expanded uncertainty of measurement is stated as the standard uncertainty of measurement multiplied by the coverage factor k=2, which for a normal distribution corresponds to a coverage probability of approximately 95%, unless otherwise stated. This
calibration report complies with ISO/IEC 17025:2017 and ANSI/NCSL Z540.3. Calibration cycles and resulting due dates were submitted/approved by the customer. Any number of factors may cause an instrument to drift out of tolerance before the next
scheduled calibration. Recalibration cycles should be based on frequency of use, environmental conditions and customer's established systematic accuracy. All standards are traceable to SI through the National Institute of Standards and Technology (NIST)
and/or recognized national or international standards laboratories. Services rendered include proper manufacturer’s service instruction and are warranted for no less than thirty (30) days. The information on this report pertains only to the instrument identified,
this may not be reproduced in part or in a whole without the prior written approval of the issuing MP Calibration Laboratory.
Rick Hernandez
Calibrating Technician:
QC Approval:
DANNY BOY B. BUTIAL
(CERT, Rev 7)
Page 1 of 1
['MICRO PRECISION CALIBRATION, INC.', '22835 INDUSTRIAL PLACE', 'GRASS VALLEY CA 95949', '530-268-1860', 'Cert No.', '551220083746791', 'Date: Aug 3, 2020', 'Certificate of Calibration', 'AC-1969.00', 'N/A', 'July 01, 2021', 'N/A', 'Customer:', 'MPC Control #:', 'Asset ID:', 'Gage Type:', 'Manufacturer:', 'Model Number:', 'Size:', 'Temp/RH:', 'Serial Number:', 'Department:', 'Performed By:', 'Received Condition:', 'Returned Condition:', 'Cal. Date:', 'Cal. Interval:', 'Cal. Due Date:', 'Work Order #:', 'DIGITAL MULTIMETER', 'DANNY BOY B. BUTIAL', '0258964', '0258964', 'NONE', 'AGILENT', '34401A', '10MHZ', 'SAMPLE', 'N/A', 'IN TOLERANCE', 'IN TOLERANCE', ' July 01, 2020', 'N/A', '12 MONTHS', 'Calibration Notes:', 'SAMPLE COMPANY', '23.0°C / 40.0%', 'Location:', 'Calibration performed at MPC facility', 'Standards Used to Calibrate Equipment', 'I.D.', 'Description.', 'Model', 'Serial', 'Manufacturer', 'Cal. Due Date', 'Traceability #', 'PH1405', 'MULTI-PRODUCT CALIBRATOR', '5520A', '7575006', 'FLUKE', 'Sep 10, 2020', '551220083204793', 'AL4394', 'DIGITAL MULTIMETER', '3458A', '2823A09832', 'AGILENT', 'Aug 1, 2020', '551220083719099', 'Procedures Used in this Event', 'Procedure Name', 'Description', 'MPC Automated Procedure', 'MPCCAL Rev. 00', 'STATEMENTS OF PASS OR FAIL CONFORMANCE: The uncertainty of measurement has been taken into account when determining compliance with specification. 
All measurements and test results guard banded to ensure the', 'probability of false-accept does not exceed 2% in compliance with ANSI/NCSL Z540.3-2006 and in case without guard banded the probability of false-accept depending on test uncertainty ratio.', 'THE CALIBRATION REPORT STATUS:', 'PASS- Term used when compliance statement is given, and the measurement result is PASS.', 'PASSz- Term used when compliance statement is given, and the measurement result is conditional passed or PASSz.', 'FAIL- Term used when compliance statement is given, and the measurement result is FAIL.', 'FAILz- Term used when compliance statement is given, and the measurement result is conditional failed or FAILz.', 'REPORT OF VALUE - Term used when reported measurement is not requiring compliance statement in report.', 'ADJUSTED- When adjustments are made to an instrument which changes the value of measurement from what was measured as found to new value as left.', 'LIMITED - When an instrument fails calibration but is still functional in a limited manner.', 'The expanded uncertainty of measurement is stated as the standard uncertainty of measurement multiplied by the coverage factor k=2, which for a normal distribution corresponds to a coverage probability of approximately 95%, unless otherwise stated. This', 'calibration report complies with ISO/IEC 17025:2017 and ANSI/NCSL Z540.3. Calibration cycles and resulting due dates were submitted/approved by the customer. Any number of factors may cause an instrument to drift out of tolerance before the next', "scheduled calibration. Recalibration cycles should be based on frequency of use, environmental conditions and customer's established systematic accuracy. All standards are traceable to SI through the National Institute of Standards and Technology (NIST)", 'and/or recognized national or international standards laboratories. 
Services rendered include proper manufacturer’s service instruction and are warranted for no less than thirty (30) days. The information on this report pertains only to the instrument identified,', 'this may not be reproduced in part or in a whole without the prior written approval of the issuing MP Calibration Laboratory.', 'Rick Hernandez', 'Calibrating Technician:', 'QC Approval:', 'DANNY BOY B. BUTIAL', '(CERT, Rev 7)', 'Page 1 of 1', '']
13
0258964
Process finished with exit code 0
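To get from single lookups to key-value pairs, the same label-plus-offset idea can be wrapped in a small helper. The following is a minimal sketch, not a general solution: extract_fields is a hypothetical name, the list is an excerpt of the output above starting at 'Customer:', and only the offset for 'MPC Control #:' comes from the code above. The other two offsets are assumptions read off this one sample page and would need checking against other certificates.

```python
def extract_fields(lines, offsets):
    """Map each label to the line found a fixed number of entries after it."""
    fields = {}
    for label, offset in offsets.items():
        ix = lines.index(label)  # raises ValueError if the label is absent
        fields[label.rstrip(":")] = lines[ix + offset].strip()
    return fields


# Excerpt of the extracted lines shown above, starting at 'Customer:'.
lines = [
    "Customer:", "MPC Control #:", "Asset ID:", "Gage Type:",
    "Manufacturer:", "Model Number:", "Size:", "Temp/RH:",
    "Serial Number:", "Department:", "Performed By:",
    "Received Condition:", "Returned Condition:", "Cal. Date:",
    "Cal. Interval:", "Cal. Due Date:", "Work Order #:",
    "DIGITAL MULTIMETER", "DANNY BOY B. BUTIAL", "0258964", "0258964",
    "NONE", "AGILENT", "34401A",
]

offsets = {
    "MPC Control #:": 18,  # offset used in the code above
    "Manufacturer:": 18,   # assumption, read off the sample output
    "Model Number:": 18,   # assumption, read off the sample output
}

fields = extract_fields(lines, offsets)
print(fields)
```

Because the values do not appear in the same order as the labels on this page, each label needs its own offset; a layout change in the PDF would silently shift the results, so it is worth validating the extracted values before relying on them.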
Propagate value of variable to outside of the loop
The problem is the pipe, not the loop. Try it this way:
i=0
_constr=
# read -a splits each line on whitespace into the array arr
while read -r -a arr ; do
    i=$((i + 1))
    _constr+="${arr[2]} "
done < <(dpkg --list | grep linux-image | grep 'ii ')
echo "$i"
echo "${_constr}"
Each command in a pipeline runs in its own subshell, as noted by Blagovest in his comment, so variables set in a loop at the end of a pipe are lost when the loop finishes. Using process substitution instead (this is the < <(commands) syntax) keeps the loop in the current shell, so its changes to global variables persist.
Incidentally, your pipeline could be improved as well:
dpkg --list | grep '^ii.*linux-image'
One less invocation of grep to worry about.
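For comparison, here is a rough Python sketch of the same accumulation. The loop runs in the calling process, so there is no subshell in which the counter or the collected values could get lost. The function names are hypothetical, and the sketch assumes the usual dpkg --list column layout (status, package name, version). Unlike the grep pipeline, it matches linux-image only in the package-name column.

```python
import subprocess


def installed_kernel_versions(dpkg_output):
    """Return the version column of every installed linux-image package."""
    versions = []
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Installed packages start with the status "ii"; the name is column 1
        # and the version is column 2, matching ${arr[2]} in the shell loop.
        if len(fields) > 2 and fields[0] == "ii" and "linux-image" in fields[1]:
            versions.append(fields[2])
    return versions


def list_packages():
    """Run `dpkg --list` and return its standard output as text."""
    return subprocess.run(
        ["dpkg", "--list"], capture_output=True, text=True, check=True
    ).stdout
```

On a Debian-based system, versions = installed_kernel_versions(list_packages()) would give the list of versions, with len(versions) playing the role of $i and " ".join(versions) the role of $_constr.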