Easiest Way to Extract the URLs from an HTML Page Using Sed or Awk Only

How to extract a link from an html file using bash

You can do all of that with your native grep.

These options from grep's man page may be just what you are looking for:

-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)

-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

curl <URL> | grep -o -E "href=[\"'](.*)[\"'] "

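The regular expression is very generic: .* is greedy and the pattern requires a space after the closing quote, so a line with several attributes can yield one over-long match. You may need to refine it for your input, for example by replacing (.*) with [^\"']* and dropping the trailing space.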
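One possible refinement, as a rough sketch only: the greedy group is swapped for a negated character class, the trailing space is dropped, and a small sed step strips the href= prefix and the quotes. The sample HTML and the function name extract_hrefs are made up for illustration. Unquoted or multi-line attributes are not handled.

```shell
# Hypothetical sample input, standing in for the output of curl <URL>.
sample_html=$(cat <<'EOF'
<p><a href="http://example.com/a">A</a> <a href='/b'>B</a></p>
EOF
)

# extract_hrefs is an illustrative helper name, not a standard tool.
# [^"']* stops at the first quote, so each href is matched on its own;
# the sed step then removes the leading href=" and the closing quote.
extract_hrefs() {
    grep -o -E "href=[\"'][^\"']*[\"']" |
        sed -E "s/^href=[\"']//; s/[\"']\$//"
}

printf '%s\n' "$sample_html" | extract_hrefs
```

With the sample line above this should print http://example.com/a and /b on separate lines.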

How to strip out all of the links of an HTML file in Bash or grep or batch and store them in a text file

$ sed -n 's/.*href="\([^"]*\).*/\1/p' file
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
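Note that the command above prints at most one link per input line, because the leading .* is greedy and only the last href on a line survives. If a line may hold several links, one workaround (a sketch, assuming double-quoted href attributes) is to put each href on its own line first and then redirect the result to a text file. The file names page.html and links.txt and the example.com URLs are placeholders.

```shell
# Work in a throwaway directory; page.html and links.txt are made-up names.
workdir=$(mktemp -d)
cat > "$workdir/page.html" <<'EOF'
<li><a href="http://example.com/one">One</a> <a href="http://example.com/two">Two</a></li>
<li><a href="http://example.com/three">Three</a></li>
EOF

# Step 1: start a new line before every href=" (backslash-newline is the
# portable way to write a newline in a sed replacement).
# Step 2: run the same extraction as above, now one href per line,
# and store the links in a text file.
sed 's/href="/\
href="/g' "$workdir/page.html" |
    sed -n 's/.*href="\([^"]*\).*/\1/p' > "$workdir/links.txt"

cat "$workdir/links.txt"
```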

BASH extract link href from a string

For simple cases, you can use sed:

sed -r 's/.*href="([^"]+).*/\1/g'
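To apply this to a string held in a shell variable, something like the following may work. It is a sketch with a made-up tag: it uses -E (the more portable spelling of -r) and -n with the p flag, so a string without an href gives empty output instead of being echoed back unchanged.

```shell
# Hypothetical input string; any single-line tag with a double-quoted href works.
tag='<a class="nav" href="/docs/index.html" title="Docs">Docs</a>'

# -n plus the p flag prints only when the substitution succeeded.
link=$(printf '%s\n' "$tag" | sed -n -E 's/.*href="([^"]+).*/\1/p')

printf '%s\n' "$link"
```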

extracting particular rows from a html file based on the column value using sed or awk

First, I added a newline after each </TR> (the commands are run from Python via os.system):

os.system("sed 's/<\/TR>/&\\\n/g' /tmp/file_full.html > /tmp/file_formated.html")

Then executing the following line gives the result. It checks whether the column value is "ccc" and, if so, writes the row to a separate file.

os.system('sed -n "/<TD>ccc<\/TD>/p" /tmp/file_formated.html > /tmp/file_ccc.html')
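The same two steps can also be run directly in the shell, without the Python wrapper. The sketch below assumes the whole table sits on one line and uses a made-up three-row table and temporary file names in place of the /tmp files above.

```shell
workdir=$(mktemp -d)

# Hypothetical single-line table standing in for /tmp/file_full.html.
printf '%s\n' '<TABLE><TR><TD>1</TD><TD>aaa</TD></TR><TR><TD>2</TD><TD>ccc</TD></TR><TR><TD>3</TD><TD>ccc</TD></TR></TABLE>' \
    > "$workdir/full.html"

# Step 1: break the line after every </TR> so each row sits on its own line.
sed 's/<\/TR>/&\
/g' "$workdir/full.html" > "$workdir/formatted.html"

# Step 2: keep only the rows that contain a <TD>ccc</TD> cell.
sed -n '/<TD>ccc<\/TD>/p' "$workdir/formatted.html" > "$workdir/ccc.html"

cat "$workdir/ccc.html"
```

Because the match is on the literal text <TD>ccc</TD>, cells with attributes or extra whitespace would not be found.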

awk or sed extract each number of column alone

The following awk command could help with this.

awk '{print > (NF "column.txt")}' Input_file

This creates one file per distinct field count. For the input in the question, that is 3 files: 5column.txt, 4column.txt and 3column.txt.
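A small run with made-up input shows the grouping; the original question's data is not reproduced here, so the rows below are placeholders. The output file name is parenthesized as a precaution, since some awk implementations parse an unparenthesized concatenation after > differently, and a dir variable (an addition for this sketch) keeps the files in a temporary directory.

```shell
workdir=$(mktemp -d)

# Hypothetical input: rows with 3, 4, 5 and 3 whitespace-separated fields.
cat > "$workdir/Input_file" <<'EOF'
a b c
d e f g
h i j k l
m n o
EOF

# NF is the number of fields on the current line; each line is appended
# to the file named after its field count.
awk -v dir="$workdir" '{ print > (dir "/" NF "column.txt") }' "$workdir/Input_file"

cat "$workdir/3column.txt"
```

Here 3column.txt receives the first and last rows, while 4column.txt and 5column.txt get one row each.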


