How to "Scan" a Website (Or Page) For Info, and Bring It into My Program


Use an HTML parser like Jsoup. It has my preference over the other HTML parsers available in Java since it supports jQuery-like CSS selectors. Also, its class representing a list of nodes, Elements, implements Iterable, so you can iterate over it in an enhanced for loop (there's no need to hassle with verbose Node and NodeList-like classes as in the average Java DOM parser).

Here's a basic kickoff example (just put the latest Jsoup JAR file on the classpath):

package com.stackoverflow.q2835505;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws Exception {
        String url = "https://stackoverflow.com/questions/2835505";
        Document document = Jsoup.connect(url).get();

        String question = document.select("#question .post-text").text();
        System.out.println("Question: " + question);

        Elements answerers = document.select("#answers .user-details a");
        for (Element answerer : answerers) {
            System.out.println("Answerer: " + answerer.text());
        }
    }

}

As you might have guessed, this prints your own question and the names of all answerers.

Program to scan my website and find all the pages which link to an external website

In the end, the easiest way was an SQL query against the database, so when you have a similar problem, check whether it is doable via SQL before looking for external programs.
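For example, if the site's pages live in a database (as in a CMS), one query over the stored HTML already narrows the problem down. Here is a minimal sketch in Python, assuming a hypothetical SQLite database site.db with a pages table holding url and html columns (all of these names are made up for illustration):

import re
import sqlite3

conn = sqlite3.connect("site.db")

# Pull every stored page that contains an absolute link at all ...
rows = conn.execute(
    """SELECT url, html FROM pages WHERE html LIKE '%href="http%'"""
)

# ... then check in Python which of those links leave our own domain.
own_domain = "mysite.example"  # hypothetical domain, replace with yours
for url, html in rows:
    externals = [href for href in re.findall(r'href="(https?://[^"]+)"', html)
                 if own_domain not in href]
    if externals:
        print(url, "->", externals)

conn.close()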

Thanks to everyone for suggestions.

Using Python to sign into website, fill in a form, then sign out

import urllib
import urllib2

# Value for the "name" form field; replace it with whatever the form expects.
name = "name field"
data = {
    "name": name
}

# URL-encode the form data and POST it to the form's action URL.
encoded_data = urllib.urlencode(data)
content = urllib2.urlopen("http://www.abc.com/messages.php?action=send",
                          encoded_data)
print content.readlines()

Just replace http://www.abc.com/messages.php?action=send with the URL where your form is being submitted.

Reply to your comment: if that URL is the one where your form is located, and you need to do this for just one website, look at the source code of the page and find

<form method="POST" action="some_address.php">

and put this address as the parameter for urllib2.urlopen.

You also have to realise what the submit button does: it just sends an HTTP request to the URL defined by the action attribute of the form. So what you do is simulate this request with urllib2.
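The snippet above is Python 2 (urllib/urllib2). A rough Python 3 equivalent, using only the standard library and the same placeholder URL and field name as above, looks like this:

# Python 3 version of the form submission above.
# The URL and the "name" field are placeholders; use the form's real
# "action" URL and input names instead.
from urllib.parse import urlencode
from urllib.request import urlopen

data = {"name": "name field"}
encoded_data = urlencode(data).encode("ascii")  # the POST body must be bytes

with urlopen("http://www.abc.com/messages.php?action=send", encoded_data) as response:
    print(response.read().decode("utf-8", errors="replace"))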

How to identify if a webpage is being loaded inside an iframe or directly into the browser window?

Browsers can block access to window.top due to the same-origin policy, and there are IE bugs as well. Here's the working code:

function inIframe() {
    try {
        return window.self !== window.top;
    } catch (e) {
        return true;
    }
}

top and self are both window objects (along with parent), so you're seeing if your window is the top window.

How to download a full webpage with a Python script?

The following implementation enables you to get the sub-HTML pages of a website. It can be developed further to get the other files you need. I set the depth variable so you can choose how many levels of sub-pages you want to parse.

import urllib2
from BeautifulSoup import *
from urlparse import urljoin


def crawl(pages, depth=None):
    indexed_url = []  # a list for the main and sub-HTML websites in the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except:
                    print "Could not open %s" % page
                    continue
                soup = BeautifulSoup(c.read())
                links = soup('a')  # finding all the sub_links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]
                        if url[0:4] == 'http':
                            indexed_url.append(url)
        pages = indexed_url
    return indexed_url


pagelist=["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=2)
print urls

Python 3 version, 2019. May this save somebody some time:

#!/usr/bin/env python


import urllib.request as urllib2
from bs4 import *
from urllib.parse import urljoin


def crawl(pages, depth=None):
    indexed_url = []  # a list for the main and sub-HTML websites in the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except:
                    print("Could not open %s" % page)
                    continue
                soup = BeautifulSoup(c.read())
                links = soup('a')  # finding all the sub_links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]
                        if url[0:4] == 'http':
                            indexed_url.append(url)
        pages = indexed_url
    return indexed_url


pagelist=["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=1)
print( urls )
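The crawler above only collects URLs; it does not store anything. If you also want to write each page to disk, one possible follow-up step (Python 3, standard library only; the output directory and file naming scheme are just examples) is:

import os
import urllib.request
from urllib.parse import quote

def save_pages(urls, out_dir="downloaded_pages"):
    os.makedirs(out_dir, exist_ok=True)
    for url in urls:
        try:
            with urllib.request.urlopen(url) as response:
                html = response.read()
        except Exception as e:
            print("Could not download %s: %s" % (url, e))
            continue
        # Turn the URL into a (roughly) filesystem-safe name.
        filename = quote(url, safe="") + ".html"
        with open(os.path.join(out_dir, filename), "wb") as f:
            f.write(html)

save_pages(urls)  # "urls" is the list returned by crawl() above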

Can a website detect when you are using Selenium with chromedriver?

Replacing cdc_ string

You can use vim or perl to replace the cdc_ string in chromedriver. See the answer by @Erti-Chris Eelmaa to learn more about that string and how it's a detection point.

Using vim or perl prevents you from having to recompile source code or use a hex-editor.

Make sure to make a copy of the original chromedriver before attempting to edit it.

Our goal is to alter the cdc_ string, which looks something like $cdc_lasutopfhvcZLmcfl.

The methods below were tested on chromedriver version 2.41.578706.



Using Vim

vim /path/to/chromedriver

After running the line above, you'll probably see a bunch of gibberish. Do the following:

  1. Replace all instances of cdc_ with dog_ by typing :%s/cdc_/dog_/g.
    • dog_ is just an example. You can choose anything as long as it has the same number of characters as the search string (e.g., cdc_); otherwise chromedriver will fail.
  2. To save the changes and quit, type :wq! and press return.
    • If you need to quit without saving changes, type :q! and press return.


Using Perl

The line below replaces all cdc_ occurrences with dog_. Credit to Vic Seedoubleyew:

perl -pi -e 's/cdc_/dog_/g' /path/to/chromedriver

Make sure that the replacement string (e.g., dog_) has the same number of characters as the search string (e.g., cdc_), otherwise the chromedriver will fail.
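If neither vim nor perl is at hand, the same same-length replacement can also be done with a few lines of Python (the path is a placeholder, and the original binary should be backed up first, as noted above):

# Same-length binary replacement of "cdc_" inside the chromedriver executable.
# Back up the original binary first; the path below is a placeholder.
with open("/path/to/chromedriver", "rb") as f:
    data = f.read()

print("occurrences found:", data.count(b"cdc_"))
patched = data.replace(b"cdc_", b"dog_")  # replacement must have the same length

with open("/path/to/chromedriver", "wb") as f:
    f.write(patched)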



Wrapping Up

To verify that all occurrences of cdc_ were replaced:

grep "cdc_" /path/to/chromedriver

If no output was returned, the replacement was successful.

Go to the altered chromedriver and double click on it. A terminal window should open up. If you don't see killed in the output, you've successfully altered the driver.

Make sure that the name of the altered chromedriver binary is chromedriver, and that the original binary is either moved from its original location or renamed.



My Experience With This Method

I was previously being detected on a website while trying to log in, but after replacing cdc_ with an equal sized string, I was able to log in. Like others have said though, if you've already been detected, you might get blocked for a plethora of other reasons even after using this method. So you may have to try accessing the site that was detecting you using a VPN, different network, etc.

How to tell if a web application is using ReactJs

Try the snippet below. Thanks to rambabusaravanan for the examples listed for each site.

if (!!window.React ||
    !!document.querySelector('[data-reactroot], [data-reactid]'))
  console.log('React.js');

if (!!window.angular ||
    !!document.querySelector('.ng-binding, [ng-app], [data-ng-app], [ng-controller], [data-ng-controller], [ng-repeat], [data-ng-repeat]') ||
    !!document.querySelector('script[src*="angular.js"], script[src*="angular.min.js"]'))
  console.log('Angular.js');

if (!!window.Backbone) console.log('Backbone.js');
if (!!window.Ember) console.log('Ember.js');
if (!!window.Vue) console.log('Vue.js');
if (!!window.Meteor) console.log('Meteor.js');
if (!!window.Zepto) console.log('Zepto.js');
if (!!window.jQuery) console.log('jQuery.js');



