How to scan a website (or page) for info, and bring it into my program?
Use an HTML parser like Jsoup. I prefer it over the other HTML parsers available in Java because it supports jQuery-like CSS selectors. Also, its class representing a list of nodes, Elements, implements Iterable, so you can iterate over it in an enhanced for loop (there's no need to hassle with verbose Node and NodeList like classes in the average Java DOM parser).
Here's a basic kickoff example (just put the latest Jsoup JAR file on the classpath):
package com.stackoverflow.q2835505;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws Exception {
        String url = "https://stackoverflow.com/questions/2835505";
        Document document = Jsoup.connect(url).get();

        String question = document.select("#question .post-text").text();
        System.out.println("Question: " + question);

        Elements answerers = document.select("#answers .user-details a");
        for (Element answerer : answerers) {
            System.out.println("Answerer: " + answerer.text());
        }
    }

}
As you might have guessed, this prints your own question and the names of all answerers.
Program to scan my website and find all the pages which link to an external website
In the end, the easiest way was an SQL query against the site's database, so when you have a similar problem, check whether it is doable via SQL before reaching for external programs.
Thanks to everyone for the suggestions.
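For illustration, here is a minimal Python sketch of that approach, assuming a CMS-style database where each page's HTML is stored in a table; the database file, table name, and column names are all hypothetical:

import sqlite3

# Hypothetical schema: a 'pages' table with 'url' and 'html' columns.
conn = sqlite3.connect("site.db")
rows = conn.execute(
    "SELECT url FROM pages WHERE html LIKE ?",
    ('%href="http%',)  # crude match for absolute links; filter out your own domain as needed
).fetchall()
for (url,) in rows:
    print(url)
conn.close()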
Using Python to sign into website, fill in a form, then sign out
import urllib
import urllib2

name = "name field"
data = {
    "name": name
}
encoded_data = urllib.urlencode(data)
content = urllib2.urlopen("http://www.abc.com/messages.php?action=send",
                          encoded_data)
print content.readlines()
Just replace http://www.abc.com/messages.php?action=send with the URL your form is being submitted to.
Reply to your comment: if that URL is only the URL where your form is located, and you need to do this for just one website, look at the source code of the page, find
<form method="POST" action="some_address.php">
and pass this address as the parameter to urllib2.urlopen.
You also have to realise what the submit button does: it just sends an HTTP request to the URL defined by action in the form. So what you do is simulate this request with urllib2.
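If you're on Python 3, where urllib and urllib2 were merged into urllib.request and urllib.parse, a minimal sketch of the same request looks like this (the URL and the name field are placeholders carried over from the example above):

from urllib.parse import urlencode
from urllib.request import urlopen

# Field names must match the input names in the form's HTML (placeholder values here).
data = urlencode({"name": "name field"}).encode("ascii")

# Passing a data argument makes urlopen issue a POST to the form's action URL.
with urlopen("http://www.abc.com/messages.php?action=send", data) as response:
    print(response.read().decode("utf-8", errors="replace"))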
How to identify if a webpage is being loaded inside an iframe or directly into the browser window?
Browsers can block access to window.top due to the same-origin policy. There are also IE bugs. Here's the working code:
function inIframe () {
    try {
        return window.self !== window.top;
    } catch (e) {
        return true;
    }
}
top and self are both window objects (along with parent), so you're seeing if your window is the top window.
How to download a full webpage with a Python script?
The following implementation enables you to get the sub-HTML pages. It can be developed further to fetch the other files you need. I added a depth variable so you can set the maximum number of link levels you want to crawl.
import urllib2
from BeautifulSoup import *
from urlparse import urljoin

def crawl(pages, depth=None):
    indexed_url = []  # a list for the main and sub-HTML pages in the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except:
                    print "Could not open %s" % page
                    continue
                soup = BeautifulSoup(c.read())
                links = soup('a')  # finding all the sub_links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]
                        if url[0:4] == 'http':
                            indexed_url.append(url)
        pages = indexed_url
    return indexed_url

pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=2)
print urls
Python 3 version, 2019. May this save somebody some time:
#!/usr/bin/env python3
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def crawl(pages, depth=None):
    indexed_url = []  # a list for the main and sub-HTML pages in the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urlopen(page)
                except Exception:
                    print("Could not open %s" % page)
                    continue
                soup = BeautifulSoup(c.read(), "html.parser")  # an explicit parser avoids a bs4 warning
                links = soup('a')  # finding all the sub_links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]  # drop the fragment part
                        if url[0:4] == 'http' and url not in indexed_url:
                            indexed_url.append(url)  # membership check avoids duplicates
        pages = list(indexed_url)  # copy, so the inner loop isn't mutated mid-iteration
    return indexed_url

pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=1)
print(urls)
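The crawler above only collects URLs. Since the question asks for a full download, here is a minimal follow-up sketch that saves each discovered page to disk; it reuses the urls list produced by the crawl() call above, and the "dump" directory name is just an example:

import os
from urllib.request import urlopen
from urllib.parse import urlparse

# Save each crawled page into a local 'dump' directory.
os.makedirs("dump", exist_ok=True)
for url in urls:  # 'urls' comes from the crawl() call above
    try:
        html = urlopen(url).read()
    except Exception:
        continue  # skip pages that fail to load
    # Derive a crude file name from the URL path; adjust to your needs.
    name = urlparse(url).path.strip("/").replace("/", "_") or "index"
    with open(os.path.join("dump", name + ".html"), "wb") as f:
        f.write(html)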
Can a website detect when you are using Selenium with chromedriver?
Replacing the cdc_ string
You can use vim or perl to replace the cdc_ string in the chromedriver binary. See the answer by @Erti-Chris Eelmaa to learn more about that string and how it's a detection point.
Using vim or perl saves you from having to recompile source code or use a hex editor.
Make sure to make a copy of the original chromedriver before attempting to edit it.
Our goal is to alter the cdc_ string, which looks something like $cdc_lasutopfhvcZLmcfl.
The methods below were tested on chromedriver version 2.41.578706.
Using Vim
vim /path/to/chromedriver
After running the line above, you'll probably see a bunch of gibberish. Do the following:
- Replace all instances of cdc_ with dog_ by typing :%s/cdc_/dog_/g. dog_ is just an example; you can choose anything as long as it has the same number of characters as the search string (e.g., cdc_), otherwise chromedriver will fail.
- To save the changes and quit, type :wq! and press return.
- If you need to quit without saving changes, type :q! and press return.
Using Perl
The line below replaces all cdc_ occurrences with dog_. Credit to Vic Seedoubleyew:
perl -pi -e 's/cdc_/dog_/g' /path/to/chromedriver
Make sure that the replacement string (e.g., dog_) has the same number of characters as the search string (e.g., cdc_), otherwise chromedriver will fail.
Wrapping Up
To verify that all occurrences of cdc_ were replaced:
grep "cdc_" /path/to/chromedriver
If no output was returned, the replacement was successful.
Go to the altered chromedriver and double-click on it. A terminal window should open up. If you don't see killed in the output, you've successfully altered the driver.
Make sure that the name of the altered chromedriver binary is chromedriver, and that the original binary is either moved from its original location or renamed.
My Experience With This Method
I was previously being detected on a website while trying to log in, but after replacing cdc_ with an equal-sized string, I was able to log in. Like others have said though, if you've already been detected, you might get blocked for a plethora of other reasons even after using this method. So you may have to try accessing the site that was detecting you using a VPN, a different network, etc.
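If you'd rather script the edit than use vim or perl, here is a minimal Python sketch of the same byte replacement; the path is a placeholder, dog_ is just an example, and as above the replacement must be exactly as long as cdc_ (back up the binary first):

path = "/path/to/chromedriver"  # placeholder: point this at your chromedriver copy
old, new = b"cdc_", b"dog_"     # replacement must have the same length as the original
assert len(old) == len(new), "unequal lengths would corrupt the binary"

with open(path, "rb") as f:
    data = f.read()
with open(path, "wb") as f:
    f.write(data.replace(old, new))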
How to tell if a web application is using ReactJs
Try the snippet below (thanks to rambabusaravanan for the per-site examples):
if (!!window.React ||
    !!document.querySelector('[data-reactroot], [data-reactid]'))
  console.log('React.js');

if (!!window.angular ||
    !!document.querySelector('.ng-binding, [ng-app], [data-ng-app], [ng-controller], [data-ng-controller], [ng-repeat], [data-ng-repeat]') ||
    !!document.querySelector('script[src*="angular.js"], script[src*="angular.min.js"]'))
  console.log('Angular.js');

if (!!window.Backbone) console.log('Backbone.js');
if (!!window.Ember) console.log('Ember.js');
if (!!window.Vue) console.log('Vue.js');
if (!!window.Meteor) console.log('Meteor.js');
if (!!window.Zepto) console.log('Zepto.js');
if (!!window.jQuery) console.log('jQuery.js');