Web Scraper 4 10 2010





Overview

What if you had an idea for an ecological study, but the data you needed wasn't available to you? What if you wanted to validate one of your measures by comparing your estimates to external sources? What do you do?

Well, for one, you could go and get the data online. Web scraping (web harvesting or web data extraction) is a computer software technique that allows you to extract information from websites. When you want to extract data from a document, you would copy and paste the elements you want. For a website, this is a little trickier because of the way the information is formatted and stored, typically as HTML code. Thus, scrapers work by parsing the HTML source code of a website in order to extract and retrieve specific elements within the page's code.

Description

Search engines use a specific type of scraper, called a web crawler or search bot, to crawl web pages and identify which sites they link to and what terms they use. By that definition, the first web scrapers were around in the early nineties.

Google and Facebook really brought scraping to another level. Google scraped the web to catalogue all of the information on the internet and make it accessible. Recently, Facebook has been using scrapers to help people find connections and fill out their social networks.

Legality

Well, that depends on what you think the meaning of 'legality' is. While early court precedents set a permissive tone for the unscrupulous scraping of content, recent rulings have shifted towards a more conservative approach. Generally, if you have to agree to terms of use, if the data is available for purchase, or if the data sits behind a login, you are treading in legally murky territory. Even if none of these caveats apply, you might still find yourself in hot water.

Ethics

Here are some general ethical issues to consider prior to scraping:

1) Respect the hosting site's wishes

Some websites may have instructions for bots and scrapers, outlining which elements can be scraped and which elements are off limits. These sites have robots.txt files that disallow scraping of particular content. Also, if you have to agree to any terms and conditions, be sure to read them thoroughly. Check if an API exists or if the data is otherwise available for download or sale.
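
If you want to check this from R before you start, here is a minimal sketch (the domain below is just a placeholder): read the site's robots.txt file directly and look at its Disallow rules.

# minimal sketch: read a site's robots.txt before scraping
# 'www.example.com' is a placeholder; substitute the site you intend to scrape
robots <- readLines('http://www.example.com/robots.txt')

# Disallow lines list the paths the site asks bots not to visit
robots[grepl('Disallow', robots)]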

2) Respect the hosting site's bandwidth

Hosting websites costs money, and scraping consumes bandwidth. If you are familiar with denial-of-service attacks, aggressive scraping or sending bots to a website has a similar effect. Write responsible programs that limit bandwidth use: wait a few seconds between requests, try to scrape during off-peak hours, and scrape only what you need.
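
A simple way to throttle requests in R is to pause between page downloads. The sketch below is illustrative only and assumes urls is a character vector of pages you intend to retrieve (as in the example application further down).

# minimal throttling sketch: 'urls' is assumed to be a character vector of page addresses
library(RCurl)

pages <- vector('list', length(urls))
for (i in seq_along(urls)) {
  pages[[i]] <- getURL(urls[i])
  Sys.sleep(2)  # wait two seconds between requests to limit load on the host
}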

3) Respect the law

Some call it theft; some call it legitimate business practice. The fact that you can access the data doesn't mean you may use it for your research. Some data is more sensitive than others; time-sensitive data, in particular, is in high demand. For instance, a successful bookmaker may want their lines visible to the betting public, but they obviously wouldn't want their competitors harvesting them. Read the terms of agreement if applicable, or accept the risks of a more subversive approach.

Example Application

The following is a brief example of scraping one-bedroom apartment listings in Manhattan using R. The code can easily be adapted for other apartment sizes, locations, and amenities by setting a different search filter on Naked Apartments and pasting the updated URL below.

1) Get the webpage URL

url <- 'http://www.nakedapartments.com/renter/listings/search?nids=23,211,6,21,203,191,194,18,24,76,204,205,10,14,195,1,5,25,93,206,22,17,207,13,155,16,72,2,9,20,19,73,7,208,209,192,8,74,210,11,4,3,26,212,12&aids=3&order=asc&sort=rent&page='

# set the maximum number of search result pages. Currently set at 800.

s <- as.character(seq(1,800,by=1))
urls <- paste0(url, s)

2) Scrape the lines of code

# load the libraries

library(RCurl)
library(XML)
library(stringr)

SOURCE <- getURL(urls, encoding = 'UTF-8') # specify the encoding when dealing with non-Latin characters

3) Parse the HTML code to isolate the data

PARSED <- htmlParse(SOURCE)

# price and neighborhood ('[PATH]' stands in for the XPath expression that selects the listing text)

listings <- xpathSApply(PARSED, '[PATH]', xmlValue)

# trim white space

listings <- str_trim(listings)
listings <- strsplit(listings, ', ')
tabs <- matrix(unlist(listings), ncol = 2, byrow = TRUE)
colnames(tabs) <- c('price', 'neighborhood')

# lat and long

lat <- (xpathSApply(PARSED, 'div[@id]/@data-latitude'))
long <- (xpathSApply(PARSED, 'div[@id]/@data-longitude'))
tabs1 <- cbind(tabs, lat, long)
row.names(tabs1) <- seq(nrow(tabs1))


4) Clean and put elements into a dataframe

mydf <- data.frame(tabs1)
lats <- as.numeric(tabs1[,3])
longs <- as.numeric(tabs1[,4])

lats[lats == 0] <- NA   # treat zero coordinates as missing
longs[longs == 0] <- NA

mydf[,3] <- lats
mydf[,4] <- longs

price <- mydf[,1]
price1 <- gsub('$', '', as.character(price), fixed=TRUE)
price2 <- gsub(',', '', as.character(price1), fixed=TRUE)
price3 <- as.numeric(price2)
mydf[,1] <- price3
head(mydf)

NEW <- mydf[complete.cases(mydf),]
table(complete.cases(NEW))

# mean price by neighborhood
dat <- tapply(NEW$price, NEW$neighborhood, mean)
p <- as.matrix(dat)
p

# neighborhoods ordered from least to most expensive
p[order(p[,1]),]

Readings

Textbooks & Chapters

HANRETTY, C. 2013. Scraping the web for arts and humanities.

Articles


NAN, X. Web scraping with R. In: ROAD2STAT, ed. 6th China R Conference, 2013, Beijing.

LEE, B. K. 2010. Epidemiologic research and Web 2.0 – the user-driven Web. Epidemiology, 21, 760-3.

SIGNORINI, A., SEGRE, A. M. & POLGREEN, P. M. 2011. The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS One, 6, e19467.

CUNNINGHAM, J. A. 2012. Using Twitter to measure behavior patterns. Epidemiology, 23, 764-5.

CHEW, C. & EYSENBACH, G. 2010. Pandemics in the age of Twitter: content analysis of Tweets during the 2009 H1N1 outbreak. PLoS One, 5, e14118.

[On ethics: Screen scraping: how to profit from your rival's data]
http://www.bbc.co.uk/news/technology-23988890

[On ethics: Depends on what the meaning of the word 'illegal' means]
http://www.distilnetworks.com/is-web-scraping-illegal-depends-on-what-the-meaning-of-the-word-is-is

[On ethics – Felony charges for screen scraper]
http://www.forbes.com/sites/andygreenberg/2012/11/21/security-researchers-cry-foul-over-conviction-of-att-ipad-hacker/

[Hartley Brody on web scraping]
http://blog.hartleybrody.com/web-scraping/

[Programming with humanists: Reflections on raising an army of hacker-scholars]
http://openbookpublishers.com/htmlreader/DHP/chap09.html#ch09

Websites

[Charles DiMaggio on Web Scraping]
http://www.columbia.edu/~cjd11/charles_dimaggio/DIRE/styled-4/styled-6/code-13/

[Web scraping basics – Part I of III]
http://www.r-bloggers.com/web-scraping-in-r/

[Scraping Google Scholar]
http://www.r-bloggers.com/web-scraper-for-google-scholar-updated


[Commercial website for scrapers]
https://scraperwiki.com/

[Scrapy, an open-source web scraping framework]
http://scrapy.org/

Courses

A two-day EPIC course covers the Digital Acquisition of Big Data

BARBERA, P. 2013. NYU Politics Data Lab Workshop: Scraping Twitter and Web Data Using R. Department of Politics, New York University.

STARKWEATHER, J. 2013. Five easy steps for scraping data from web pages. Benchmarks RSS Matters.


Web Scraping Can Be Ugly


Depending on which websites you want to scrape, the process can be involved and quite tedious. Many websites are very much aware that people are scraping them, so they offer Application Programming Interfaces (APIs) to make requests for information easier for the user and easier for the server administrators to control access. Most of the time, the user must apply for a 'key' to gain access.

For premium sites, the key costs money. Some sites, like Google and Wunderground (a popular weather site), allow some number of free accesses before they start charging you. Even so, the results are typically returned in XML or JSON, which then requires you to parse the result to get the information you want. In the best case, there is an R package that wraps the API calls and the parsing and returns lists or data frames.
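
As an illustration of the JSON case, the sketch below uses the jsonlite package; the endpoint and the key value are placeholders for whatever API and credentials your target site actually provides.

# minimal sketch of querying a JSON API from R
# the endpoint and 'YOUR_KEY' are placeholders, not a real service
library(jsonlite)

resp <- fromJSON('http://api.example.com/v1/listings?city=NYC&key=YOUR_KEY')
str(resp)  # the parsed result is a list or data frame, depending on the JSON structure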

Here is a summary:

  • First, always try to find an R package that will access a site (e.g. New York Times, Wunderground, PubMed). These packages (e.g. omdbapi, easyPubMed, RBitCoin, rtimes) provide a programmatic search interface and return data frames with little to no effort on your part.

  • If no package exists, then hopefully there is an API that allows you to query the website and get results back in JSON or XML. I prefer JSON because it's 'easier', and the packages for parsing JSON return lists, which are native data structures in R, so you can easily turn the results into data frames. You will usually use the rvest package in conjunction with the XML and RJSONIO packages.

  • If the website doesn't have an API, then you will need to scrape text. This isn't hard, but it is tedious. You will need to use rvest to parse HTML elements (a minimal sketch follows this list). If you want to parse multiple pages, then you will need to use rvest to move to the other pages and possibly fill out forms. If there is a lot of JavaScript, then you might need to use RSelenium to programmatically manage the web page.
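
To make the last case concrete, here is a minimal rvest sketch; the URL and the CSS selector ('td.price') are placeholders you would replace with the page and elements you actually want.

# minimal rvest sketch: the URL and CSS selector below are placeholders
library(rvest)

page <- read_html('http://www.example.com/listings')
prices <- html_text(html_nodes(page, 'td.price'))  # pull the text of the matching elements
head(prices)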




