R: xmlEventParse with Large, Varying-node XML Input and Conversion to Data Frame
Setup a function that will create a temp storage area for our element data as well as a function that will be called every time a is found.
library(XML)
uid_traverse <- function() {
# we'll store them as character vectors and then make a data frame out of them.
# this is likely one of the cheapest & fastest methods despite growing a vector
# inch by inch. You can pre-allocate space and modify this idiom accordingly
# for another speedup.
uids <- c()
refs <- c()
REC <- function(x) {
uid <- xpathSApply(x, "//UID", xmlValue)
ref <- xpathSApply(x, "//reference/uid", xmlValue)
if (length(uid) > 0) {
if (length(ref) == 0) {
uids <<- c(uids, uid)
refs <<- c(refs, NA_character_)
} else {
uids <<- c(uids, rep(uid, length(ref)))
refs <<- c(refs, ref)
}
}
}
# we return a named list with the element handler and another
# function that turns the vectors into a data frame
list(
REC = REC,
uid_df = function() {
data.frame(uid = uids, ref = refs, stringsAsFactors = FALSE)
}
)
}
We need one instance of this function.
uid_f <- uid_traverse()
Now, we call xmlEventParse() and give it our function, using invisible() since we don’t need what xmlEventParse() returns but just want the side-effects:
invisible(
xmlEventParse(
file = path.expand("~/data/so.xml"),
branches = uid_f["REC"])
)
And, we see the results:
uid_f$uid_df()
## uid ref
## 1 ABCD123 ABCD2345
## 2 ABCD123 ABCD3456
## 3 ABCD123 ABCD4567
## 4 XYZ0987 <NA>
Combine values in huge XML-files
We'll use the XML package
library(XML)
and create a closure that contains a function to handle the 'SCHOOL' node, as well as two helper functions to retrieve results when done. The SCHOOL function is invoked on each SCHOOL node. If it finds a hockey team, it uses the /SCHOOL/NAME/text() as a 'key', and the /SCHOOL/TEAMS/HOCKEY/text() and //STUDENT/text() (or /SCHOOL/GRADES/STUDENT/text()) as values. A message is printed for every 100 (by default) schools with hockey teams, so that there's some indication of progress. The 'get' function is used after the fact to retrieve the result.
teams <- function(progress=1000) {
res <- new.env(parent=emptyenv()) # for results
it <- 0L # iterator -- nodes visited
list(SCHOOL=function(elt) {
## handle 'SCHOOL' nodes
if (getNodeSet(elt, "not(/SCHOOL/TEAMS/HOCKEY)"))
## early exit -- no hockey team
return(NULL)
it <<- it + 1L
if (it %% progress == 0L)
message(it)
school <- getNodeSet(elt, "string(/SCHOOL/NAME/text())") # 'key'
res[[school]] <-
list(team=getNodeSet(elt,
"normalize-space(/SCHOOL/TEAMS/HOCKEY/text())"),
students= xpathSApply(elt, "//STUDENT", xmlValue))
}, getres = function() {
## retrieve the 'res' environment when done
res
}, get=function() {
## retrieve 'res' environment as data.frame
school <- ls(res)
team <- unlist(eapply(res, "[[", "team"), use.names=FALSE)
student <- eapply(res, "[[", "students")
len <- sapply(student, length)
data.frame(school=rep(school, len), team=rep(team, len),
student=unlist(student, use.names=FALSE))
})
}
We use the function as
branches <- teams()
xmlEventParse("event.xml", handlers=NULL, branches=branches)
branches$get()
R: Memory Management during xmlEventParse of Huge ( 20GB) files
Here is an example, we have a launch script invoke.sh
, that calls an R Script and passes the url and filename as parameters... In this case, I had previously downloaded the test file medsamp2015.xml and put in the ./data
directory.
- My sense would be to create a loop in the
invoke.sh
script and iterate through the list of target file names. For each file you invoke an R instance, download it, process the file and move on to the next.
Caveat: I didn't check or change your function against any other download files and formats. I would turn off the printing of the output by removing the print() wrapper on line 62.
print( cbind(c(rep(v1, length(v2))), v2))
- See: runtime.txt for print out.
- The output
.csv
files are placed in the./data
directory.
Note: This is a derivative of a previous answer provided by me on this subject:
R memory not released in Windows. I hope it helps by way of example.
Launch Script
1 #!/usr/local/bin/bash -x
2
3 R --no-save -q --slave < ./47162861.R --args "https://www.nlm.nih.gov/databases/dtd" "medsamp2015.xml"
R File - 47162861.R
# Set working directory
projectDir <- "~/dev/stackoverflow/47162861"
setwd(projectDir)
# -----------------------------------------------------------------------------
# Load required Packages...
requiredPackages <- c("XML")
ipak <- function(pkg) {
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
}
ipak(requiredPackages)
# -----------------------------------------------------------------------------
# Load required Files
# trailingOnly=TRUE means that only your arguments are returned
args <- commandArgs(trailingOnly = TRUE)
if ( length(args) != 0 ) {
dataDir <- file.path(projectDir,"data")
fileUrl = args[1]
fileName = args[2]
} else {
dataDir <- file.path(projectDir,"data")
fileUrl <- "https://www.nlm.nih.gov/databases/dtd"
fileName <- "medsamp2015.xml"
}
# -----------------------------------------------------------------------------
# Download file
# Does the directory Exist? If it does'nt create it
if (!file.exists(dataDir)) {
dir.create(dataDir)
}
# Now we check if we have downloaded the data already if not we download it
if (!file.exists(file.path(dataDir, fileName))) {
download.file(fileUrl, file.path(dataDir, fileName), method = "wget")
}
# -----------------------------------------------------------------------------
# Now we extrat the data
tempdat <- data.frame(pmid = as.numeric(), lname = character(),
stringsAsFactors = FALSE)
cnt <- 1
branchFunction <- function() {
func <- function(x, ...) {
v1 <- xpathSApply(x, path = "//PMID", xmlValue)
v2 <- xpathSApply(x, path = "//Author/LastName", xmlValue)
print(cbind(c(rep(v1, length(v2))), v2))
# below is where I store/write the temp data along the way
# but even without doing this, memory is used (even after
# clearing)
tempdat <<- rbind(tempdat, cbind(c(rep(v1, length(v2))),
v2))
if (nrow(tempdat) > 1000) {
outname <- file.path(dataDir, paste0(cnt, ".csv")) # Create FileName
write.csv(tempdat, outname, row.names = F) # Write File to created directory
tempdat <<- data.frame(pmid = as.numeric(), lname = character(),
stringsAsFactors = FALSE)
cnt <<- cnt + 1
}
}
list(MedlineCitation = func)
}
myfunctions <- branchFunction()
# -----------------------------------------------------------------------------
# RUN
xmlEventParse(file = file.path(dataDir, fileName),
handlers = NULL,
branches = myfunctions)
Test File and output
~/dev/stackoverflow/47162861/data/medsamp2015.xml
$ ll
total 2128
drwxr-xr-x@ 7 hidden staff 238B Nov 10 11:05 .
drwxr-xr-x@ 9 hidden staff 306B Nov 10 11:11 ..
-rw-r--r--@ 1 hidden staff 32K Nov 10 11:12 1.csv
-rw-r--r--@ 1 hidden staff 20K Nov 10 11:12 2.csv
-rw-r--r--@ 1 hidden staff 23K Nov 10 11:12 3.csv
-rw-r--r--@ 1 hidden staff 37K Nov 10 11:12 4.csv
-rw-r--r--@ 1 hidden staff 942K Nov 10 11:05 medsamp2015.xml
Runtime Output
> ./invoke.sh > runtime.txt
+ R --no-save -q --slave --args https://www.nlm.nih.gov/databases/dtd medsamp2015.xml
Loading required package: XML
File: runtime.txt
How to read large (~20 GB) xml file in R?
In XML package the xmlEventParse
function implements SAX (reading XML and calling your function handlers). If your XML is simple enough (repeating elements inside one root element), you can use branches
parameter to define function(s) for every element.
Example:
MedlineCitation = function(x, ...) {
#This is a "branch" function
#x is a XML node - everything inside element <MedlineCitation>
# find element <ArticleTitle> inside and print it:
ns <- getNodeSet(x,path = "//ArticleTitle")
value <- xmlValue(ns[[1]])
print(value)
}
Call XML parsing:
xmlEventParse(
file = "http://www.nlm.nih.gov/databases/dtd/medsamp2015.xml",
handlers = NULL,
branches = list(MedlineCitation = MedlineCitation)
)
Solution with closure:
Like in Martin Morgan, Storing-specific-xml-node-values-with-rs-xmleventparse:
branchFunction <- function() {
store <- new.env()
func <- function(x, ...) {
ns <- getNodeSet(x, path = "//ArticleTitle")
value <- xmlValue(ns[[1]])
print(value)
# if storing something ...
# store[[some_key]] <- some_value
}
getStore <- function() { as.list(store) }
list(MedlineCitation = func, getStore=getStore)
}
myfunctions <- branchFunction()
xmlEventParse(
file = "medsamp2015.xml",
handlers = NULL,
branches = myfunctions
)
#to see what is inside
myfunctions$getStore()
Related Topics
Dplyr: Put Count Occurrences into New Variable
Create an Expression from a Function for Data.Table to Eval
Ggplot2 Theme with No Axes or Grid
How to Print R Variables in Middle of String
Clipping Raster Using Shapefile in R, But Keeping the Geometry of the Shapefile
How to Skip an Error in a Loop
Efficient Alternatives to Merge for Larger Data.Frames R
How to Install R Package from Private Repo Using Devtools Install_Github
How to Combine Multiple Ggplot2 Elements into the Return of a Function
Can't Load X11 in R After Os X Yosemite Upgrade
Create Convex Hull Polygon from Points and Save as Shapefile
Get Map with Specified Boundary Coordinates
Create Tables with Conditional Formatting with Rmarkdown + Knitr