Storing Specific Xml Node Values with R's Xmleventparse

R: xmlEventParse with Large, Varying-node XML Input and Conversion to Data Frame

Setup a function that will create a temp storage area for our element data as well as a function that will be called every time a is found.

library(XML)

uid_traverse <- function() {

# we'll store them as character vectors and then make a data frame out of them.
# this is likely one of the cheapest & fastest methods despite growing a vector
# inch by inch. You can pre-allocate space and modify this idiom accordingly
# for another speedup.

uids <- c()
refs <- c()

REC <- function(x) {

uid <- xpathSApply(x, "//UID", xmlValue)
ref <- xpathSApply(x, "//reference/uid", xmlValue)

if (length(uid) > 0) {

if (length(ref) == 0) {

uids <<- c(uids, uid)
refs <<- c(refs, NA_character_)

} else {

uids <<- c(uids, rep(uid, length(ref)))
refs <<- c(refs, ref)

}

}

}

# we return a named list with the element handler and another
# function that turns the vectors into a data frame

list(
REC = REC,
uid_df = function() {
data.frame(uid = uids, ref = refs, stringsAsFactors = FALSE)
}
)

}

We need one instance of this function.

uid_f <- uid_traverse()

Now, we call xmlEventParse() and give it our function, using invisible() since we don’t need what xmlEventParse() returns but just want the side-effects:

invisible(
xmlEventParse(
file = path.expand("~/data/so.xml"),
branches = uid_f["REC"])
)

And, we see the results:

uid_f$uid_df()
## uid ref
## 1 ABCD123 ABCD2345
## 2 ABCD123 ABCD3456
## 3 ABCD123 ABCD4567
## 4 XYZ0987 <NA>

Combine values in huge XML-files

We'll use the XML package

library(XML)

and create a closure that contains a function to handle the 'SCHOOL' node, as well as two helper functions to retrieve results when done. The SCHOOL function is invoked on each SCHOOL node. If it finds a hockey team, it uses the /SCHOOL/NAME/text() as a 'key', and the /SCHOOL/TEAMS/HOCKEY/text() and //STUDENT/text() (or /SCHOOL/GRADES/STUDENT/text()) as values. A message is printed for every 100 (by default) schools with hockey teams, so that there's some indication of progress. The 'get' function is used after the fact to retrieve the result.

teams <- function(progress=1000) {
res <- new.env(parent=emptyenv()) # for results
it <- 0L # iterator -- nodes visited
list(SCHOOL=function(elt) {
## handle 'SCHOOL' nodes
if (getNodeSet(elt, "not(/SCHOOL/TEAMS/HOCKEY)"))
## early exit -- no hockey team
return(NULL)
it <<- it + 1L
if (it %% progress == 0L)
message(it)
school <- getNodeSet(elt, "string(/SCHOOL/NAME/text())") # 'key'
res[[school]] <-
list(team=getNodeSet(elt,
"normalize-space(/SCHOOL/TEAMS/HOCKEY/text())"),
students= xpathSApply(elt, "//STUDENT", xmlValue))
}, getres = function() {
## retrieve the 'res' environment when done
res
}, get=function() {
## retrieve 'res' environment as data.frame
school <- ls(res)
team <- unlist(eapply(res, "[[", "team"), use.names=FALSE)
student <- eapply(res, "[[", "students")
len <- sapply(student, length)
data.frame(school=rep(school, len), team=rep(team, len),
student=unlist(student, use.names=FALSE))
})
}

We use the function as

branches <- teams()
xmlEventParse("event.xml", handlers=NULL, branches=branches)
branches$get()

R: Memory Management during xmlEventParse of Huge ( 20GB) files

Here is an example, we have a launch script invoke.sh, that calls an R Script and passes the url and filename as parameters... In this case, I had previously downloaded the test file medsamp2015.xml and put in the ./data directory.

  • My sense would be to create a loop in the invoke.sh script and iterate through the list of target file names. For each file you invoke an R instance, download it, process the file and move on to the next.

Caveat: I didn't check or change your function against any other download files and formats. I would turn off the printing of the output by removing the print() wrapper on line 62.

print( cbind(c(rep(v1, length(v2))), v2))
  • See: runtime.txt for print out.
  • The output .csv files are placed in the ./data directory.

Note: This is a derivative of a previous answer provided by me on this subject:
R memory not released in Windows. I hope it helps by way of example.

Launch Script

  1 #!/usr/local/bin/bash -x
2
3 R --no-save -q --slave < ./47162861.R --args "https://www.nlm.nih.gov/databases/dtd" "medsamp2015.xml"

R File - 47162861.R

# Set working directory

projectDir <- "~/dev/stackoverflow/47162861"
setwd(projectDir)

# -----------------------------------------------------------------------------
# Load required Packages...
requiredPackages <- c("XML")

ipak <- function(pkg) {
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
}

ipak(requiredPackages)

# -----------------------------------------------------------------------------
# Load required Files
# trailingOnly=TRUE means that only your arguments are returned
args <- commandArgs(trailingOnly = TRUE)

if ( length(args) != 0 ) {
dataDir <- file.path(projectDir,"data")
fileUrl = args[1]
fileName = args[2]
} else {
dataDir <- file.path(projectDir,"data")
fileUrl <- "https://www.nlm.nih.gov/databases/dtd"
fileName <- "medsamp2015.xml"
}

# -----------------------------------------------------------------------------
# Download file

# Does the directory Exist? If it does'nt create it
if (!file.exists(dataDir)) {
dir.create(dataDir)
}

# Now we check if we have downloaded the data already if not we download it

if (!file.exists(file.path(dataDir, fileName))) {
download.file(fileUrl, file.path(dataDir, fileName), method = "wget")
}

# -----------------------------------------------------------------------------
# Now we extrat the data

tempdat <- data.frame(pmid = as.numeric(), lname = character(),
stringsAsFactors = FALSE)
cnt <- 1

branchFunction <- function() {
func <- function(x, ...) {
v1 <- xpathSApply(x, path = "//PMID", xmlValue)
v2 <- xpathSApply(x, path = "//Author/LastName", xmlValue)
print(cbind(c(rep(v1, length(v2))), v2))

# below is where I store/write the temp data along the way
# but even without doing this, memory is used (even after
# clearing)

tempdat <<- rbind(tempdat, cbind(c(rep(v1, length(v2))),
v2))
if (nrow(tempdat) > 1000) {
outname <- file.path(dataDir, paste0(cnt, ".csv")) # Create FileName
write.csv(tempdat, outname, row.names = F) # Write File to created directory
tempdat <<- data.frame(pmid = as.numeric(), lname = character(),
stringsAsFactors = FALSE)
cnt <<- cnt + 1
}
}
list(MedlineCitation = func)
}

myfunctions <- branchFunction()

# -----------------------------------------------------------------------------
# RUN
xmlEventParse(file = file.path(dataDir, fileName),
handlers = NULL,
branches = myfunctions)

Test File and output

~/dev/stackoverflow/47162861/data/medsamp2015.xml

$ ll                                                            
total 2128
drwxr-xr-x@ 7 hidden staff 238B Nov 10 11:05 .
drwxr-xr-x@ 9 hidden staff 306B Nov 10 11:11 ..
-rw-r--r--@ 1 hidden staff 32K Nov 10 11:12 1.csv
-rw-r--r--@ 1 hidden staff 20K Nov 10 11:12 2.csv
-rw-r--r--@ 1 hidden staff 23K Nov 10 11:12 3.csv
-rw-r--r--@ 1 hidden staff 37K Nov 10 11:12 4.csv
-rw-r--r--@ 1 hidden staff 942K Nov 10 11:05 medsamp2015.xml

Runtime Output

> ./invoke.sh > runtime.txt
+ R --no-save -q --slave --args https://www.nlm.nih.gov/databases/dtd medsamp2015.xml
Loading required package: XML

File: runtime.txt

How to read large (~20 GB) xml file in R?

In XML package the xmlEventParse function implements SAX (reading XML and calling your function handlers). If your XML is simple enough (repeating elements inside one root element), you can use branches parameter to define function(s) for every element.

Example:

MedlineCitation = function(x, ...) {
#This is a "branch" function
#x is a XML node - everything inside element <MedlineCitation>
# find element <ArticleTitle> inside and print it:
ns <- getNodeSet(x,path = "//ArticleTitle")
value <- xmlValue(ns[[1]])
print(value)
}

Call XML parsing:

xmlEventParse(
file = "http://www.nlm.nih.gov/databases/dtd/medsamp2015.xml",
handlers = NULL,
branches = list(MedlineCitation = MedlineCitation)
)

Solution with closure:

Like in Martin Morgan, Storing-specific-xml-node-values-with-rs-xmleventparse:

branchFunction <- function() {
store <- new.env()
func <- function(x, ...) {
ns <- getNodeSet(x, path = "//ArticleTitle")
value <- xmlValue(ns[[1]])
print(value)
# if storing something ...
# store[[some_key]] <- some_value
}
getStore <- function() { as.list(store) }
list(MedlineCitation = func, getStore=getStore)
}

myfunctions <- branchFunction()

xmlEventParse(
file = "medsamp2015.xml",
handlers = NULL,
branches = myfunctions
)

#to see what is inside
myfunctions$getStore()


Related Topics



Leave a reply



Submit