How can I get rid of the embedded NUL on the original vector?
Andres Mora
I am scraping an ASP.NET website.
This returns a raw vector (reporte_nacido) containing a tab-delimited text file:
reporte_nacido = postForm('https://xxxxx/WebSiteNDE/BirthsPages/FiltrosExcelNac.aspx',
                          .params = params,
                          curl = curl,
                          .opts = RCurl::curlOptions(ssl.verifypeer = FALSE, verbose = TRUE))
If I open the file in a text viewer, it looks like this: [screenshot not preserved]
Now I am trying to convert that raw vector to a character string in R, and I get the following error. I suspect the file downloaded from the server is corrupted somehow, and R is strict about that:
rawToChar(as.vector(unlist(reporte_nacido)))
Error in rawToChar(as.vector(unlist(reporte_nacido))) :
embedded nul in string: '\xfe\xff\0N\0\xda\0M\0E\0R\0O\0 \0C\0E\0R\0T\0I\0F\0I\0C\0A\0D\0O\0\t\0D\0E\0P\0A\0R\0T\0A\0M\0E\0N\0T\0O\0\t\0M\0U\0N\0I\0C\0I\0P\0I\0O\0\t\0A\0R\0E\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0I\0N\0S\0P\0E\0C\0C\0I\0O\0N\0 \0C\0O\0R\0R\0E\0G\0I\0M\0I\0E\0N\0T\0O\0 \0O\0 \0C\0A\0S\0E\0R\0I\0O\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0S\0I\0T\0I\0O\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0C\0\xd3\0D\0I\0G\0O\0 \0I\0N\0S\0T\0I\0T\0U\0C\0I\0\xd3\0N\0\t\0N\0O\0M\0B\0R\0E\0 \0I\0N\0S\0T\0I\0T\0U\0C\0I\0\xd3\0N\0\t\0S\0E\0X\0O\0\t\0P\0E\0S\0O\0 \0(\0G\0r\0a\0m\0o\0s\0)\0\t\0T\0A\0L\0L\0A\0 \0(\0C\0e\0n\0t\0\xed\0m\0e\0t\0r\0o\0s\0)\0\t\0F\0E\0C\0H\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0H\0O\0R\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0P\0A\0R\0T\0O\0 \0A\0T\0E\0N\0D\0I\0D\0O\0 \0P\0O\0R\0\t\0T\0I\0E\0M\0P\0O\0 \0D\0E\0 \0G\0E\0S\0T\0A\0C\0I\0\xd3\0N\0\t\0N\0\xda\0M\0E\0R\0O\0 \0C\0O\0N\0S\0U\0L\0T\0A\0S\0 \0P\0R\0E\0N\0A\0T\0A\0L\0E\0S\0\t\0T\0I\0P\0O\0 \0P\0A
Alan Cameron
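The file is not corrupted. The raw vector you get is text encoded as UTF-16: the leading bytes fe ff are a big-endian byte order mark, and the NUL bytes are the zero high byte of each two-byte character. You can convert it like this: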
library(stringi)
raw_vec <- as.vector(unlist(reporte_nacido))
decoded <- stri_encode(raw_vec, "UTF16")
decoded
#> [1] "NÚMERO CERTIFICADO\tDEPARTAMENTO\tMUNICIPIO\tAREA NACIMIENTO\tINSPECCION CORREGIMIENTO O CASERIO NACIMIENTO\tSITIO NACIMIENTO\tCÓDIGO INSTITUCIÓN\tNOMBRE INSTITUCIÓN\tSEXO\tPESO (Gramos)\tTALLA (Centímetros)\tFECHA NACIMIENTO\tHORA NACIMIENTO\tPARTO ATENDIDO POR\tTIEMPO DE GESTACIÓN\tNÚMERO CONSULTAS PRENATALES\tTIPO PA"
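If you would rather not depend on stringi, base R's iconv() should be able to do the same conversion, since it accepts a list of raw vectors as input. The following is only a minimal sketch: the short raw_vec here is a made-up stand-in for your real data (it encodes "A<TAB>B"), and I am assuming the data is big-endian, as the fe ff prefix in your error message suggests.

```r
# Hypothetical sample: "A\tB" as UTF-16 big-endian, preceded by the BOM fe ff.
# In the real case this would be as.vector(unlist(reporte_nacido)).
raw_vec <- as.raw(c(0xfe, 0xff, 0x00, 0x41, 0x00, 0x09, 0x00, 0x42))

# Check for the big-endian byte order mark and drop it if present,
# so that it does not end up as a stray character in the decoded string.
bom <- as.raw(c(0xfe, 0xff))
has_bom <- length(raw_vec) >= 2 && identical(raw_vec[1:2], bom)
payload <- if (has_bom) raw_vec[-(1:2)] else raw_vec

# iconv() takes a list of raw vectors, so embedded NUL bytes are not a problem.
decoded <- iconv(list(payload), from = "UTF-16BE", to = "UTF-8")
decoded
```

Stripping the byte order mark by hand and naming the byte order explicitly avoids relying on how a given platform's iconv treats a plain "UTF-16" label.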
The data seems to be tab-delimited rather than comma-separated, so you might want to read it like this:
read.table(text = decoded, sep = "\t", header = TRUE)
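Depending on what the data rows contain, you may want a few extra arguments. Here is a small sketch using read.delim(), which defaults to tab separators and a header row. The decoded string below is invented sample data, not your actual file, and check.names = FALSE and quote = "" are my own suggestions: the first keeps headers such as "PESO (Gramos)" unchanged, the second stops apostrophes or quotes inside fields from being treated as quoting characters.

```r
# Hypothetical stand-in for the decoded text (two columns, two data rows).
decoded <- "SEXO\tPESO (Gramos)\nFEMENINO\t3200\nMASCULINO\t2950"

births <- read.delim(
  text = decoded,
  check.names = FALSE,      # keep column names exactly as in the file
  quote = "",               # treat quote characters as ordinary text
  stringsAsFactors = FALSE
)

names(births)
births[["PESO (Gramos)"]]
```

With check.names = FALSE, columns whose names contain spaces or parentheses have to be accessed with [[ ]] or backticks rather than plain $.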