New bigbed_info() and bigwig_info() report header metadata without reading
any intervals. bigbed_info() returns the field counts and embedded autoSql
schema, making it possible to identify the BED variant a file holds before
reading it (a genuine BED12 has defined_field_count == 12). bigwig_info()
returns the version, zoom levels, chromosome count, and file-level summary
statistics (min/max/mean/std).
read_bigbed() now returns all BED columns for files with no embedded autoSql
schema (e.g. a bed12 written by bedToBigBed without -as). Previously such
files returned only chrom/start/end; the reader now falls back to the
field counts in the file header and names columns with the standard BED field
names (any extra bedN+ fields become generic fieldN character columns)
(#18). When a file has no embedded schema, read_bigbed() now emits a
message() noting that the column names were inferred rather than declared
by the file; silence it with suppressMessages().
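The fallback naming described above can be sketched as follows. This is a minimal illustration, not the package's actual code: infer_bed_names() is a hypothetical helper, the names after chrom/start/end follow the UCSC BED convention, and the numbering of the generic fieldN columns (here by 1-based column position) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Standard BED field names in positional order. The first three are written
// as chrom/start/end to match the column names mentioned in the entry above.
static const char* const kBedNames[12] = {
    "chrom",      "start",    "end",     "name",
    "score",      "strand",   "thickStart", "thickEnd",
    "itemRgb",    "blockCount", "blockSizes", "blockStarts"};

// Hypothetical helper: derive column names from the two header counts.
// field_count is the total number of columns; defined_field_count is how
// many of them are standard BED fields.
std::vector<std::string> infer_bed_names(std::size_t field_count,
                                         std::size_t defined_field_count) {
  assert(defined_field_count <= 12 && defined_field_count <= field_count);
  std::vector<std::string> names;
  names.reserve(field_count);
  for (std::size_t i = 0; i < field_count; ++i) {
    if (i < defined_field_count) {
      names.push_back(kBedNames[i]);
    } else {
      // Assumption: extra columns are numbered by 1-based column position.
      names.push_back("field" + std::to_string(i + 1));
    }
  }
  return names;
}
```

Under these assumptions, a bed12+2 file (14 fields, 12 of them defined) would get the twelve standard names followed by field13 and field14.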
Fix a "load of misaligned address" runtime error, reported by CRAN's gcc-san
(UBSan) check, when
reading a bigBed block that packs more than one record. In libBigWig's
bwValues.c, records are stored as three uint32_t fields followed by a
variable-length name, so every record after the first starts on an
unaligned offset; the fields are now read with memcpy instead of an
aligned uint32_t cast.
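The memcpy technique can be illustrated with a small standalone sketch. It is not the patched libBigWig code: BedRecord, load_u32() and parse_block() are hypothetical names, and the sketch assumes the block is already decompressed and stored in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

struct BedRecord {
  std::uint32_t chrom_id;
  std::uint32_t start;
  std::uint32_t end;
  std::string rest;  // variable-length, NUL-terminated in the block
};

// Read a uint32_t from any byte offset. memcpy has no alignment
// requirement, unlike dereferencing a cast such as *(const uint32_t*)p.
static std::uint32_t load_u32(const unsigned char* p) {
  std::uint32_t v;
  std::memcpy(&v, p, sizeof v);
  return v;
}

// Walk a block of packed records: 12 header bytes, then a NUL-terminated
// string. Records after the first generally start at unaligned offsets.
std::vector<BedRecord> parse_block(const unsigned char* buf, std::size_t len) {
  assert(buf != nullptr || len == 0);
  std::vector<BedRecord> out;
  std::size_t pos = 0;
  while (pos + 12 < len) {  // need the three fields plus at least the NUL
    BedRecord r;
    r.chrom_id = load_u32(buf + pos);
    r.start = load_u32(buf + pos + 4);
    r.end = load_u32(buf + pos + 8);
    pos += 12;
    const void* nul = std::memchr(buf + pos, '\0', len - pos);
    if (nul == nullptr) break;  // truncated record: stop rather than overrun
    std::size_t n = static_cast<std::size_t>(
        static_cast<const unsigned char*>(nul) - (buf + pos));
    r.rest.assign(reinterpret_cast<const char*>(buf + pos), n);
    pos += n + 1;  // skip the string and its terminator
    out.push_back(std::move(r));
  }
  return out;
}
```

Compilers typically lower a fixed-size memcpy like this to a single load on platforms that allow unaligned access, so the safer form should cost little or nothing.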
Multi-range queries now open the file once per call instead of re-opening it for every range. The per-range loop moved into C++, so querying many ranges (especially from a remote file, where each open re-fetches the headers) is substantially faster.
read_bigbed() no longer crashes on a bigBed file with no embedded autoSql
schema. bbGetSQL() returns NULL in that case, and constructing a
std::string from it was undefined behavior; such files now read back their
chrom/start/end columns with no extra typed fields.
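The guard against a NULL schema pointer can be sketched in a few lines. string_or_empty() is a hypothetical helper name, and the sketch does not call libBigWig; it only shows the general pattern of checking a C string before handing it to std::string.

```cpp
#include <cassert>
#include <string>

// Convert a possibly-NULL C string to std::string. Constructing
// std::string directly from a null pointer is undefined behavior, so the
// null case is mapped to an empty string instead.
std::string string_or_empty(const char* s) {
  return s != nullptr ? std::string(s) : std::string();
}

// Usage check: a missing schema becomes "", a present one is copied.
void check_null_schema_handling() {
  assert(string_or_empty(nullptr).empty());
  assert(string_or_empty("table bed") == "table bed");
}
```

A caller can then treat an empty string as "no embedded schema" and skip schema parsing entirely.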
The bigWig/bigBed readers now release the libBigWig file handle and read buffer when they error out (e.g. on an unreadable file or a failed interval query), rather than leaking them.
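One common way to get this behavior in C++ is to tie the handle to a std::unique_ptr with a custom deleter, so that it is released on every exit path, including exceptions. The sketch below uses a stand-in handle type (FakeHandle, fake_open(), fake_close() are all hypothetical) and is not necessarily how the package implements the fix.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Stand-in for a C library file handle; not libBigWig's real type.
struct FakeHandle {
  int* open_count;  // lets the example observe open/close pairing
};

FakeHandle* fake_open(int* open_count) {
  ++*open_count;
  return new FakeHandle{open_count};
}

void fake_close(FakeHandle* h) {
  assert(h != nullptr);
  --*h->open_count;
  delete h;
}

// Deleter that unique_ptr invokes when the owner goes out of scope.
struct HandleCloser {
  void operator()(FakeHandle* h) const { fake_close(h); }
};
using HandlePtr = std::unique_ptr<FakeHandle, HandleCloser>;

// Hypothetical reader: may throw on a failed query, but the handle is
// closed during stack unwinding either way.
int read_values(int* open_count, bool query_fails) {
  HandlePtr handle(fake_open(open_count));
  if (query_fails) {
    throw std::runtime_error("interval query failed");
  }
  return 42;  // placeholder result
}
```

The same pattern extends to a read buffer: give it its own owning wrapper so no error branch needs an explicit cleanup call.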
Fix a CRAN gcc-ASAN global-buffer-overflow reported when reading bigBed
files. The autoSql schema parser no longer uses std::regex (which tripped
an AddressSanitizer error inside libstdc++); it now parses the schema with
simple string operations.
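As an illustration of parsing autoSql field declarations with plain string operations, here is a minimal sketch. It is not the package's parser: Field, trim() and parse_autosql_fields() are hypothetical names, and the sketch assumes one declaration per line of the form "type name; comment", with the opening and closing parentheses on their own lines.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

struct Field {
  std::string type;
  std::string name;
};

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
  const char* ws = " \t\r\n";
  std::string::size_type b = s.find_first_not_of(ws);
  if (b == std::string::npos) return "";
  std::string::size_type e = s.find_last_not_of(ws);
  return s.substr(b, e - b + 1);
}

// Extract (type, name) pairs from the parenthesized body of a schema.
std::vector<Field> parse_autosql_fields(const std::string& schema) {
  std::vector<Field> fields;
  std::istringstream lines(schema);
  std::string line;
  bool in_body = false;
  while (std::getline(lines, line)) {
    line = trim(line);
    if (line.empty()) continue;
    if (!in_body) {
      // The body starts at a line beginning with '('; the quoted table
      // comment before it may itself contain parentheses.
      if (line[0] == '(') in_body = true;
      continue;
    }
    if (line[0] == ')') break;
    std::string::size_type semi = line.find(';');
    if (semi == std::string::npos) continue;  // not a declaration
    std::string decl = trim(line.substr(0, semi));
    std::string::size_type sp = decl.find_last_of(" \t");
    if (sp == std::string::npos) continue;  // no type/name separator
    assert(sp + 1 < decl.size());  // decl is trimmed, so a name follows
    fields.push_back({trim(decl.substr(0, sp)), decl.substr(sp + 1)});
  }
  return fields;
}
```

Splitting on the last whitespace before the semicolon keeps array types such as int[blockCount] intact as the type, which a naive split on the first space would also do but a looser pattern might not.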
read_bigwig() and read_bigbed() can now query multiple ranges in a single
call. Pass equal-length (or length-1, recycled) chrom, start, and end
vectors, or a GRanges of regions via chrom. For read_bigwig(as = "Rle"),
a multi-range query returns a named RleList with one element per range
(#18).
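The recycling rule for chrom, start, and end can be sketched as follows. This is an illustration under assumptions rather than the package's code: Range, recycle_to() and build_ranges() are hypothetical names, and the error message is invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct Range {
  std::string chrom;
  long start;
  long end;
};

// Recycle a length-1 vector to length n; pass length-n vectors through.
template <typename T>
std::vector<T> recycle_to(const std::vector<T>& v, std::size_t n) {
  if (v.size() == n) return v;
  if (v.size() == 1) return std::vector<T>(n, v[0]);
  throw std::invalid_argument(
      "chrom, start and end must have equal length or length 1");
}

// Combine the three argument vectors into one list of query ranges.
std::vector<Range> build_ranges(const std::vector<std::string>& chrom,
                                const std::vector<long>& start,
                                const std::vector<long>& end) {
  std::size_t n = std::max({chrom.size(), start.size(), end.size()});
  std::vector<std::string> c = recycle_to(chrom, n);
  std::vector<long> s = recycle_to(start, n);
  std::vector<long> e = recycle_to(end, n);
  assert(c.size() == n && s.size() == n && e.size() == n);
  std::vector<Range> out;
  out.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    out.push_back({c[i], s[i], e[i]});
  }
  return out;
}
```

With one chromosome name and two start/end pairs, this yields two ranges on the same chromosome; lengths such as 2 and 3 are rejected rather than partially recycled.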
read_bigwig() gains as = "Rle", returning a per-base run-length-encoded
vector spanning the queried range (an Rle for a single chromosome, or a
named RleList for several). Uncovered bases are set to the fill value
(default 0; use NA to mark them missing) (#18).
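How intervals turn into a per-base run-length encoding with a fill value can be sketched like this. The sketch is an assumption-laden illustration: Interval, Run and to_runs() are hypothetical, coordinates are 0-based half-open, and the input intervals are assumed sorted and non-overlapping.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Interval {
  long start;  // 0-based, inclusive
  long end;    // exclusive
  double value;
};

struct Run {
  long length;
  double value;
};

// Append a run, merging with the previous one when the value repeats.
// Note: NaN never compares equal, so adjacent NaN runs are not merged.
static void push_run(std::vector<Run>& runs, long length, double value) {
  if (length <= 0) return;
  if (!runs.empty() && runs.back().value == value) {
    runs.back().length += length;
  } else {
    runs.push_back({length, value});
  }
}

// Encode [qstart, qend) as runs, using fill for bases no interval covers.
std::vector<Run> to_runs(const std::vector<Interval>& ivs, long qstart,
                         long qend, double fill) {
  assert(qstart <= qend);
  std::vector<Run> runs;
  long pos = qstart;  // first base not yet encoded
  for (const Interval& iv : ivs) {
    long s = std::max(iv.start, pos);  // clip to the unencoded region
    long e = std::min(iv.end, qend);
    if (s >= e) continue;           // interval lies outside the range
    push_run(runs, s - pos, fill);  // gap before the interval
    push_run(runs, e - s, iv.value);
    pos = e;
  }
  push_run(runs, qend - pos, fill);  // trailing gap
  return runs;
}
```

The run lengths always sum to the width of the queried range, which is what makes the result a per-base vector rather than a list of covered intervals.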
Fix remote access to large bigWig/bigBed files. The HTTP Range header was
not being set, so servers returned the entire file, which crashed R or made
files larger than the read buffer fail to open (#18).
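For reference, an HTTP byte-range request names an inclusive first and last byte. The sketch below only formats such a header from an offset and a length; range_header() is a hypothetical helper, and how the package actually passes the range to its HTTP client is not shown in this entry.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Build a Range header asking for `length` bytes starting at `offset`.
// Both ends of an HTTP byte range are inclusive, hence the "- 1".
std::string range_header(std::uint64_t offset, std::uint64_t length) {
  assert(length > 0);
  return "Range: bytes=" + std::to_string(offset) + "-" +
         std::to_string(offset + length - 1);
}
```

A server that honors the header replies with 206 Partial Content and only the requested bytes, so a reader with a fixed-size buffer never has to hold the whole file.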
Removed fprintf statements (which R won't allow in linked libraries) and fixed ASAN errors, mostly caused by GNU-specific pointer arithmetic. cpp11bigwig passes both ASAN and valgrind checks (via rhub).