Commit 92522afe authored by Irene Keller's avatar Irene Keller
Browse files

on disk exercises

parent b148d909
Pipeline #36570 passed with stage
in 1 minute and 24 seconds
Subproject commit 1e4926fc686a58af971844858abd1e85784a38a2
Subproject commit 1e1d41264cd14fea73258015cd40c07c2f2e8089
---
title: "Introducing HDF5 files"
author: "Mike Smith"
output:
html_document:
toc: true
toc_depth: 2
toc_float: false
number_sections: false
theme: flatly
highlight: tango
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Using the rhdf5 package
```{r, message = FALSE}
library(rhdf5)
```
```{r, fig-1, echo = FALSE, fig.cap = "Example HDF5 file structure"}
knitr::include_graphics('images/hdf5_structure.jpg')
```
## Exploring an HDF5 file
To start with we're going to look at and HDF5 file produced by 10X Genomics (the original file is available from [here](https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc8k)). For the moment we aren't going to worry about the specifics of the 10X format, but use this file to demonstrate how you can take a look at the contents of any HDF5 file.
First, lets start with the function `h5ls()`:
```{r, h5ls}
h5ls(file = '../data/on-disk-data/pbmc8k_raw_gene_bc_matrices_h5.h5')
```
The output from `h5ls()` gives us an overview of the structure of the file, without really showing us the content. We can see `pbmc8k_raw_gene_bc_matrices_h5.h5` contains a single group (GRCh38) and within that group there are many datasets. We can also see what type of data each dataset contains (the `dclass` column).
### Exercise
Use `h5ls()` to examine the other HDF5 files in the `data` folder.
## Reading from HDF5 files
```{r, h5dump}
pbmc8k_list <- h5dump("../data/on-disk-data/pbmc8k_raw_gene_bc_matrices_h5.h5")
```
This list is quite large and isn't going to be used again in this tutorial, so I recommend removing it from your R session so you do not run out of RAM.
```{r, cleanup-1}
rm(pbmc8k_list)
```
Sometimes you don't want to read the entire file, just a particular dataset. **rhdf5** includes the function `h5read()`, which requires a file path and the `name` argument to identify the dataset you want within the hierarchy.
```{r, pbmc1}
pbmc8k_data <- h5read(file = '../data/on-disk-data/pbmc8k_raw_gene_bc_matrices_h5.h5', name = "/GRCh38/data")
pbmc8k_barcodes <- h5read(file = '../data/on-disk-data/pbmc8k_raw_gene_bc_matrices_h5.h5', name = "/GRCh38/barcodes")
pbmc8k_shape <- h5read(file = '../data/on-disk-data/pbmc8k_raw_gene_bc_matrices_h5.h5', name = "/GRCh38/shape")
```
## Writing to HDF5 files
As you probably expect, it's all possible to write data to HDF5 files. This is probably a less frequent operation, either because you're processing a large amount of data into something smaller and don't need to use HDF5 to store or another piece of software does hte saving for you, but it useful to know and we will use further examples of this later to explore more properties of HDF5.
In the example below we create a small matrix and write this to a dataset called `example_matrix`. We then use `h5ls()` to confirm that it's been created.
```{r, write-1}
ex_matrix <- matrix(1:10, nrow = 5, ncol = 2)
h5write(obj = ex_matrix, file = "../data/on-disk-data/my_hdf5.h5", name = "example_matrix")
h5ls("../data/on-disk-data/my_hdf5.h5")
```
### Exercise
Try adding other objects to the HDF5 file. Aspects you can try varying include:
- R data type e.g. integer, numeric, character, ...
- R object type e.g. matrix, vector, array, list, data.frame, ...
- HDF location - you can specify locations in the file hierarchy using `/` e.g. `/place/in/the/file`
```{r, your-own-examples}
h5ls(file = '../data/on-disk-data/my_hdf5.h5')
my_file <- h5read(file = '../data/on-disk-data/my_hdf5.h5', name = "/example_matrix")
animals <- c("aff", "gnu", "sau")
h5write(obj = animals, file = "../data/on-disk-data/my_hdf5.h5", name = "animal_list")
h5ls(file = '../data/on-disk-data/my_hdf5.h5')
h5createGroup("../data/on-disk-data/my_hdf5.h5", "neu")
h5createGroup("../data/on-disk-data/my_hdf5.h5", "neu/more_animals")
more_animals <- c("esel", "zebra")
h5write(obj = more_animals, file = "../data/on-disk-data/my_hdf5.h5", name = "neu/more_anmials")
## use h5write() to try adding other things to a file
```
## Deleting parts of HDF5 files
Wanting to re-save something is quite a common need - at some point we all realise we've made a mistake or updated some parameters and improved a particular analysis. What happens if you try and overwrite an existing group or dataset in an HDF5 file? What about
```{r, write-2}
## we've grown our example matrix, and want to resave it.
ex_matrix <- matrix(1:100, nrow = 5, ncol = 20)
h5write(obj = ex_matrix, file = "../data/on-disk-data/my_hdf5.h5", name = "example_matrix")
```
# Doesn't work. Element needs to be deleted first, then added again.
If you need to remove a group or dataset from an HDF5 file you can use `h5delete()`. Here we verify that removing a dataset means that it no longer shows up when we list the contents, and the file size has been reduced by the removal.
```{r h5delete1}
file.size("../data/on-disk-data/my_hdf5.h5")
h5delete(file = "../data/on-disk-data/my_hdf5.h5", name = "example_matrix")
h5ls("../data/on-disk-data/my_hdf5.h5")
file.size("../data/on-disk-data/my_hdf5.h5")
h5delete(file = "../data/on-disk-data/my_hdf5.h5", name = "more_animals")
```
## Reading attributes
Sometimes important meta-data is stored in attributes associated with groups or datasets. We can use the `all = TRUE` option in `h5ls()` to list the number of attributes (along with other data), and `h5readAttributes()` to extract these in a similar fashion to before.
```{r}
h5ls("../data/spermatogenesis_rnavelocity/AdultMouseRep3_alevin_GRCm38.gencode.vM21.spliced.intron.fl90.gentrome.k31_sce_nometa.h5ad", all = TRUE)
h5readAttributes(file = "../data/spermatogenesis_rnavelocity/AdultMouseRep3_alevin_GRCm38.gencode.vM21.spliced.intron.fl90.gentrome.k31_sce_nometa.h5ad", name = "X")
```
---
title: "Exploring single-cell file types"
author: "Mike Smith"
output:
html_document:
toc: true
toc_depth: 2
toc_float: false
number_sections: false
theme: flatly
highlight: tango
editor_options:
chunk_output_type: inline
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Single-cell datasets can be extremely large, but are typically also very sparse. This sparsity provides a few options for storing and distributing single-cell count matrices in an efficient manner. Generally this falls into one of two categories:
- Store a sparse representation of the matrix by recording only the non-zero values and an index of which row and column the value comes from. This relies on the matrix being sparse enough, that the cost of storing the indices doesn't outweigh discarding the zero entries.
- Store the complete 2-dimensional matrix and rely on compression techniques to reduce the file size. Here no additional data are stored, but for very sparse matrices the compression will not necessarily be as efficient as removing the zero values entirely.
It is of course possible to combine these approaches and compress a sparse representation, but the effect will be less dramatic.
# 10X Genomics
10X provide their count matrices (referred to as "Feature-barcode matrices") using the sparse representation. The following is taken from the 10X documentation available (here)[https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/h5_matrices].
```
(root)
└── matrix [HDF5 group]
├── barcodes
├── data
├── indices
├── indptr
├── shape
└── features [HDF5 group]
├─ _all_tag_keys
├─ feature_type
├─ genome
├─ id
├─ name
├─ pattern [Feature Barcoding only]
├─ read [Feature Barcoding only]
└─ sequence [Feature Barcoding only]
```
| Column | Description |
| -- | ------------ |
| barcodes | Barcode sequences |
| data | Nonzero UMI counts in column-major order |
| indices | Zero-based row index of corresponding element in data |
| indptr | Zero-based index into data / indices of the start of each column, i.e., the data corresponding to each barcode sequence |
| shape | Matrix dimensions (# rows, # columns) |
**It's worth noting that the exact file format has changed over time, and probably will continue to do so, to reflect changes in experimental protocols and software. However the broad sparse matrix layout has remained consistent. You can examples of 10X files generated with previous versions of Cell Ranger (here)[https://support.10xgenomics.com/single-cell-gene-expression/datasets]**
The file `neuron_1k_v3_filtered_feature_bc_matrix.h5` is an example of a feature / cell matrix produced by Cell Ranger 3.0.0, and we can verify that it's structure matches the schematic above using the `h5ls()` command we saw previously. If you are provided with an HDF5 file of unknown provenance, an initial glance at the structure can often tell you a lot about how it was produced.
```{r, tenX-h5ls}
library(rhdf5)
h5ls("../data/on-disk-data/neuron_1k_v3_filtered_feature_bc_matrix.h5")
```
The sparse structure is efficient for storage and transfer, but not necessarily the easiest concept to work with directly. It's likely that in your work you probably still want to interact with the counts as if they were a matrix with rows and columns.
```{r, chunk-layout-fig, echo = FALSE, fig.cap='Representing a matrix as three vectors in the compressed sparse column format', fig.show = 'hold', out.width="100%"}
knitr::include_graphics('images/sparse_structure.png')
```
### Exercise
Can you construct an R matrix of counts from the data in the 10X file?
Hints:
- You can read the whole file into a list with `h5dump()`
- Note these indices are zero-based and R is one-based
- The `shape` represents the final matrix dimensions
- Items in `indptr` denote the start of 'blocks' of values in `indices` and `data` (there should be as many 'blocks' as there are columns)
- `indptr` also includes the last element
- Values in `indices` correspond to rows in the appropriate column
You can verify your solution by checking the sums of the values in the first 10 columns - these should be:
`1023 7210 9883 5290 18676 20196 716 6448 16837 2065`
```{r}
## Insert your code here
```
Most of the time you don't need to worry about the structure of these files, as you'll be using other software to read them and you hope the others of that software keep up-to-date with any changes to the file structure.
```{r, reading-with-packages}
library(DropletUtils)
library(Seurat)
DropletUtils::read10xCounts("../data/on-disk-data/neuron_1k_v3_filtered_feature_bc_matrix.h5")
Seurat::Read10X_h5("../data/on-disk-data/neuron_1k_v3_filtered_feature_bc_matrix.h5")
```
# Complete dense matrix
Formats such as loom as well as many of the software tools found in Bioconductor use the alternative approach and store the entire 2-dimensional matrix, including zeros, on disk.
```{r}
h5ls('../data/on-disk-data/L1_DRG_20_example.loom', recursive = FALSE)
```
In cases like this we can directly read the data into an R matrix without any further processing.
```{r}
h5read('../data/on-disk-data/L1_DRG_20_example.loom', name = "matrix")
```
## Working only with arrays
If you're interested in performing operations on single HDF5 dataset the HDF5Array package provides a neat interface. Many of the Bioconductor classes for representing single-cell data (e.g. SingleCellExperiment) make use of this. Here the meta data (Sample names, genes, etc) are stored in memory, but the actual count matrix is held on disk.
```{r, message = FALSE}
library(HDF5Array)
HDF5Array(filepath = '../data/on-disk-data/L1_DRG_20_example.loom', name = "matrix")
```
The HDF5Array package is also aware of the 10X sparse structure, and can read this directly. **Note: the argument to specify where the counts are in the file is now 'group' rather than 'name', as multiple datasets are used in the 10X format.**.
```{r, hdf5array-tenx}
HDF5Array::TENxMatrix(filepath = "../data/on-disk-data/neuron_1k_v3_filtered_feature_bc_matrix.h5", group = "matrix")
```
---
title: "Using HDF5 efficiently"
author: "Mike Smith"
output:
html_document:
toc: true
toc_depth: 2
toc_float: false
number_sections: false
theme: flatly
highlight: tango
editor_options:
chunk_output_type: inline
---
```{r, message=FALSE}
library(rhdf5)
library(HDF5Array)
```
## Reading subsets of the data
So far we've used `h5read()` or `h5dump()` to pull the entire contents of a HDF5 dataset in our R session. However this doesn't take advantage of one of the major features of HDF5 files - efficient access to subsets of a dataset.
What happens if you try to run the code below, which reads the `counts_matrix` dataset from `brain100k.h5`? Try using `h5ls()` to explore the size of the data.
```{r, too-large, eval = FALSE}
brain_data <- h5read(file = "../data/on-disk-data//brain100k.h5", name = "/counts_matrix")
```
Sometimes data are simply too large to read into memory. This 100,000 cell matrix is actually a subset of a larger, 1.3 million cell 10X dataset. Naively reading the complete set of counts into a dense R matrix requires ~ 150GB of RAM.
We can use the `index` argument to specify the elements we want to extract from our dataset. The syntax is a little strange for R, and should be a list with the same length as the number of dimensions in our dataset - in our case that's two. Each element of the list is a vector providing the indices you want to read or `NULL` to read everything in that dimension. In the example below we will read all the rows and the first five columns.
```{r, reading-subset}
h5ls(file = "../data/on-disk-data//brain100k.h5")
brain_data_subset <- h5read(file = "../data/on-disk-data//brain100k.h5", name = "/counts_matrix", index = list(NULL, 1:5))
head(brain_data_subset)
dim(brain_data_subset)
```
### *Exercise*
Can you modify the code to read other sets of columns? Instead of reading the first five columns try reading the last five or columns 50,001 - 50,005. You can also experiment reading a larger number of columns - perhaps 100, 1,000 or 10,000. Use `system.time()` to examine how long reading these subsets takes.
# Chunks are 100x100 in this example
# accessing consecutive columns scales linearly
# accessing randomly will take longer
```{r, eval = FALSE}
system.time(h5read(file = "../data/on-disk-data//brain100k.h5", name = "/counts_matrix", index = list(NULL, 1:5))) # here, only 1 chunk has to be read
system.time(h5read(file = "../data/on-disk-data//brain100k.h5", name = "/counts_matrix", index = list(NULL, c(1, 1001, 2001, 3001, 4001)))) # here, 5 chunks have to be read
system.time(h5read(file = "../data/on-disk-data//brain100k.h5", name = "/counts_matrix", index = list(NULL, sample(1:100000, 5))))
system.time(h5read(file = "../data/on-disk-data//brain100k.h5", name = "/counts_matrix", index = list(NULL, 99996:100000)))
system.time(h5read(file = "../data/on-disk-data//brain100k.h5", name = "/counts_matrix", index = list(NULL, 5001:5005)))
system.time(h5read(file = "../data/on-disk-data//brain100k.h5", name = "/counts_matrix", index = list(NULL, 1:10)))
system.time(h5read(file = "../data/on-disk-data//brain100k.h5", name = "/counts_matrix", index = list(NULL, 1:100)))
system.time(h5read(file = "../data/on-disk-data//brain100k.h5", name = "/counts_matrix", index = list(NULL, 1:1000)))
system.time(h5read(file = "../data/on-disk-data//brain100k.h5", name = "/counts_matrix", index = list(NULL, 1:10000)))
```
Hopefully you found that it takes a very similar amount of time to read five consecutive columns from anywhere in the file. You should also notice that the time taken to read larger numbers of columns scales fairly linearly with the number of columns you want.
In this case where our HDF5 file just contains a counts matrix, the HDF5Array approach works equally well.
```{r, reading-subset-hdf5a}
HDF5Array::HDF5Array(filepath = "../data/on-disk-data//brain100k.h5", name = "/counts_matrix")
```
### Exercise
How does the timing scale with HDF5Array?
## Exploring the effect of chunk layout
We've established that the chunked nature of HDF5 datasets provides efficient access to subsets of that data. The story is actually more nuanced than this, as HDF5 allows a user to control the size and shape of the chunks when the file is created, which can have a great impact on the performance of any later read operations.
```{r, chunk-layout-fig, echo = FALSE, fig.cap='In HDF5 datasets can be stored as "chunks" on disk, and if only a subset of data is required only the necessary chunks need to be read. Chunk layout doesn\'t have to be symetrical in every dimension, in the 2-dimensional dataset above chunks can consist of entire rows, entire columns, or any other regular partitioning.', fig.show = 'hold', out.width="100%"}
knitr::include_graphics('images/Chunk_layout_1.png')
knitr::include_graphics('images/Chunk_layout_2.png')
```
We're going to write the same data multiple times, and take a look at how this effects the read performance. First, lets get hold of a 10,000 cell matrix.
```{r, eval = TRUE}
brain_10k <- as.matrix(HDF5Array(file = "../data/on-disk-data/brain100k.h5", name = "/counts_matrix")[,1:10000])
dim(brain_10k)
```
rhdf5 doesn't provide access to the chunk dimension settings in `h5write()`. Instead we have to use a two-step process, creating an empty dataset with `h5createDatatset()` and then writing to that with `h5write()`. Here we create the dataset to consist of one single, very large, chunk.
```{r, eval = TRUE}
h5createFile(file = "../data/on-disk-data/one_chunk.h5")
h5createDataset(file = "../data/on-disk-data/one_chunk.h5",
dataset = "one_chunk",
dims = dim(brain_10k),
storage.mode = "integer",
chunk = c(nrow(brain_10k), ncol(brain_10k)))
h5write(brain_10k, file = "../data/on-disk-data/one_chunk.h5", name = "one_chunk")
h5ls("../data/on-disk-data/one_chunk.h5")
h5createFile(file = "../data/on-disk-data/1kby1k_square_chunks.h5")
h5createDataset(file = "../data/on-disk-data/1kby1k_square_chunks.h5",
dataset = "square_chunk",
dims = dim(brain_10k),
storage.mode = "integer",
chunk = c(1000, 1000))
h5ls("../data/on-disk-data/1kby1k_square_chunks.h5")
h5createFile(file = "../data/on-disk-data/row_chunks.h5")
h5createDataset(file = "../data/on-disk-data/row_chunks.h5",
dataset = "row_chunk",
dims = dim(brain_10k),
storage.mode = "integer",
chunk = c(1, ncol(brain_10k)))
h5createFile(file = "../data/on-disk-data/col_chunks.h5")
h5createDataset(file = "../data/on-disk-data/col_chunks.h5",
dataset = "col_chunk",
dims = dim(brain_10k),
storage.mode = "integer",
chunk = c(nrow(brain_10k), 1))
```
### Exercise
Try creating other datasets with different chunk sizes. You can pick whatever you like, but I suggest trying square chunks as well as the extremes of single-row or single-column chunks. You can then again use `system.time()` to see how long it takes to read five columns:
```{r timing-chunks, results = "hold"}
system.time( h5read(file = "../data/on-disk-data/one_chunk.h5", name = "/one_chunk", index = list(NULL, 1:5)) )
system.time( h5read(file = "../data/on-disk-data/1kby1k_square_chunks.h5", name = "/square_chunk", index = list(NULL, 1:5)) )
system.time( h5read(file = "../data/on-disk-data/row_chunks.h5", name = "/row_chunk", index = list(NULL, 1:5)) )
system.time( h5read(file = "../data/on-disk-data/col_chunks.h5", name = "/col_chunk", index = list(NULL, 1:5)) )
# something is wrong in my example. row_chunks should be slower in this particular example.
```
We can also look at the file size to get an idea of how the compression is affected.
## Writing subsets
We can use the `index` syntax to write subsets of a dataset.
```{r}
h5write(rep(1000, nrow(brain_10k)), file = "../data/on-disk-data/one_chunk.h5", name = "one_chunk", index = list(NULL, 1))
HDF5Array(file = "../data/on-disk-data/one_chunk.h5", name = "/one_chunk")
```
It's not always necessary to delete datasets, but be aware of type conversions (or lack of!)
title: "Working with on-disk data"
author:
- name: Mike Smith
output:
html_document:
toc: true
toc_depth: 2
toc_float: true
number_sections: false
theme: flatly
highlight: tango
navbar:
title: "On-disk-data"
left:
- text: "Introducing HDF5 files"
href: 01-on-disk.html
- text: "Exploring single-cell HDF5 file types"
href: 02-on-disk.html
- text: "Using HDF5 efficiently"
href: 03-on-disk.html
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta http-equiv="X-UA-Compatible" content="IE=EDGE" />
<meta name="author" content="Mike Smith" />
<title>Exploring single-cell HDF5 file types</title>
<script src="site_libs/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="site_libs/bootstrap-3.3.5/css/flatly.min.css" rel="stylesheet" />
<script src="site_libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="site_libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="site_libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<script src="site_libs/navigation-1.1/tabsets.js"></script>
<style type="text/css">
h1 {
font-size: 34px;
}
h1.title {
font-size: 38px;
}
h2 {
font-size: 30px;
}
h3 {
font-size: 24px;
}
h4 {
font-size: 18px;
}
h5 {
font-size: 16px;
}
h6 {
font-size: 12px;
}
.table th:not([align]) {
text-align: left;
}
</style>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
code {
color: inherit;
background-color: rgba(0, 0, 0, 0.04);
}
img {
max-width:100%;
}
.tabbed-pane {
padding-top: 12px;
}
.html-widget {
margin-bottom: 20px;
}
button.code-folding-btn:focus {
outline: none;
}
summary {
display: list-item;
}
</style>
<style type="text/css">
/* padding for bootstrap navbar */
body {
padding-top: 60px;
padding-bottom: 40px;
}
/* offset scroll position for anchor links (for fixed navbar) */
.section h1 {
padding-top: 65px;
margin-top: -65px;
}
.section h2 {
padding-top: 65px;
margin-top: -65px;
}
.section h3 {
padding-top: 65px;
margin-top: -65px;
}
.section h4 {
padding-top: 65px;
margin-top: -65px;
}
.section h5 {
padding-top: 65px;
margin-top: -65px;
}
.section h6 {
padding-top: 65px;
margin-top: -65px;
}
.dropdown-submenu {
position: relative;
}
.dropdown-submenu>.dropdown-menu {
top: 0;
left: 100%;
margin-top: -6px;
margin-left: -1px;
border-radius: 0 6px 6px 6px;
}
.dropdown-submenu:hover>.dropdown-menu {
display: block;
}
.dropdown-submenu>a:after {
display: block;
content: " ";
float: right;
width: 0;
height: 0;