.h5 file as a PyTorch dataset

Posted on 2021-12-16

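Advantages of using h5 files for dataset storage

See the Stack Overflow question

No limit on the size of the stored data
Better time and space efficiency for storage than mat files
Customizable compression
Supports reading part of a matrix by slicing, without reading the whole matrix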

Beyond the things listed above, there’s another big advantage to a “chunked”* on-disk data format such as HDF5: Reading an arbitrary slice (emphasis on arbitrary) will typically be much faster, as the on-disk data is more contiguous on average.
MATLAB supports this as well
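As a rough illustration of the compression and slicing points above, here is a minimal sketch using h5py and NumPy. The file name, dataset key, array shape and chunk shape are all assumptions made up for this example, not something taken from the linked question.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name and shapes, chosen only for illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
data = np.arange(100 * 32 * 32, dtype=np.float32).reshape(100, 32, 32)

with h5py.File(path, "w") as f:
    # One chunk per image; gzip is one of h5py's built-in compression filters.
    f.create_dataset("dataset", data=data, chunks=(1, 32, 32), compression="gzip")

with h5py.File(path, "r") as f:
    dset = f["dataset"]
    # Slicing an h5py Dataset reads only the requested region from disk
    # and returns it as a NumPy array.
    part = dset[10:12, :4, :4]
    full_shape = dset.shape

print(full_shape, part.shape)
```

Only the chunks that overlap the requested slice need to be decompressed, so the chunk shape is worth matching to the access pattern (here, one image per chunk).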

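Errors when reading with multiple workers while using an h5 file as a PyTorch dataset

See the GitHub issue

Do not open the dataset inside the h5 file in the Dataset's __init__; reading other information about it there (such as its length) is fine
Open the dataset inside the h5 file on the first call to __getitem__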

Example:

import h5py
import torch


class LXRTDataLoader(torch.utils.data.Dataset):
    def __init__(self):
        # Open the file only briefly to read metadata, then close it again.
        with h5py.File('img.hdf5', 'r') as f:
            self.length = len(f['dataset'])

    def open_hdf5(self):
        # Called lazily, so each DataLoader worker opens its own file handle.
        self.img_hdf5 = h5py.File('img.hdf5', 'r')
        self.dataset = self.img_hdf5['dataset'] # if you want dataset.

    def __getitem__(self, item: int):
        if not hasattr(self, 'img_hdf5'):
            self.open_hdf5()
        # Do loading here; both lines read the same sample, in two equivalent ways.
        img0 = self.img_hdf5['dataset'][item]
        img1 = self.dataset[item]
        return img0, img1

    def __len__(self):
        return self.length
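
To show how such a dataset might be used with a multi-worker DataLoader, here is a self-contained sketch. The class name LazyH5Dataset, the helper functions make_demo_file and run, the key "dataset" and the toy data are all hypothetical; the only idea taken from the example above is that the file is opened on the first __getitem__ call rather than in __init__.

```python
import os
import tempfile

import h5py
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class LazyH5Dataset(Dataset):
    """Hypothetical name; opens the HDF5 file lazily, once per worker process."""

    def __init__(self, path, key="dataset"):
        self.path = path
        self.key = key
        self._file = None  # no open handle is stored at construction time
        with h5py.File(path, "r") as f:
            self.length = len(f[key])

    def __getitem__(self, index):
        if self._file is None:
            # First access in this process: open the file and keep the handle.
            self._file = h5py.File(self.path, "r")
        return torch.from_numpy(self._file[self.key][index])

    def __len__(self):
        return self.length


def make_demo_file(n=16):
    # Write a small toy file so the sketch runs without external data.
    path = os.path.join(tempfile.mkdtemp(), "img.hdf5")
    with h5py.File(path, "w") as f:
        data = np.arange(n * 4, dtype=np.float32).reshape(n, 4)
        f.create_dataset("dataset", data=data)
    return path


def run(num_workers=2):
    ds = LazyH5Dataset(make_demo_file())
    loader = DataLoader(ds, batch_size=4, num_workers=num_workers)
    # Concatenate all batches back into one tensor of shape (n, 4).
    return torch.cat([batch for batch in loader])


if __name__ == "__main__":
    print(run().shape)
```

Because the dataset object holds no h5py handle when the DataLoader copies it into the worker processes, each worker ends up opening its own handle on first access, which is the behaviour the two rules above aim for.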