Using an .h5 file as a PyTorch dataset
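Advantages of storing a dataset in an h5 file
See the Stack Overflow question.
- No limit on the size of the stored data
- More efficient than .mat files in both storage time and disk space
- Customizable compression
- Part of an array can be read by slicing, without loading the whole array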
Beyond the things listed above, there’s another big advantage to a “chunked”* on-disk data format such as HDF5: Reading an arbitrary slice (emphasis on arbitrary) will typically be much faster, as the on-disk data is more contiguous on average.
- Also supported by MATLAB
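The compression and slicing points above can be tried with a minimal sketch. It assumes `h5py` and `numpy` are installed; the file name `demo.h5`, the dataset name `dataset`, the chunk shape, and the gzip level are arbitrary choices made for illustration, not recommendations from the linked question.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
data = np.arange(1000 * 64, dtype=np.float32).reshape(1000, 64)

# Write the array with chunked storage and gzip compression.
with h5py.File(path, "w") as f:
    f.create_dataset(
        "dataset",
        data=data,
        chunks=(100, 64),    # assumed chunk shape: 100 rows per chunk
        compression="gzip",
        compression_opts=4,  # assumed gzip level
    )

# Read rows 10..19 and columns 0..7 only; h5py fetches just the
# chunks that overlap the requested slice.
with h5py.File(path, "r") as f:
    dset = f["dataset"]
    full_shape = dset.shape
    block = dset[10:20, :8]

print(full_shape, block.shape)
```

The slice comes back as an ordinary NumPy array, so it can be passed on without further conversion.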
Errors from multi-worker reading when an h5 file is used as a PyTorch dataset
See the GitHub issue.
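- Do not open the dataset stored in the h5 file inside the Dataset's `__init__`; reading other information about it there (such as its length) is fine.
- Open the dataset stored in the h5 file on the first call to `__getitem__`.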
Example:
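import h5py
import torch


class LXRTDataLoader(torch.utils.data.Dataset):
    def __init__(self):
        # Open the file only long enough to read metadata; keep no handle.
        with h5py.File('img.hdf5', 'r') as f:
            self.length = len(f['dataset'])

    def open_hdf5(self):
        # Called lazily from __getitem__, so each worker opens its own handle.
        self.img_hdf5 = h5py.File('img.hdf5', 'r')
        self.dataset = self.img_hdf5['dataset']  # if you want dataset.

    def __getitem__(self, item: int):
        if not hasattr(self, 'img_hdf5'):
            self.open_hdf5()
        # Indices 0 and 1 are hardcoded for illustration; index with `item` in practice.
        img0 = self.img_hdf5['dataset'][0]  # Do loading here
        img1 = self.dataset[1]
        return img0, img1

    def __len__(self):
        return self.length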
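
To see the lazy-open pattern running end to end with several workers, here is a self-contained sketch. It assumes `h5py`, `numpy`, and `torch` are installed. The class `LazyH5Dataset`, the helpers `make_demo_file` and `run_demo`, and the file name `img_demo.hdf5` are hypothetical names introduced for illustration; they do not come from the GitHub issue.

```python
import os
import tempfile

import h5py
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class LazyH5Dataset(Dataset):
    """Hypothetical dataset that opens its HDF5 file on first access."""

    def __init__(self, path, key="dataset"):
        self.path = path
        self.key = key
        self.h5_file = None  # no open handle is kept at construction time
        with h5py.File(path, "r") as f:
            self.length = len(f[key])

    def __getitem__(self, index):
        if self.h5_file is None:
            # Runs once per process, so every worker gets its own handle.
            self.h5_file = h5py.File(self.path, "r")
        return torch.from_numpy(self.h5_file[self.key][index])

    def __len__(self):
        return self.length


def make_demo_file(path, n=8):
    # Row i is filled with the value i, which makes batches easy to check.
    data = np.repeat(np.arange(n, dtype=np.float32)[:, None], 4, axis=1)
    with h5py.File(path, "w") as f:
        f.create_dataset("dataset", data=data)


def run_demo(num_workers=2):
    path = os.path.join(tempfile.mkdtemp(), "img_demo.hdf5")
    make_demo_file(path)
    loader = DataLoader(LazyH5Dataset(path), batch_size=4, num_workers=num_workers)
    # Collect the first column of every sample, in loading order.
    return [int(v) for batch in loader for v in batch[:, 0]]


if __name__ == "__main__":
    print(run_demo(num_workers=2))
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by spawning rather than forking, because the main module is imported again in each worker. Storing `None` instead of an open handle in `__init__` also keeps the dataset object picklable in that case.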