HDF5 Notes Ghosh
canSAS 2D treated data
1. The treated data to be stored might include:
a. sequences of 1D data, measured against a scan variable, e.g. temperature
b. one or a sequence of 2D maps, again with the possibility of scan variables, e.g. shear rate, temperature, tomography scan etc.
c. for neutron work the data might be recorded as multiple wavelength slices (TOF); for X-rays these might be from an energy-dispersive detector
e. the data may be measured with polarised neutrons
f. results from multiple detector configurations.
The common characteristic is the intensity, S(Q,v1,v2...) or S(Qx,Qy,v1,v2...), and the deviation in S, Q, Qx, Qy, v1... etc. In addition, continuing from the discussions of canSAS 1D XML data, it is useful to add titles, short titles, and a history of precursor(s) and treatment (SASprocessnote). Since the contents have been corrected as far as possible to remove instrument characteristics, there is no need to propagate these items in the treated data; they remain fully described in the raw data. It should thus be possible to merge data from different instruments directly at this stage (so item c. might even be results from distinct instruments).

2. Storage Formats
a. The three platforms where data will be treated are PC-Windows, Macintosh-OSX, and Linux.
b. The primary aim of the canSAS-1D format was to provide a data file which could be read automatically by existing well-known programs. The XML design allows the files to be imported easily into Excel and to be displayed clearly by most web browsers. The stylised ASCII text can also be read easily without any tools on all three systems. Attribute fields contain vital information, notably units, for stored values.
c. For potentially large data sets, text readability is much less valuable: navigating large tables of numbers is impracticable. This was recognised early by the initial proponents of NeXus, notably Jon Tischler (APS). The format proposed for NeXus was based on HDF, the Hierarchical Data Format (1990-).
HDF stored data in a system-independent binary form, in a single file. For the user this was structured superficially like a Unix file system, with the possibility of creating links (these, for example, allow a slab of a selected region of data to be created as a new entity without actually copying the data, simply by linking the remapping information to the initial data). The attraction for very large image data with multiple parameters is evident. Consequently a number of general data visualisation tools have been developed, and are freely available for all three computer platforms. HDF was not initially designed to hold large volumes of single-valued metadata.
d. The NeXus project started in 1995 with the aim of standardising storage of neutron and X-ray scattering data. The file format chosen was HDF, based on existing tools and generic visualisation software, but using only a very limited subset for simplicity. A hierarchical dictionary of class names was created and these were attached to the hierarchical data items as attributes. The data file components were hence navigable through these class names (see appendix 1). The files are, however, standard HDF files.
e. In 2003 the increasing limitations (complexity, lack of text methods, etc.) of the HDF file format (versions 1 to 4) were finally overcome with a complete change of internal structure, creating HDF5. The existing petabytes of data in HDF4-compatible format require continued maintenance of the basic toolset, but all new efforts were directed to HDF5. The NeXus project too had led to considerable amounts of data being stored in the HDF4 format, and the NeXus programmers have striven to create a library which is transparent to the type of data file involved. This is done at a cost of some complexity.
f. For canSAS-2D treated data the need to map the metadata between instruments has been removed; there are few reasons to follow the guidelines for writing a NeXus-compatible file.
There are also problems in installing software that depends on other components, which themselves have dependencies (see appendix 2).

3. Proposal for HDF5 based data format for canSAS-2D
a. The necessary identifiers for treated data were discussed for canSAS-1D. The majority of items were optional. It is necessary to extend the information concerning the actual mapped data for canSAS-2D.
b. The prescription should be extensible. Space requirements are less important than ease of access, dumping sections and browsing.
c. The scattering data could be stored as a sequence of data items. It would be useful for the first data component in a group to be a 2-D plottable array of corrected intensity, or simply to identify it with the NeXus attribute Signal=1. Another option might be to interpolate this (simple bi-linear, or more sophisticated splines) onto a regular grid linear in x and y. The plots would then be easier to compare and merge. Additional data items could include scan variables, short identifier titles etc.
d. In addition to the intensities there could be similar maps of intensity deviation, x-deviation, y-deviation, etc. The notion of links allows the other data items from the first map to be included without needing copies, hence each component appears similar in "shape".
e. The data could be stored additionally as a sequence of tuples of (S, S_dev, x, x_dev, y, y_dev, polarisation, etc.). This would allow different measurements to be sorted and merged; for example, in TOF measurements several wavelength bands could be saved separately. With the data it is also useful to include the component information and treatment information. An attribute can easily contain multi-line text.
f. Since a new set of classes would have to be created for NeXus anyway, there seems little point in using and maintaining another package to read and write data. It is easy to decorate the HDF5 file with the few attributes expected in a NeXus file.
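The decoration described in item 3f can be sketched with h5py. This is a minimal illustration under stated assumptions, not a reference writer: the function name write_cansas2d, the demo values and the temporary file name are hypothetical, the group and attribute names are borrowed from the test file summarised in appendix 3, and the attributes are written as scalar strings rather than the one-element fixed-length string arrays seen in that dump.

```python
import os
import tempfile

import h5py
import numpy as np


def write_cansas2d(path, s, sdev, qx, qy, title, process_lines):
    """Write one treated 2D map with a few NeXus-style attributes (illustrative)."""
    with h5py.File(path, "w") as f:
        root = f.create_group("canSAS2D")

        # Sample group: only a title is stored for treated data.
        sample = root.create_group("ASample")
        sample.attrs["NXclass"] = "NXsample"
        sample.create_dataset("Title", data=title)

        # Axis scales, stored once and referred to by name from the map.
        data = root.create_group("Data")
        for name, values in (("Qx", qx), ("Qy", qy)):
            axis = data.create_dataset(name, data=np.asarray(values, dtype="f4"))
            axis.attrs["Units"] = "1/A"

        # Corrected intensity map, flagged as the plottable item with Signal=1.
        s_ds = data.create_dataset("S", data=np.asarray(s, dtype="f4"))
        s_ds.attrs["NXclass"] = "NXdata"
        s_ds.attrs["Signal"] = np.int32(1)
        s_ds.attrs["Units"] = "1/cm"
        s_ds.attrs["x_scale"] = "Qx"
        s_ds.attrs["y_scale"] = "Qy"
        s_ds.attrs["Interpolation"] = "None"
        # A single attribute can hold the multi-line treatment history.
        s_ds.attrs["Process"] = "\n".join(process_lines)

        # Deviation map with the same shape as the intensity map.
        data.create_dataset("Sdev", data=np.asarray(sdev, dtype="f4"))


# Hypothetical demo values: an empty 128 x 128 map on made-up Q scales.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_cansas2d.h5")
write_cansas2d(
    demo_path,
    s=np.zeros((128, 128)),
    sdev=np.zeros((128, 128)),
    qx=np.linspace(-0.01, 0.02, 128),
    qy=np.linspace(-0.015, 0.015, 128),
    title="demo sample",
    process_lines=["first treatment line", "second treatment line"],
)

with h5py.File(demo_path, "r") as f:
    signal = int(f["/canSAS2D/Data/S"].attrs["Signal"])
    shape = f["/canSAS2D/Data/S"].shape
print(signal, shape)
```

Only h5py and numpy are needed here; no NeXus library is involved, which is the point made in 3f.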
A simple structure might have the following layout:

canSAS-2D
  Data
    S(Qx,Qy) (say 128x128,10,2)
      Attributes X="Qx_scale", Y="Qy_scale", Units="1/cm", Scan1="Temperature", Scan2="Polarisation", Legend="short_title", Interpolation="None", Process="multi-line analysis summary", Signal=1
    Sdev(Qx,Qy) (128,128,10,2)
    Qx_scale (128)
    Qx_scale_dev (128)
    Qy_scale (128)
    Qy_scale_dev (128)
    Temperature (10)
    Polarisation (2)
  Sample
    Main_Title

4. Data browsing for HDF files
Browsers can open files and restructure internal components; most importantly they can not only list the contents of data fields, but usually also have several plotting options for 1D cuts, 2D maps etc. They include:
HDFView http://www.hdfgroup.org/hdf-java-html/hdfview/
HDFExplorer (semi-commercial product $39)
ISAW (?)
PyMca http://pymca.sourceforge.net

HDF5 Data manipulation
h5py http://code.google.com/p/h5py
The HDF5 library includes working interfaces and examples for C, C++, F77, F90 and Java.
The HDF5 library is incorporated into Matlab, IDL etc.
The NeXus library is built on top of HDF5 and has support for Python and C. The Fortran 90 interface works only on Windows.

Examples
(One weakness of the NeXus project was the paucity of examples, hence the divergence in local implementations.)
Typical example files might include
- simple SAS from radially symmetric data to check interpolation procedures etc.
- GSAS having a pattern offset from nominal detector centre etc.

Appendix 1 Annotated summary of NeXus raw data file
Structure of NeXus HDF5 file from HDFView

top level 054289.nxs
  group size 1
  4 attributes
    HDF5_Version = 1.8.3
    NeXus_Version = 4.2.0
    file_name = /users/data/054289.nxs
    file_time = 2012-04-19T11:12:34+01:00
  entry0
    Group size = 14
    Number of attributes = 1
    NX_class = NXentry          ! standard starting point for NeXus files
    D22                         ! instrument name
      Group size = 14
      Number of attributes = 1
      NX_class = NXinstrument   ! instrument components & values
      BS
        Group size = 12
        Number of attributes = 1
        NX_class = NXbeamstop
        bx_actual
          32-bit floating-point
          Number of attributes = 0
          Value = 1.81          ! (no units!)
        bx_offset etc
      Detector
        Group size = .....
      etc. for each attenuator, collimation, selector ...
    data
      Group size = 1
      Number of attributes = 1
      NX_class = NXdata
      data
        32-bit integer, 128 x 128 x 1
        Number of attributes = 1
        signal = 1              ! shows plottable data entity
        value table[,,]
    sample
      Group size = 20
      Number of attributes = 1
      NX_class = NXsample
      temperature
        32-bit floating-point
        Number of attributes = 0
      san_actual                ! sample rotation angle
        32-bit floating-point
        Number of attributes = 0

Appendix 2 Installing and running programs on different systems using gcc
The dependencies for HDF-based software are illustrated below.
Macintosh OSX, Leopard 10.5.8 (2008) with Xcode, gfortran, gcc 4.2.3
Linux: Fedora Core-11 (2009) gfortran, gcc 4.4
HDFView 2.7 for pre-2010 systems, v2.8 current
HDF5 version 1.8.5 for pre-2010 systems, currently 1.8.9
h5py 1.3.1.tar.gz (requires python 2.5-2.6, HDF5 1.6.5 to 1.8.5); requires HDF5 built without Fortran support (needs shared libraries)
NeXus-4.2.1 Notes: requires mxml package, mxml-2.7.tar.gz
Several binary packages for NeXus (.dmg, .rpm) failed on dependencies; all were finally rebuilt from source. Building each component requires inspecting the INSTALL information to create a suitable set of libraries. The HDF5 package built easily, though one of the checks in the x86_64 Linux package stopped the system (large number test).
Windows - installation binaries
Windows-XP for MinGW
zlib- built by MSYS
hdf5-1.8.5 built by MSYS; -lws2_32 was added manually to the link library list (a binary distribution exists for Intel compilers)
python-2.7.3.msi
numpy-1.6.2.win32-py2.7.exe
h5py-2.0.1.win32-py2.7.msi
NeXus-4.2.0.zip requires mxml
mxml mxml-2.7.tar.gz hand-built libmxml.a (without MSYS)
There are severe problems in using the most recent versions, perhaps linked to the changes for the x86_64 architecture. The pre-built NeXus packages all depend on the hdf4, hdf5 and mxml libraries. The last would require installing the MSYS package, or hand-building.

Appendix 3 Summary of test file cd2_050506_001.h5
Data are from monodisperse spheres (A. Rennie), D22, run 50506+

h5dump -n cd2_050506_001.h5

  HDF5 "cd2_050506_001.h5" {
  FILE_CONTENTS {
   group      /
   group      /canSAS2D
   group      /canSAS2D/ASample
   dataset    /canSAS2D/ASample/Title
   group      /canSAS2D/Data
   dataset    /canSAS2D/Data/Qx
   dataset    /canSAS2D/Data/Qy
   dataset    /canSAS2D/Data/S
   dataset    /canSAS2D/Data/Sdev
   }
  }

Showing the structure in more detail, with NeXus decorations:

h5dump -A cd2_050506_001.h5

  HDF5 "cd2_050506_001.h5" {
  GROUP "/" {
     GROUP "canSAS2D" {
        GROUP "ASample" {
           ATTRIBUTE "NXclass" {
              DATATYPE H5T_STRING { STRSIZE 8; STRPAD H5T_STR_SPACEPAD; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; }
              DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
              DATA { (0): "NXsample" }
           }
           DATASET "Title" {
              DATATYPE H5T_STRING { STRSIZE 50; STRPAD H5T_STR_SPACEPAD; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; }
              DATASPACE SCALAR
           }
        }
        GROUP "Data" {
           DATASET "Qx" {
              DATATYPE H5T_IEEE_F32LE
              DATASPACE SIMPLE { ( 128 ) / ( 128 ) }
              ATTRIBUTE "Units" {
                 DATATYPE H5T_STRING { STRSIZE 3; STRPAD H5T_STR_SPACEPAD; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; }
                 DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
                 DATA { (0): "1/A" }
              }
           }
           DATASET "Qy" {
              DATATYPE H5T_IEEE_F32LE
              DATASPACE SIMPLE { ( 128 ) / ( 128 ) }
              ATTRIBUTE "Units" {
                 DATATYPE H5T_STRING { STRSIZE 3; STRPAD H5T_STR_SPACEPAD; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; }
                 DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
                 DATA { (0): "1/A" }
              }
           }
           DATASET "S" {
              DATATYPE H5T_IEEE_F32LE
              DATASPACE SIMPLE { ( 128, 128 ) / ( 128, 128 ) }
              ATTRIBUTE "Interpolation" {
                 DATATYPE H5T_STRING { STRSIZE 4; STRPAD H5T_STR_SPACEPAD; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; }
                 DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
                 DATA { (0): "None" }
              }
              ATTRIBUTE "NXclass" {
                 DATATYPE H5T_STRING { STRSIZE 6; STRPAD H5T_STR_SPACEPAD; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; }
                 DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
                 DATA { (0): "NXdata" }
              }
              ATTRIBUTE "Process" {
                 DATATYPE H5T_STRING { STRSIZE 80; STRPAD H5T_STR_SPACEPAD; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; }
                 DATASPACE SIMPLE { ( 5 ) / ( 5 ) }
                 DATA {
                 (0): "Created by apl8 27-Jun-2012 22:01:09 MASK: m12a.msk ",
                 (1): " AvA1 0.0000E+00 AsA2 8.2300E-01 XvA3 0.0000E+00 XsA4 8.2300E-02 XfA5 0.0000E+00",
                 (2): "S... 50506 0 6.80E+02 sple A 0.4% Sbak 50505 0 6.79E+02 MT cell ",
                 (3): "Cd/E 50510 0 3.40E+02 blocked beam ",
                 (4): " "
                 }
              }
              ATTRIBUTE "Signal" {
                 DATATYPE H5T_STD_I32LE
                 DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
                 DATA { (0): 1 }
              }
              ATTRIBUTE "Units" {
                 DATATYPE H5T_STRING { STRSIZE 4; STRPAD H5T_STR_SPACEPAD; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; }
                 DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
                 DATA { (0): "1/cm" }
              }
              ATTRIBUTE "x_scale" {
                 DATATYPE H5T_STRING { STRSIZE 2; STRPAD H5T_STR_SPACEPAD; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; }
                 DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
                 DATA { (0): "Qx" }
              }
              ATTRIBUTE "y_scale" {
                 DATATYPE H5T_STRING { STRSIZE 2; STRPAD H5T_STR_SPACEPAD; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; }
                 DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
                 DATA { (0): "Qy" }
              }
           }
           DATASET "Sdev" {
              DATATYPE H5T_IEEE_F32LE
              DATASPACE SIMPLE { ( 128, 128 ) / ( 128, 128 ) }
           }
        }
     }
  }
  }

python h5toText.py cd2_050506_001.h5

  cd2_050506_001.h5
    canSAS2D
      ASample
        @NXclass = ['NXsample']
        Title:char[50][] = __array
          __array =
      Data
        Qx:float32[128] = __array
          @Units = ['1/A']
          __array = [-0.0093729430809617043, -0.0091350506991147995, -0.0088971592485904694, '...', 0.020839333534240723]
        Qy:float32[128] = __array
          @Units = ['1/A']
          __array = [-0.015177506022155285, -0.01493961364030838, -0.01470172218978405, '...', 0.015034771524369717]
        S:float32[128,128] = __array
          @NXclass = ['NXdata']
          @x_scale = ['Qx']
          @y_scale = ['Qy']
          @Interpolation = ['None']
          @Units = ['1/cm']
          @Signal = [1]
          @Process = ['Created by apl8 27-Jun-2012 22:01:09 MASK: m12a.msk'
            ' AvA1 0.0000E+00 AsA2 8.2300E-01 XvA3 0.0000E+00 XsA4 8.2300E-02 XfA5 0.0000E+00'
            'S... 50506 0 6.80E+02 sple A 0.4% Sbak 50505 0 6.79E+02 MT cell'
            'Cd/E 50510 0 3.40E+02 blocked beam'
            '']
          __array = [
              [0.0, 0.0, 0.0, '...', 0.0]
              [0.0, 0.0, 0.0, '...', 0.0]
              [0.0, 0.0, 0.0651400014758, '...', 0.0]
              ...
              [0.0, 0.0, 0.0, '...', 0.0]
            ]
        Sdev:float32[128,128] = __array
          __array = [
              [0.0, 0.0, 0.0, '...', 0.0]
              [0.0, 0.0, 0.0, '...', 0.0]
              [0.0, 0.0, 0.0420599989593, '...', 0.0]
              ...
              [0.0, 0.0, 0.0, '...', 0.0]
            ]

The file is easily read and dumped by h5dump, and may be plotted with HDFView and PyMca.
The file size is 140768 bytes for the 128x128 data and errors.
The original ASCII data are 370694 bytes.
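As a rough cross-check of these sizes, the raw array payload can be estimated from the shapes and the 32-bit float type shown in the dump. The sketch below assumes that only S, Sdev, Qx and Qy contribute significantly; the remainder is taken to be the title, the attributes and HDF5 structural metadata, which is an interpretation rather than a measured breakdown.

```python
FLOAT32_BYTES = 4
n = 128

# Two n x n maps (S, Sdev) plus two n-point scales (Qx, Qy), all float32.
payload = 2 * n * n * FLOAT32_BYTES + 2 * n * FLOAT32_BYTES

file_size = 140768   # HDF5 file size quoted in the text
ascii_size = 370694  # original ASCII size quoted in the text

overhead = file_size - payload   # bytes not accounted for by the four arrays
ratio = ascii_size / file_size   # how much larger the ASCII form is

print(payload, overhead, round(ratio, 2))
```

On these assumptions the four arrays account for 132096 bytes, leaving under 9 kB for everything else, and the ASCII form is roughly 2.6 times larger than the HDF5 file.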
The example data file may be obtained from
ftp://ftp.ill.fr/pub/cs/reg/canSAS2D/cd2_050506_001.h5
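A file with this structure could also be inspected from Python with h5py. The sketch below is hedged: list_contents and find_signal are hypothetical helper names, and so that it runs without the download it first builds a small stand-in file (4x4 instead of 128x128) with the same group and dataset names as the listing in appendix 3.

```python
import os
import tempfile

import h5py
import numpy as np


def list_contents(path):
    """Return (kind, name) pairs for every object, in the spirit of h5dump -n."""
    entries = [("group", "/")]

    def visit(name, obj):
        kind = "group" if isinstance(obj, h5py.Group) else "dataset"
        entries.append((kind, "/" + name))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return entries


def find_signal(path):
    """Return the path of the first dataset whose Signal attribute equals 1."""
    found = []

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset) and "Signal" in obj.attrs:
            # Accept both a scalar attribute and a one-element array.
            if int(np.ravel(obj.attrs["Signal"])[0]) == 1:
                found.append("/" + name)

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return found[0] if found else None


# Stand-in file with the same names as the test file, but tiny arrays.
demo_path = os.path.join(tempfile.mkdtemp(), "standin.h5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("canSAS2D/ASample/Title", data="stand-in sample")
    f.create_dataset("canSAS2D/Data/Qx", data=np.zeros(4, dtype="f4"))
    f.create_dataset("canSAS2D/Data/Qy", data=np.zeros(4, dtype="f4"))
    s = f.create_dataset("canSAS2D/Data/S", data=np.zeros((4, 4), dtype="f4"))
    s.attrs["Signal"] = np.array([1], dtype="i4")
    f.create_dataset("canSAS2D/Data/Sdev", data=np.zeros((4, 4), dtype="f4"))

contents = list_contents(demo_path)
for kind, name in contents:
    print(kind, name)
print("plottable:", find_signal(demo_path))
```

Replacing demo_path with the path of the downloaded cd2_050506_001.h5 should give a listing comparable to the h5dump -n output, though this has not been checked against that file here.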
- Ron Ghosh 10:51, 2 July 2012 (CDT)