The Multihash Data Format

Internet-Draft	Multihash	October 2024
Benet, et al.	Expires 11 April 2025

Abstract

Cryptographic hash functions often generate multiple output sizes and encodings. This variability makes it difficult for applications to examine a series of bytes and determine which hash function produced them, and thus such context is traditionally passed alongside the resulting bytes in defined protocols. Multihash inlines this context information so that it can travel and be translated more easily, decoupled from specific protocols.¶

4. The Multihash Fields

A multihash follows the TLV (type-length-value) pattern and consists of three fields: two unsigned variable-length integers (the hash function identifier and the digest length) followed by the raw digest bytes.

4.1. Multihash Core Data Types

This section details the core data types used by the Multihash data format.

4.1.1. Unsigned Variable Integer

A data type that enables one to express an unsigned integer of variable length. The format uses the Unsigned Little Endian Base 128 (ULEB128) encoding that was canonically defined in Appendix C of the DWARF Debugging Information Format standard, initially released in 1993, and further specified in 2011 by IRTF [RFC6256] as Self-Delimiting Numeric Values or SDNVs.¶

As suggested by the name, this variable length encoding is only capable of representing unsigned integers. While the format itself has no theoretical maximum, implementations MUST NOT encode more than nine (9) bytes, which gives a practical range of 0 to 2^63 - 1. When encoding an unsigned variable integer, the unsigned integer is serialized seven bits at a time, starting with the least significant bits. The most significant bit of each output byte is set when a continuation byte follows and clear on the final byte.

Table 1
Value	Encoding (bits)	Encoding (hexadecimal)
1	00000001	0x01
127	01111111	0x7F
128	10000000 00000001	0x8001
255	11111111 00000001	0xFF01
300	10101100 00000010	0xAC02
16384	10000000 10000000 00000001	0x808001
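The rows of Table 1 can be reproduced with a short encoder. The following is a minimal sketch rather than a reference implementation; the function name `encode_uvarint` is a hypothetical helper chosen for illustration.

```python
def encode_uvarint(value: int) -> bytes:
    """Encode a non-negative integer as ULEB128 (hypothetical helper name)."""
    if value < 0:
        raise ValueError("unsigned varints cannot represent negative integers")
    if value >= 1 << 63:
        raise ValueError("value does not fit in nine bytes (63 bits)")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the seven least significant bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(low7)         # final byte: continuation bit stays clear
            return bytes(out)


# Print the hexadecimal encoding for each value listed in Table 1.
for n in (1, 127, 128, 255, 300, 16384):
    print(n, encode_uvarint(n).hex())
```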

Implementations MUST restrict the size of the varint to a maximum of nine bytes (63 bits of payload). This limit bounds the work a decoder commits to a single integer and so avoids memory attacks on the encoding. There is no theoretical limit, and future specifications can raise this number if it is truly necessary to have code or length values larger than 2^63 - 1.
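A decoder can enforce the nine-byte limit by refusing to read past the ninth byte. The sketch below is illustrative only: the names `decode_uvarint` and `MAX_VARINT_BYTES` are assumptions, and it does not reject non-minimal encodings, which this section does not address.

```python
from typing import Tuple

MAX_VARINT_BYTES = 9  # limit stated in the text above


def decode_uvarint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, bytes_consumed) for the varint starting at buf[offset]."""
    value = 0
    for i in range(MAX_VARINT_BYTES):
        if offset + i >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset + i]
        value |= (byte & 0x7F) << (7 * i)  # low seven bits carry the payload
        if byte & 0x80 == 0:               # continuation bit clear: last byte
            return value, i + 1
    raise ValueError("varint longer than nine bytes")


print(decode_uvarint(bytes.fromhex("ac02")))
```

Returning the number of bytes consumed lets a caller continue parsing the next field from the same buffer.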

4.2. Multihash Fields

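Read as TLV (type-length-value), the hash function identifier is the type, the digest length is the length, and the digest value is the value. The three fields are serialized in that order.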

4.2.1. Hash Function Identifier

The hash function identifier is an unsigned variable integer identifying the hash function. Possible values for this field are provided in the Multihash Identifier Registry (see Appendix C, IANA Considerations).

4.2.2. Digest Length

The digest length is an unsigned variable integer counting the length of the digest in bytes.¶

4.2.3. Digest Value

The digest value is the output of the hash function. Its length in bytes is exactly the value given in the digest length field.
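Putting the three fields together, a multihash can be assembled by concatenating the two varints and the digest. The following is a minimal sketch under stated assumptions: `encode_uvarint` and `encode_multihash` are hypothetical helper names, and the identifier 0x12 for sha2-256 is taken from the registry table in this document.

```python
import hashlib


def encode_uvarint(value: int) -> bytes:
    """ULEB128-encode a non-negative integer below 2^63."""
    if not 0 <= value < 1 << 63:
        raise ValueError("value out of range for a nine-byte varint")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_multihash(code: int, digest: bytes) -> bytes:
    """Concatenate identifier varint, length varint, and digest bytes."""
    return encode_uvarint(code) + encode_uvarint(len(digest)) + digest


SHA2_256 = 0x12  # identifier from the registry table in this document
mh = encode_multihash(SHA2_256, hashlib.sha256(b"").digest())
print(mh.hex())
```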

4.3. A Multihash Example

For example, the following is an expression of a SHA2-256 hash in hexadecimal notation (spaces added for readability purposes):¶

0x12 20 41dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8

The first byte (0x12) specifies the SHA2-256 hash function. The second byte (0x20) specifies the length of the hash, which is 32 bytes. The rest of the data specifies the value of the output of the hash function.¶
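The same example can be split back into its fields programmatically. This is a sketch, not a normative parser; `read_uvarint` and `decode_multihash` are hypothetical names, and the parser assumes the buffer contains exactly one multihash with no trailing data.

```python
from typing import Tuple


def read_uvarint(buf: bytes, offset: int) -> Tuple[int, int]:
    """Return (value, next_offset) for the varint starting at buf[offset]."""
    value = 0
    for i in range(9):  # nine-byte limit from the varint section
        if offset + i >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset + i]
        value |= (byte & 0x7F) << (7 * i)
        if byte & 0x80 == 0:
            return value, offset + i + 1
    raise ValueError("varint longer than nine bytes")


def decode_multihash(buf: bytes) -> Tuple[int, bytes]:
    """Split a multihash into (hash function identifier, digest)."""
    code, pos = read_uvarint(buf, 0)
    length, pos = read_uvarint(buf, pos)
    digest = buf[pos:]
    if len(digest) != length:
        raise ValueError("digest length field does not match digest size")
    return code, digest


example = bytes.fromhex(
    "1220"
    "41dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8"
)
code, digest = decode_multihash(example)
print(hex(code), len(digest), digest.hex())
```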

5. Prior Art And Translation

In IETF's corpus of normative protocols, there are three partial overlaps of problem space worth familiarizing oneself with to minimize collisions and confusions:¶

"Named Information Hash", specified in [RFC6920], defines a hierarchical URI scheme for content-identifiers, partitioned by enumerated hash functions. The NIH registry at IANA contains all of these.
UUIDv5, aka "Namespaced UUIDs", defined in [RFC9562] section 5.5, does the inverse, defining a universal namespace for one hash function, partitioned by the application of that function to multiple URI schemes (e.g., DNS names, valid URLs, etc.).
The IANA NIH registry has a similar shape and governance model to the IANA hashAlgorithm registry that TLS 1.2 implementations use to compactly signal supported hash+signature combinations. Since the former has different entries for some hash functions based on output length and the latter does not, the two registries are not alignable. However, given their different contexts, collisions between the two would not be a practical concern for users of either.

5.1. Named Information Hash

The "Named Information Hash" URI scheme allows for minimally self-describing hash strings to serve as content-identifiers for arbitrary binary inputs. This lightweight identifier scheme is defined in [RFC6920] and the supported hash-context prefixes live in the IANA "Named Information Hash Algorithm Registry" (https://www.iana.org/assignments/named-information/named-information.xhtml#hash-alg). Its syntactic similarity to HTTP headers and support for MIME content-types makes it potentially useful for web use-cases, but use-cases are not constrained by URI scheme, only hinted at by the specification in sections 3 through 7.

One limitation of the NIH system, as a binary format, is that its registry of headers is quite small, without space for tentative, experimental, or vendored entries. Some additional entries have been added without a binary tag at all, presumably for ASCII-only use.¶

5.1.1. Translation from multihash to named-information hash

Some hash functions and output lengths specified in the Multihash registry below correspond to the few entries in the smaller Named Information Hash registry, leading to simple round-trip translations for multihashes produced by these dual-registered hash functions.¶

Formatting a multihash with any other multihash prefix as a Named Information Hash (only useful, of course, for consumers supporting both formats) is facilitated by a generic cross-registry tag for self-describing multihashes, first proposed to the NIH registry by Appendix B in the 2021 internet-draft (v3) of this same document. This also extends the NIH registry to the larger namespace of the multiformats registry.¶

The translation proceeds as follows:

Strip the prefix bytes from the hash value and use the prefix bytes to identify the hash function used from the registry below.
If the multihash prefix corresponds to any tags in the NIH registry:¶
1. translate multicodec tag to NIH tag, i.e., if 0x12 (sha2-256) in multicodec registry, then 0x01 (sha256) in named-information registry¶
2. transcode the hash value from "unsigned varint" to standard MSB binary¶
3. (for binary form:) reattach new prefix to transcoded hash value¶
4. (for ASCII form:) convert prefix to URL format, e.g., ni:///sha-256; for 0x01, and reattach to the base64url-encoded (no padding) transcoded hash value
If multihash prefix does NOT map cleanly to a registered value in NIH registry:¶
1. (for binary form:) prefix existing binary multihash with 0x42 to designate that what follows is a multicodec prefix followed by a ULEB128 hash value.
2. (for ASCII form:) convert the 0x42 prefix to URL format, i.e., ni:///mh; and then append a base64url, no-padding encoding of the entire binary multihash with prefix (and without adding the additional base-64-url-no-padding prefix, u, if using a multibase library for this base-encoding).¶
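The ASCII branch of the steps above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the mapping table is deliberately incomplete (only the sha2-256 pairing is taken from the text), the helper names are hypothetical, and the digest is encoded as base64url without padding, as [RFC6920] specifies for ni URIs.

```python
import base64
from typing import Tuple

# Assumed, deliberately incomplete mapping from multihash identifiers to
# NIH hash name strings; only the sha2-256 pairing comes from the text.
MULTIHASH_TO_NI_NAME = {0x12: "sha-256"}


def read_uvarint(buf: bytes, offset: int) -> Tuple[int, int]:
    """Return (value, next_offset) for the varint starting at buf[offset]."""
    value = 0
    for i in range(9):
        if offset + i >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset + i]
        value |= (byte & 0x7F) << (7 * i)
        if byte & 0x80 == 0:
            return value, offset + i + 1
    raise ValueError("varint longer than nine bytes")


def b64url_nopad(data: bytes) -> str:
    """base64url without '=' padding and without a multibase 'u' prefix."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def multihash_to_ni_uri(mh: bytes) -> str:
    """Translate a binary multihash into an ni: URI with empty authority."""
    code, pos = read_uvarint(mh, 0)
    length, pos = read_uvarint(mh, pos)
    digest = mh[pos:]
    if len(digest) != length:
        raise ValueError("digest length field does not match digest size")
    name = MULTIHASH_TO_NI_NAME.get(code)
    if name is not None:
        # Dual-registered function: swap the prefix, keep only the digest.
        return f"ni:///{name};{b64url_nopad(digest)}"
    # No NIH equivalent: wrap the entire multihash under the "mh" name.
    return f"ni:///mh;{b64url_nopad(mh)}"


sha256_mh = bytes.fromhex(
    "1220" "41dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8"
)
sha1_mh = bytes.fromhex("1114" "8a173fd3e32c0fa78b90fe42d305f202244e2739")
print(multihash_to_ni_uri(sha256_mh))
print(multihash_to_ni_uri(sha1_mh))
```

In the fallback branch the identifier and length stay inside the encoded payload, so a consumer that understands multihash can recover the hash function from the URI alone.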

5.2. Using Multihashes as Namespaced UUIDs

Since the "Named Information Hash" URI scheme conforms to URL syntax (with or without an authority), each valid Named Information Hash URI can be assumed to be unique within the namespace of all valid URLs. As such, any ni:// URL (with or without an authority) can be hashed and used as a UUIDv5 in the URL namespace, i.e. 6ba7b811-9dad-11d1-80b4-00c04fd430c8 (see [RFC9562] section 6.6).

Since this approach relies on SHA-1, and discards all but the most significant 128 bits of the hash output, its security may not be adequate for all applications, as noted in the specification. Alternative ways of using a bounded namespace could include a novel namespace registration for UUIDv5, or a UUIDv8 approach, to content-address arbitrary information with namespaced UUID variants.¶
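The UUIDv5 derivation described above can be sketched with Python's standard `uuid` module, whose `NAMESPACE_URL` constant is the URL namespace identifier quoted in the text. The helper names and the sample input are assumptions made for illustration.

```python
import base64
import hashlib
import uuid


def ni_uri_for_sha256(data: bytes) -> str:
    """Build an ni: URI (empty authority) from the SHA-256 digest of data."""
    digest = hashlib.sha256(data).digest()
    b64 = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"ni:///sha-256;{b64}"


def uuid5_from_ni(ni_uri: str) -> uuid.UUID:
    """Hash the ni: URI into a UUIDv5 within the URL namespace."""
    return uuid.uuid5(uuid.NAMESPACE_URL, ni_uri)


uri = ni_uri_for_sha256(b"example input")  # sample input, chosen arbitrarily
print(uri)
print(uuid5_from_ni(uri))
```

The result is deterministic: the same ni: URI always yields the same UUID, which is what makes it usable as a content-derived identifier.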

6. References

6.1. Normative References

[RFC6234]: Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms (SHA and SHA-based HMAC and HKDF)", RFC 6234, DOI 10.17487/RFC6234, May 2011, <https://www.rfc-editor.org/rfc/rfc6234>.
[RFC6920]: Farrell, S., Kutscher, D., Dannewitz, C., Ohlman, B., Keranen, A., and P. Hallam-Baker, "Naming Things with Hashes", RFC 6920, DOI 10.17487/RFC6920, April 2013, <https://www.rfc-editor.org/rfc/rfc6920>.
[RFC7693]: Saarinen, M., Ed. and J. Aumasson, "The BLAKE2 Cryptographic Hash and Message Authentication Code (MAC)", RFC 7693, DOI 10.17487/RFC7693, November 2015, <https://www.rfc-editor.org/rfc/rfc7693>.
[RFC9562]: Davis, K., Peabody, B., and P. Leach, "Universally Unique IDentifiers (UUIDs)", RFC 9562, DOI 10.17487/RFC9562, May 2024, <https://www.rfc-editor.org/rfc/rfc9562>.
[FIPS202]: "SHA-3 Standard, Permutation-Based Hash and Extendable-Output Functions", 1 August 2015, <http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf>.
[RFC2119]: Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8174]: Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.

6.2. Informative References

[RFC6256]: Eddy, W. and E. Davies, "Using Self-Delimiting Numeric Values in Protocols", RFC 6256, DOI 10.17487/RFC6256, May 2011, <https://www.rfc-editor.org/rfc/rfc6256>.
[RFC8126]: Cotton, M., Leiba, B., and T. Narten, "Guidelines for Writing an IANA Considerations Section in RFCs", BCP 26, RFC 8126, DOI 10.17487/RFC8126, June 2017, <https://www.rfc-editor.org/rfc/rfc8126>.
[DWARF]: "DWARF Debugging Information Format", 1 December 2005, <http://dwarfstd.org/doc/Dwarf3.pdf>.

Appendix B. Test Values

The input test data for all of the examples in this section is:¶

Merkle–Damgård

B.1. SHA-1

0x11148a173fd3e32c0fa78b90fe42d305f202244e2739

The fields for this multihash are - hashing function: sha1 (0x11), length: 20 (0x14), digest: 0x8a173fd3e32c0fa78b90fe42d305f202244e2739¶

B.2. SHA-256

0x122041dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8

The fields for this multihash are - hashing function: sha2-256 (0x12), length: 32 (0x20), digest: 0x41dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8¶

B.3. SHA-512 (truncated to 256 bits)

0x132052eb4dd19f1ec522859e12d89706156570f8fbab1824870bc6f8c7d235eef5f4

The fields for this multihash are - hashing function: sha2-512 (0x13), length: 32 (0x20), digest: 0x52eb4dd19f1ec522859e12d89706156570f8fbab1824870bc6f8c7d235eef5f4¶

B.4. SHA-512

0x134052eb4dd19f1ec522859e12d89706156570f8fbab1824870bc6f8c7d235eef5f4c2cbbafd365f96fb12b1d98a0334870c2ce90355da25e6a1108a6e17c4aaebb0

The fields for this multihash are - hashing function: sha2-512 (0x13), length: 64 (0x40), digest: 0x52eb4dd19f1ec522859e12d89706156570f8fbab1824870bc6f8c7d235eef5f4c2cbbafd365f96fb12b1d98a0334870c2ce90355da25e6a1108a6e17c4aaebb0
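The two sha2-512 test values (B.3 and B.4) share the identifier 0x13 and differ only in digest length. The sketch below, which uses a hypothetical `split_fields` helper that handles single-byte varints only, shows that the 32-byte digest is the leading half of the 64-byte digest, that is, a truncation of the same SHA-512 output rather than the output of the separately registered sha2-512-256 function.

```python
from typing import Tuple

B3 = bytes.fromhex(
    "1320"
    "52eb4dd19f1ec522859e12d89706156570f8fbab1824870bc6f8c7d235eef5f4"
)
B4 = bytes.fromhex(
    "1340"
    "52eb4dd19f1ec522859e12d89706156570f8fbab1824870bc6f8c7d235eef5f4"
    "c2cbbafd365f96fb12b1d98a0334870c2ce90355da25e6a1108a6e17c4aaebb0"
)


def split_fields(mh: bytes) -> Tuple[int, int, bytes]:
    """Split a multihash whose identifier and length each fit in one byte."""
    code, length, digest = mh[0], mh[1], mh[2:]
    if code & 0x80 or length & 0x80:
        raise ValueError("multi-byte varints are not handled in this sketch")
    if len(digest) != length:
        raise ValueError("digest length field does not match digest size")
    return code, length, digest


code3, len3, digest3 = split_fields(B3)
code4, len4, digest4 = split_fields(B4)
print(hex(code3), len3, hex(code4), len4)
print("B.3 digest is a prefix of B.4 digest:", digest4.startswith(digest3))
```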

B.5. blake2b512

0xc0e40240d91ae0cb0e48022053ab0f8f0dc78d28593d0f1c13ae39c9b169c136a779f21a0496337b6f776a73c1742805c1cc15e792ddb3c92ee1fe300389456ef3dc97e2

The fields for this multihash are - hashing function: blake2b-512 (0xb240, serialized as the unsigned variable integer 0xc0e402), length: 64 (0x40), digest: 0xd91ae0cb0e48022053ab0f8f0dc78d28593d0f1c13ae39c9b169c136a779f21a0496337b6f776a73c1742805c1cc15e792ddb3c92ee1fe300389456ef3dc97e2

B.6. blake2b256

0xa0e402207d0a1371550f3306532ff44520b649f8be05b72674e46fc24468ff74323ab030

The fields for this multihash are - hashing function: blake2b-256 (0xb220, serialized as the unsigned variable integer 0xa0e402), length: 32 (0x20), digest: 0x7d0a1371550f3306532ff44520b649f8be05b72674e46fc24468ff74323ab030

B.7. blake2s256

0xe0e40220a96953281f3fd944a3206219fad61a40b992611b7580f1fa091935db3f7ca13d

The fields for this multihash are - hashing function: blake2s-256 (0xb260, serialized as the unsigned variable integer 0xe0e402), length: 32 (0x20), digest: 0xa96953281f3fd944a3206219fad61a40b992611b7580f1fa091935db3f7ca13d

B.8. blake2s128

0xd0e402100a4ec6f1629e49262d7093e2f82a3278

The fields for this multihash are - hashing function: blake2s-128 (0xb250, serialized as the unsigned variable integer 0xd0e402), length: 16 (0x10), digest: 0x0a4ec6f1629e49262d7093e2f82a3278

Appendix C. IANA Considerations

TODO - format current Contributing.md document language to align better with [RFC8126]¶

C.1. Initial Values for the Multihash Identifier Registry

The Multihash Identifier Registry contains hash functions supported by Multihash each with its canonical name, its value in hexadecimal notation, and its status. The following initial entries should be added to the registry to be created and maintained at (the suggested URI):¶

http://www.iana.org/assignments/multihash-identifiers¶

Table 2
Name	Identifier	Status	Specification
identity	0x00	active	n/a
sha1	0x11	active	[RFC6234]
sha2-256	0x12	active	[RFC6234]
sha2-512	0x13	active	[RFC6234]
sha3-512	0x14	active	[FIPS202]
sha3-384	0x15	active	[FIPS202]
sha3-256	0x16	active	[FIPS202]
sha3-224	0x17	active	[FIPS202]
blake3	0x1e	draft	draft-aumasson-blake3 (internet-draft)
sha2-384	0x20	active	[RFC6234]
sha2-256-trunc254-padded	0x1012	active	[RFC6234]
sha2-224	0x1013	active	[RFC6234]
sha2-512-224	0x1014	active	[RFC6234]
sha2-512-256	0x1015	active	[RFC6234]
k12	0x1d01	draft	draft-irtf-cfrg-kangarootwelve-06
blake2b-256	0xb220	active	[RFC7693]
blake2b-512	0xb240	active	[RFC7693]
blake2s-256	0xb260	active	[RFC7693]

NOTE: There are many draft and experimental registrations in the historical community registry, which is maintained by the IPFS Foundation on GitHub.
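Because the hash function identifier is carried as an unsigned variable integer, any registry value of 0x80 or above occupies more than one byte on the wire, and those bytes differ from the hexadecimal identifier shown in the table. The sketch below illustrates this for a handful of entries; the helper name and the choice of sample entries are assumptions made for illustration.

```python
def encode_uvarint(value: int) -> bytes:
    """ULEB128-encode a non-negative integer below 2^63."""
    if not 0 <= value < 1 << 63:
        raise ValueError("value out of range for a nine-byte varint")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


# A few identifiers copied from the registry table above.
REGISTRY_SAMPLE = {
    "identity": 0x00,
    "sha1": 0x11,
    "sha2-256": 0x12,
    "blake2b-256": 0xB220,
    "blake2b-512": 0xB240,
    "blake2s-256": 0xB260,
}

for name, code in REGISTRY_SAMPLE.items():
    print(f"{name:12} {code:#06x} -> {encode_uvarint(code).hex()}")
```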

C.2. The 'mh' Digest Algorithm

This memo registers the "mh" digest-algorithm in the HTTP Digest Algorithm Values registry with the following values:¶

Digest Algorithm: mh

Description: The multibase-serialized value of a multihash-supported algorithm.

References: this document

Status: standard

C.3. The 'mh' Named Information Hash Algorithm

This memo registers the "mh" hash algorithm in the Named Information Hash Algorithm registry with the following values:¶

ID: 49

Hash Name String: mh

Value Length: variable

Reference: this document

Status: current
