Internet-Draft Multihash October 2024
Benet, et al. Expires 11 April 2025
Workgroup: Network Working Group
Internet-Draft: draft-multiformats-multihash-latest
Published: October 2024
Intended Status: Standards Track
Expires: 11 April 2025
Authors:
J. Benet, Protocol Labs
M. Sporny, Digital Bazaar
J. Caballero, Interplanetary File System Foundation

The Multihash Data Format

Abstract

Cryptographic hash functions often generate multiple output sizes and encodings. This variability makes it difficult for applications to examine a series of bytes and determine which hash function produced them, and thus such context is traditionally passed alongside the resulting bytes in defined protocols. Multihash inlines this context information so that it can travel and be translated more easily, decoupled from specific protocols.

About This Document

This note is to be removed before publishing as an RFC.

Status information for this document may be found at https://datatracker.ietf.org/doc/draft-multiformats-multihash/.

Source for this draft and an issue tracker can be found at https://github.com/ipfs-tech/multiformats-multihash-v8.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 11 April 2025.

1. Feedback

This specification is a joint work product of The IPFS Foundation and the W3C Credentials Community Group. Feedback related to this specification should be logged in the issue tracker and/or sent to the Multiformats mailing list at the IETF.

2. Introduction

Multihash responds to evolving design patterns in systems that depend on cryptographically secure hash functions. It contributes to cryptographic agility, allows for easier translation (e.g. across multiple wire formats) within a given system, and supports ambient verifiability throughout a system, not just in the context of protocols. To facilitate self-describing hashes rather than context-bound ones, multihash inlines an identifier representing the hash function used (and its configuration or auxiliary inputs) as a prefix before the hash function output. This provides a valuable building block for content-addressing systems and URI-safety mechanisms alike.

3. Conventions and Definitions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

4. The Multihash Fields

A multihash follows the TLV (type-length-value) pattern and consists of several fields composed of a combination of unsigned variable length integers and byte information.

4.1. Multihash Core Data Types

The following section details the core data types used by the Multihash data format.

4.1.1. Unsigned Variable Integer

The unsigned variable integer is a data type that expresses an unsigned integer in a variable number of bytes. The format uses the Unsigned Little Endian Base 128 (ULEB128) encoding that was canonically defined in Appendix C of the DWARF Debugging Information Format standard [DWARF], initially released in 1993, and further specified in 2011 by the IRTF [RFC6256] as Self-Delimiting Numeric Values (SDNVs).

As suggested by the name, this variable length encoding is only capable of representing unsigned integers. Further, while there is no theoretical maximum integer value that can be represented by the format, implementations MUST NOT encode more than nine (9) bytes, giving a practical range of 0 to 2^63 - 1. When encoding an unsigned variable integer, the unsigned integer is serialized seven bits at a time, starting with the least significant bits. The most significant bit of each output byte is set when another byte follows and clear on the final byte.

Table 1
Value   Encoding (bits)              Hexadecimal notation
1       00000001                     0x01
127     01111111                     0x7F
128     10000000 00000001            0x8001
255     11111111 00000001            0xFF01
300     10101100 00000010            0xAC02
16384   10000000 10000000 00000001   0x808001

Implementations MUST restrict the size of the varint to a maximum of nine bytes (63 bits of payload). This practical maximum is used to avoid memory attacks on the encoding. There is no theoretical limit, and future specifications can grow this number if it is truly necessary to have code or length values equal to or larger than 2^63.
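The encoding rules above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the names encode_uvarint, decode_uvarint, and MAX_VARINT_BYTES are hypothetical, and the decoder does not reject non-minimal encodings, a check that an implementation may want to add.

```python
MAX_VARINT_BYTES = 9  # practical limit stated in this section (63 bits of payload)


def encode_uvarint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint (ULEB128)."""
    if value < 0:
        raise ValueError("unsigned varints cannot represent negative integers")
    if value >= 1 << (7 * MAX_VARINT_BYTES):
        raise ValueError("value does not fit in nine bytes")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the seven least significant bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)         # final byte: continuation bit clear
            return bytes(out)


def decode_uvarint(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode one varint starting at data[offset]; return (value, next_offset)."""
    value = 0
    for i in range(MAX_VARINT_BYTES):
        if offset + i >= len(data):
            raise ValueError("truncated varint")
        byte = data[offset + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, offset + i + 1
    raise ValueError("varint longer than nine bytes")


# Reproduce the rows of Table 1.
for n in (1, 127, 128, 255, 300, 16384):
    print(n, "0x" + encode_uvarint(n).hex().upper())
```

Returning the next offset from the decoder lets a caller read consecutive varints from one buffer, which is how the identifier and length fields of a multihash are laid out.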

4.2. Multihash Fields

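A multihash follows the TLV (type-length-value) pattern: the hash function identifier is the type, the digest length is the length, and the digest value is the value.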

4.2.1. Hash Function Identifier

The hash function identifier is an unsigned variable integer identifying the hash function. Possible values for this field are provided in The Multihash Identifier Registry (see IANA considerations below).

4.2.2. Digest Length

The digest length is an unsigned variable integer counting the length of the digest in bytes.

4.2.3. Digest Value

The digest value is the output of the hash function. Its length in bytes MUST be exactly the value given in the digest length field.

4.3. A Multihash Example

For example, the following is an expression of a SHA2-256 hash in hexadecimal notation (spaces added for readability purposes):

0x12 20 41dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8

The first byte (0x12) specifies the SHA2-256 hash function. The second byte (0x20) specifies the length of the hash, which is 32 bytes. The rest of the data specifies the value of the output of the hash function.
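As an illustration, the example above can be split into its three fields with a short Python sketch. The helper names read_uvarint and decode_multihash are hypothetical, and the strict length check reflects one reading of Section 4.2.3 rather than a mandated behavior.

```python
def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Read one unsigned varint (at most nine bytes); return (value, next_pos)."""
    value = 0
    for i in range(9):
        if pos + i >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, pos + i + 1
    raise ValueError("varint longer than nine bytes")


def decode_multihash(buf: bytes) -> tuple[int, int, bytes]:
    """Split a multihash into (hash function identifier, digest length, digest)."""
    code, pos = read_uvarint(buf, 0)
    length, pos = read_uvarint(buf, pos)
    digest = buf[pos:]
    if len(digest) != length:
        raise ValueError("digest length field does not match the digest size")
    return code, length, digest


example = bytes.fromhex(
    "1220"
    "41dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8"
)
code, length, digest = decode_multihash(example)
print(hex(code), length, digest.hex())
```

Because both prefix fields are varints, the same routine also handles identifiers and lengths that occupy more than one byte.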

5. Prior Art And Translation

In the IETF's corpus of normative protocols, the following partial overlaps of problem space are worth familiarizing oneself with to minimize collisions and confusion:

5.1. Named Information Hash

The "Named Information Hash" URI scheme allows for minimally self-describing hash strings to serve as content-identifiers for arbitrary binary inputs. This lightweight identifier scheme is defined in [RFC6920], and the supported hash-context prefixes live in the IANA "Named Information Hash Algorithm Registry" (https://www.iana.org/assignments/named-information/named-information.xhtml#hash-alg). Its syntactic similarity to HTTP headers and support for MIME content-types make it potentially useful for web use-cases, but use-cases are not constrained by the URI scheme, only hinted at by Sections 3 through 7 of the specification.

One limitation of the NIH system, as a binary format, is that its registry of headers is quite small, without space for tentative, experimental, or vendored entries. Some additional entries have been added without a binary tag at all, presumably for ASCII-only use.

5.1.1. Translation from multihash to named-information hash

Some hash functions and output lengths specified in the Multihash registry below correspond to the few entries in the smaller Named Information Hash registry, leading to simple round-trip translations for multihashes produced by these dual-registered hash functions.

Formatting a multihash with any other multihash prefix as a Named Information Hash (only useful, of course, for consumers supporting both formats) is facilitated by a generic cross-registry tag for self-describing multihashes, first proposed to the NIH registry by Appendix B in the 2021 internet-draft (v3) of this same document. This also extends the NIH registry to the larger namespace of the multiformats registry.

The translation is achieved as follows:

  1. Strip the prefix bytes from the hash value and use the prefix bytes to identify the hash function used from the registry below.

  2. If the multihash prefix corresponds to any tags in the NIH registry:

    1. translate the multicodec tag to the NIH tag, e.g., 0x12 (sha2-256) in the multicodec registry becomes 0x01 (sha-256) in the named-information registry

    2. transcode the hash value from "unsigned varint" to standard MSB binary

    3. (for binary form:) reattach new prefix to transcoded hash value

    4. (for ASCII form:) convert the prefix to URL format, e.g., ni:///sha-256; for 0x01, and reattach it to the base64-encoded transcoded hash value

  3. If multihash prefix does NOT map cleanly to a registered value in NIH registry:

    1. (for binary form:) prefix the existing binary multihash with 0x42 to designate that what follows is a multicodec prefix followed by a ULEB128 hash value.

    2. (for ASCII form:) convert the 0x42 prefix to URL format, i.e., ni:///mh; and then append a base64url, no-padding encoding of the entire binary multihash with prefix (and without adding the additional base-64-url-no-padding prefix, u, if using a multibase library for this base-encoding).
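The ASCII branches of the procedure above (steps 2.4 and 3.2) could look roughly like the following Python sketch. It is illustrative only: the function names are hypothetical, the mapping table contains just the sha2-256 pairing mentioned in the text, and the binary forms are not covered.

```python
import base64

# Hypothetical mapping for illustration; only the sha2-256 / sha-256 pairing
# is named in the text above. A real table would come from both registries.
MULTIHASH_TO_NI_NAME = {0x12: "sha-256"}


def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Read one unsigned varint (at most nine bytes); return (value, next_pos)."""
    value = 0
    for i in range(9):
        if pos + i >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, pos + i + 1
    raise ValueError("varint longer than nine bytes")


def b64url_nopad(raw: bytes) -> str:
    """base64url without padding and without a multibase 'u' prefix."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def multihash_to_ni_uri(mh: bytes) -> str:
    """Format a binary multihash as an ni: URI (ASCII form only)."""
    code, pos = read_uvarint(mh, 0)
    length, pos = read_uvarint(mh, pos)
    digest = mh[pos:]
    if len(digest) != length:
        raise ValueError("digest length field does not match the digest size")
    name = MULTIHASH_TO_NI_NAME.get(code)
    if name is not None:
        # Dual-registered function: emit the bare digest under the NI name.
        return f"ni:///{name};{b64url_nopad(digest)}"
    # No NI equivalent: wrap the entire multihash under the generic "mh" tag.
    return f"ni:///mh;{b64url_nopad(mh)}"


sha256_mh = bytes.fromhex(
    "122041dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8"
)
sha1_mh = bytes.fromhex("11148a173fd3e32c0fa78b90fe42d305f202244e2739")
print(multihash_to_ni_uri(sha256_mh))
print(multihash_to_ni_uri(sha1_mh))
```

In this sketch the fallback branch keeps the multihash prefix inside the encoded payload, so a consumer that understands both formats can recover the hash function from the URI alone.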

5.2. Using Multihashes as Namespaced UUIDs

Since the "Named Information Hash" URI scheme conforms to URL syntax (with or without an authority), each valid Named Information Hash URI can be assumed to be unique within the namespace of all valid URLs. As such, any ni:// URL (with or without an authority) can be hashed and used as a UUIDv5 in the URL namespace, i.e. 6ba7b811-9dad-11d1-80b4-00c04fd430c8 (see Section 6.6 of [RFC9562]).

Since this approach relies on SHA-1, and discards all but the most significant 128 bits of the hash output, its security may not be adequate for all applications, as noted in the specification. Alternative ways of using a bounded namespace could include a novel namespace registration for UUIDv5, or a UUIDv8 approach, to content-address arbitrary information with namespaced UUID variants.
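A minimal Python sketch of the UUIDv5 derivation described in this section follows. It relies on the standard library's uuid.uuid5 and uuid.NAMESPACE_URL; the helper name ni_uri_to_uuid5 and the choice of the "mh" URI form for the sample input are assumptions made for illustration.

```python
import base64
import uuid


def ni_uri_to_uuid5(ni_uri: str) -> uuid.UUID:
    """Derive a name-based (SHA-1) UUID from an ni: URI in the URL namespace."""
    return uuid.uuid5(uuid.NAMESPACE_URL, ni_uri)


# Sample input: a sha2-256 multihash wrapped in the generic "mh" URI form.
mh = bytes.fromhex(
    "122041dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8"
)
ni_uri = "ni:///mh;" + base64.urlsafe_b64encode(mh).rstrip(b"=").decode("ascii")

derived = ni_uri_to_uuid5(ni_uri)
print(uuid.NAMESPACE_URL)  # the URL namespace identifier quoted above
print(derived, derived.version)
```

The derivation is deterministic, so two parties holding the same ni: URI arrive at the same UUID without coordination. The security caveats noted above still apply.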

6. References

6.1. Normative References

[RFC6234]
Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms (SHA and SHA-based HMAC and HKDF)", RFC 6234, DOI 10.17487/RFC6234, <https://www.rfc-editor.org/rfc/rfc6234>.
[RFC6920]
Farrell, S., Kutscher, D., Dannewitz, C., Ohlman, B., Keranen, A., and P. Hallam-Baker, "Naming Things with Hashes", RFC 6920, DOI 10.17487/RFC6920, <https://www.rfc-editor.org/rfc/rfc6920>.
[RFC7693]
Saarinen, M., Ed. and J. Aumasson, "The BLAKE2 Cryptographic Hash and Message Authentication Code (MAC)", RFC 7693, DOI 10.17487/RFC7693, <https://www.rfc-editor.org/rfc/rfc7693>.
[RFC9562]
Davis, K., Peabody, B., and P. Leach, "Universally Unique IDentifiers (UUIDs)", RFC 9562, DOI 10.17487/RFC9562, <https://www.rfc-editor.org/rfc/rfc9562>.
[FIPS202]
"SHA-3 Standard, Permutation-Based Hash and Extendable-Output Functions", <http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/rfc/rfc8174>.

6.2. Informative References

[RFC6256]
Eddy, W. and E. Davies, "Using Self-Delimiting Numeric Values in Protocols", RFC 6256, DOI 10.17487/RFC6256, <https://www.rfc-editor.org/rfc/rfc6256>.
[RFC8126]
Cotton, M., Leiba, B., and T. Narten, "Guidelines for Writing an IANA Considerations Section in RFCs", BCP 26, RFC 8126, DOI 10.17487/RFC8126, <https://www.rfc-editor.org/rfc/rfc8126>.
[DWARF]
"DWARF Debugging Information Format", <http://dwarfstd.org/doc/Dwarf3.pdf>.

Appendix A. Security Considerations

TODO Security

Appendix B. Test Values

The input test data for all of the examples in this section is:

Merkle–Damgård

B.1. SHA-1

0x11148a173fd3e32c0fa78b90fe42d305f202244e2739

The fields for this multihash are - hashing function: sha1 (0x11), length: 20 (0x14), digest: 0x8a173fd3e32c0fa78b90fe42d305f202244e2739

B.2. SHA-256

0x122041dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8

The fields for this multihash are - hashing function: sha2-256 (0x12), length: 32 (0x20), digest: 0x41dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8

B.3. SHA-512 (truncated to 256 bits)

0x132052eb4dd19f1ec522859e12d89706156570f8fbab1824870bc6f8c7d235eef5f4

The fields for this multihash are - hashing function: sha2-512 (0x13), length: 32 (0x20), digest: 0x52eb4dd19f1ec522859e12d89706156570f8fbab1824870bc6f8c7d235eef5f4

B.4. SHA-512

0x134052eb4dd19f1ec522859e12d89706156570f8fbab1824870bc6f8c7d235eef5f4c2cbbafd365f96fb12b1d98a0334870c2ce90355da25e6a1108a6e17c4aaebb0

The fields for this multihash are - hashing function: sha2-512 (0x13), length: 64 (0x40), digest: 0x52eb4dd19f1ec522859e12d89706156570f8fbab1824870bc6f8c7d235eef5f4c2cbbafd365f96fb12b1d98a0334870c2ce90355da25e6a1108a6e17c4aaebb0

B.5. blake2b512

0xc0e40240d91ae0cb0e48022053ab0f8f0dc78d28593d0f1c13ae39c9b169c136a779f21a0496337b6f776a73c1742805c1cc15e792ddb3c92ee1fe300389456ef3dc97e2

The fields for this multihash are - hashing function: blake2b-512 (0xb240, serialized as the unsigned variable integer 0xc0e402), length: 64 (0x40), digest: 0xd91ae0cb0e48022053ab0f8f0dc78d28593d0f1c13ae39c9b169c136a779f21a0496337b6f776a73c1742805c1cc15e792ddb3c92ee1fe300389456ef3dc97e2

B.6. blake2b256

0xa0e402207d0a1371550f3306532ff44520b649f8be05b72674e46fc24468ff74323ab030

The fields for this multihash are - hashing function: blake2b-256 (0xb220, serialized as the unsigned variable integer 0xa0e402), length: 32 (0x20), digest: 0x7d0a1371550f3306532ff44520b649f8be05b72674e46fc24468ff74323ab030

B.7. blake2s256

0xe0e40220a96953281f3fd944a3206219fad61a40b992611b7580f1fa091935db3f7ca13d

The fields for this multihash are - hashing function: blake2s-256 (0xb260, serialized as the unsigned variable integer 0xe0e402), length: 32 (0x20), digest: 0xa96953281f3fd944a3206219fad61a40b992611b7580f1fa091935db3f7ca13d

B.8. blake2s128

0xd0e402100a4ec6f1629e49262d7093e2f82a3278

The fields for this multihash are - hashing function: blake2s-128 (0xb250, serialized as the unsigned variable integer 0xd0e402), length: 16 (0x10), digest: 0x0a4ec6f1629e49262d7093e2f82a3278
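For implementers who want to generate values of this kind, the following Python sketch builds a blake2b-256 multihash with the standard hashlib module. Section 4.2.1 defines the hash function identifier as an unsigned variable integer, so an identifier above 0x7f, such as 0xb220, occupies more than one byte once serialized. The function names are hypothetical, and encoding the test input as UTF-8 is an assumption; this appendix does not state the byte encoding of the input.

```python
import hashlib


def encode_uvarint(value: int) -> bytes:
    """Encode a non-negative integer below 2^63 as an unsigned varint."""
    if not 0 <= value < 1 << 63:
        raise ValueError("value out of range for a nine-byte varint")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def blake2b_256_multihash(data: bytes) -> bytes:
    """Prefix a 32-byte BLAKE2b digest with its identifier and length."""
    digest = hashlib.blake2b(data, digest_size=32).digest()
    return encode_uvarint(0xB220) + encode_uvarint(len(digest)) + digest


# Assumption: the test input is encoded as UTF-8 (en dash U+2013, a-ring U+00E5).
sample = "Merkle\u2013Damg\u00e5rd".encode("utf-8")
mh = blake2b_256_multihash(sample)
print(mh[:4].hex())  # identifier and length bytes: a0e40220
print(mh[4:].hex())  # digest
```

The first three bytes are the varint form of 0xb220 and the fourth is the length, so the digest begins at offset four rather than offset three.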

Appendix C. IANA Considerations

TODO - format current Contributing.md document language to align better with [RFC8126]

C.1. Initial Values for the Multihash Identifier Registry

The Multihash Identifier Registry contains the hash functions supported by Multihash, each with its canonical name, its identifier in hexadecimal notation, its status, and its specification. The following initial entries should be added to the registry to be created and maintained at (the suggested URI):

http://www.iana.org/assignments/multihash-identifiers

Table 2
Name                      Identifier  Status  Specification
identity                  0x00        active  n/a
sha1                      0x11        active  [RFC6234]
sha2-256                  0x12        active  [RFC6234]
sha2-512                  0x13        active  [RFC6234]
sha3-512                  0x14        active  [FIPS202]
sha3-384                  0x15        active  [FIPS202]
sha3-256                  0x16        active  [FIPS202]
sha3-224                  0x17        active  [FIPS202]
blake3                    0x1e        draft   draft-aumasson-blake3 (internet-draft)
sha2-384                  0x20        active  [RFC6234]
sha2-256-trunc254-padded  0x1012      active  [RFC6234]
sha2-224                  0x1013      active  [RFC6234]
sha2-512-224              0x1014      active  [RFC6234]
sha2-512-256              0x1015      active  [RFC6234]
k12                       0x1d01      draft   draft-irtf-cfrg-kangarootwelve-06
blake2b-256               0xb220      active  [RFC7693]
blake2b-512               0xb240      active  [RFC7693]
blake2s-256               0xb260      active  [RFC7693]

NOTE: There are many draft and experimental registrations in the historical community registry, which is maintained by the IPFS Foundation on GitHub.

C.2. The 'mh' Digest Algorithm

This memo registers the "mh" digest-algorithm in the HTTP Digest Algorithm Values registry with the following values:

Digest Algorithm: mh

Description: The multibase-serialized value of a multihash-supported algorithm.

References: this document

Status: standard

C.3. The 'mh' Named Information Hash Algorithm

This memo registers the "mh" hash algorithm in the Named Information Hash Algorithm registry with the following values:

ID: 49

Hash Name String: mh

Value Length: variable

Reference: this document

Status: current

Acknowledgments

Thanks to Carsten Bormann, Benjamin Goering, Aaron Goldman, Dirk Kutscher, and others for their substantial contributions to this document on the multiformats mailing list.

Authors' Addresses

Juan Benet
Protocol Labs
Manu Sporny
Digital Bazaar
Juan Caballero
Interplanetary File System Foundation