Skip to content

Record & Pointer Types#78

Open
JTrenerry wants to merge 25 commits intomainfrom
type-refactor
Open

Record & Pointer Types#78
JTrenerry wants to merge 25 commits intomainfrom
type-refactor

Conversation

@JTrenerry
Copy link
Collaborator

@JTrenerry JTrenerry commented Mar 5, 2026

Adds two new types Records and Pointers. Primarily to support #58.

The value for Pointer is just a Bitvec.t and Records are [offset : Bitvec.t], I think records should probably be any const from AllOps.const, or something but this adds lots of complexities and importing issues

Types

Records

Records consist of a list of fields, fields are defined to be a offset, size and a type.
Syntax {(offset): type, ...}

PoInters

Pointers consist of a lower and upper type.

Syntax: ptr(lower_type, upper_type)

Operations

Records

FSET FACCESS - Field set and Field access

Pointers

PADD - Pointer addition

Changes made to other works

- lib/lang/expr_smt - Added Record to logic type, extremely unsure about this

  • lib/analysis/defuse_bool - Added basic logic about records and pointers, field information isn't taken into consideration and would require a significant rework I believe.
  • Type check changed to make sure the new operations are type correct, load and stores are not currently restricted by pointers

@JTrenerry JTrenerry self-assigned this Mar 5, 2026
@JTrenerry JTrenerry added the enhancement New feature or request label Mar 5, 2026
@JTrenerry JTrenerry marked this pull request as ready for review March 5, 2026 06:17
@JTrenerry JTrenerry requested a review from agle March 5, 2026 06:17
@JTrenerry JTrenerry marked this pull request as draft March 5, 2026 06:30
@JTrenerry JTrenerry marked this pull request as ready for review March 6, 2026 04:55
@JTrenerry
Copy link
Collaborator Author

note: warnings due to eval_smt.ml

@agle
Copy link
Owner

agle commented Mar 6, 2026

#77 contains the smt generator for datatypes but has no evaluation representation or semantics, it probably makes sense to make the records in this pr a specialisation of ADTs.
However the ADT type does not support subtyping, maybe thats a reason to keep the distinct record type and have the type checker allow subtyping if we make the algebraic subtyping the main type system of the IR; we will need an updated type checker to support this though, and I'm not sure what the value is. At the final generation phase we will have to flatten the types down as the type system of boogie/smt doesn't support subtyping.

For the purposes of generating proofs:

  • having the record field set/get intrinsics is good as it means we can generate whatever appropriate implementation for them; either in the SMTLib datatype theory or our own axiomatisation.
  • To simplify generation we disallow anonymous (unnamed) records and require a type declaration be present to define a Sort the values of the record type belong to; the inference pass will have to generate a unique name and declaration for each different record type.
  • I understand there is no ordering or relative offset semantics to the records here so we may as well make the field keys strings so we can eventually make them a human readable name? However if we don't encode a layout explicitly this makes it impossible to store records to the memory byte array---maybe an operation we need to maintain?
  • I think it would be preferable to make pointers be integers rather than bitvectors so that the pointer type encodes the assumption of well behaved pointer arithmetic; absence of overflow etc. (this will require a stronger analysis than type inference to translate the operations soundly though---the naive transform could just do pointer calculations as bitvectors then coerce the bitvector to an integer at the last minute (this would require adding a unary coercion operator). Maybe its not worth the complexity though and we could just make pointers and bitvectors interchangeable as far as the operations are concerned.
  • Bitvectors can be negative so having both poiner addition and subtraction is redundant (assuming it is signed addition)
  • If the pointer type is based on the SVA should pointer values include the region/global variable it is pointing to?

@JTrenerry
Copy link
Collaborator Author

JTrenerry commented Mar 6, 2026

I think it would be preferable to make pointers be integers rather than bitvectors so that the pointer type encodes the assumption of well behaved pointer arithmetic; absence of overflow etc. (this will require a stronger analysis than type inference to translate the operations soundly though---the naive transform could just do pointer calculations as bitvectors then coerce the bitvector to an integer at the last minute (this would require adding a unary coercion operator). Maybe its not worth the complexity though and we could just make pointers and bitvectors interchangeable as far as the operations are concerned.

I did think about this, but thinking about it more, it is a lot of effort and at least in my head would require infinite passes (not realistically but theoretically).

Bitvectors can be negative so having both poiner addition and subtraction is redundant (assuming it is signed addition)

True

If the pointer type is based on the SVA should pointer values include the region/global variable it is pointing to?

I would have to double check, but I believe load and store operations are the only thing that can make pointers so yeah they could

I understand there is no ordering or relative offset semantics to the records here so we may as well make the field keys strings so we can eventually make them a human readable name?

The current key is based on the bit offset, it can show gaps in the record, though this would be only the gaps inbetween fields or at the start and cannot show gaps at the end.

To simplify generation we disallow anonymous (unnamed) records and require a type declaration be present to define a Sort the values of the record type belong to; the inference pass will have to generate a unique name and declaration for each different record type.

Could / Should this be extracted out into it's own shared generator? i.e. all type declarations are named by the same function which uses the ID.gen setup?

having the record field set/get intrinsics is good as it means we can generate whatever appropriate implementation for them; either in the SMTLib datatype theory or our own axiomatisation.

Meaning I should add / make the record stuff intrinsic operations instead?

However the ADT type does not support subtyping

What do you mean by subtyping here? As in records cannot have a field that has a type? I don't think records really have subtypes but are composed of types instead.

@JTrenerry
Copy link
Collaborator Author

If the pointer type is based on the SVA should pointer values include the region/global variable it is pointing to?

The pointer values could store this information if it is needed. Although it may be possible for pointers to be joined when they are pointing at different regions.

@agle
Copy link
Owner

agle commented Mar 9, 2026

I did think about this, but thinking about it more, it is a lot of effort and at least in my head would require infinite passes (not realistically but theoretically).

It shouldn't be intractable; we could use a modified taint analysis and similarly to any other analysis we can push the precise pointer semantics through as far as possible and in the worst case when the safety of pointer operations degrades to a too-hard termination proof we can overapproximate/widen by inserting a trapping coercion. Like I said we can separate concerns by just always insert a trapping coercion initially and later write analyses that clean it up and improve the precision. That's not to say its neccessarily the best idea, currently all our pointers are represented by bitvectors anyway so we get some benefit from just tagging variables/values as storing pointers.

The current key is based on the bit offset, it can show gaps in the record, though this would be only the gaps inbetween [sic] fields or at the start and cannot show gaps at the end.

If you allow integer values to be record fields you are making it impossible define a finite layout for the record I am saying you have to choose one well-defined sematics for your records: either

(1) a record is an algebraic product type with disjoint fields that are accessed by a denotational accessor that is purely a name matching the field name, record values are incomparable with any other type. In this case numeric offsets are just misleading.
(2) a record is a description of a memory layout with precisely defined field offsets and sizes, fields correspond to offsets into the datatype, you define the operation of accessing multiple fields simultaneously, and allow the record to be memcpy'd as a bitvector/byte array representation. A record is a subtype of bitvector of the same size.

I believe the binsub paper is probably going to infer something more similar to (2) and we can abstract that to a type like (1) by using nested records when multiple-field accesses have been observed. (1) however puts more burden on the transform as we need a new memory representation to allow storing record values to memory without converting them to a bitvector representation. My point is just that we make a deliberate choice and don't end up with something messy in between these two.

Could / Should this be extracted out into it's own shared generator? i.e. all type declarations are named by the same function which uses the ID.gen setup?

Yeah the question is more whether we keep this attached to the program to allow generating distinct types later, it might be useful. It would probably make sense to add a field to program storing a typing context for non-local type information, but I want to ensuring that as much as possible type information is stored locally (i.e. attached to expressions not in some external datatstructure) so analyses don't have to pass through a type context object in order to reason about types.

@JTrenerry
Copy link
Collaborator Author

@katrinafyi looks like we use different formatters for nix, what one do you use?

@katrinafyi
Copy link
Collaborator

@katrinafyi looks like we use different formatters for nix, what one do you use?

I actually do the formatting myself, because I don't like the formatter styles (they all prioritise smaller diff over anything else). But if you want to use a formatter, that's fine too. I can suck it up (or make my own formatter).

@JTrenerry
Copy link
Collaborator Author

@katrinafyi looks like we use different formatters for nix, what one do you use?

I actually do the formatting myself, because I don't like the formatter styles (they all prioritise smaller diff over anything else). But if you want to use a formatter, that's fine too. I can suck it up (or make my own formatter).

Okay, I shouldn't be touching those files often anyway, and probably only flake.nix, but you may have to suck it up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants