Region-based Memory Management #18

New Issue

pca006132 · 2021-06-21T16:03:59+08:00

pca006132 commented

2021-06-21 16:03:59 +08:00

The escape analysis in the existing artiq compiler is both unsound (could be fixed) and restrictive. The restriction is due to requiring everything to be stack-allocated, data used by both the caller and callee must be allocated by the caller. Professor Lionel PARREAUX (@LPTK) proposed an alternative design by using region-based memory management. This issue would give an overview of the idea.

Resources:

Proposal (Stack-Based Memory Management Through Regions)
Paper: Region-Based Memory Management
Paper: A Retrospective on Region-Based Memory Management
Chapter 3 of Advanced Topics in Types and Programming Languages.

Our current memory management method is to allocate in the current stackframe, and the stackframe together with all allocated values would be forgotten once the current function returns. Hence, one cannot allocate in the callee and return it back to the caller, as the allocated value would be dropped. Instead, we may have to allocate the buffer from the caller and copy the data to it.

def foo():
    # not possible, the allocated list would be
    # dropped as soon as the function is returned
    return [1, 2, 3, 4]

# we have to do this instead:
def foo1(buf):
    for i in range(4):
        buf[i] = i + 1

Region-based memory management is similar to existing stack allocation, the major difference is that, the allocation can choose which region to allocate. Consider the following pseudocode (the syntax is a bit different from the one in the notion notebook):

def foo(reg):
    # the list would be allocated in a region provided by the caller
    list: Stack[List[int], reg] = [1, 2, 3, 4]
    return list
    
def bar():
    # the region would be dropped as soon as the with block is ended
    with stackreg() as reg:
        print(foo(reg))

The foo function would take a region as a parameter, allocate the list in the region and return the list back to the caller. The caller could pass the region into the foo function, and use the returned list without memory corruption as the value is not dropped until the region is dropped.

Note that we expect the user to write in an implicit syntax, which should be similar (if not the same) as the current artiq syntax. The above explicit syntax is the one we would get after doing region inference.

Each region is similar to a stackframe: 1) The region can only grow but not shrink, 2) Dropping the region would drop all values allocated in it, 3) Regions are dropped according to a stack discipline, the inner regions would be dropped before the outer region. To reduce memory overhead, one function can define multiple regions and drop some of them before the function is returned. However, note that as each region can grow independently, we cannot use a 1D stack to implement the region stack, otherwise an outer region can grow and overlap with an inner region. Skipping the implementation details for now, we should be able to allocate and drop variables and regions in constant time, under the assumption that they are smaller than a certain fixed size. This would be further explained at the end of this issue.

To avoid implicit memory leak, our system would prevent dropping allocated values (allocated in an outer region) within a loop. This would be done by a separate static analysis phase rather than in the type system.

def foo():
    with stackreg() as r:
        for i in range(10):
            # forbidden, as the list would be dropped in the next iteration
            list: Stack[List[int], r] = []

def bar():
    for i in range(10):
        with stackreg() as r:
            # OK now
            list: Stack[List[int], r] = []

Behavior of functions would be analyzed via effect system, which summerize the regions which the function would access/allocate/drop.

For implementation of the region, we would implement each region as a list of fixed-size blocks, and allocate variables within the blocks. When a block is full, we would allocate a new block for the region. These can be done in constant time by having a static buffer of blocks linked as a linked list. However, one problem would be allocating long arrays which can be a lot larger than the size of the fixed-size blocks. There are currently a few ways to implement them:

A linked-list of segments.
Reference counted object.
Implement multiplicity analysis (whether the region is bounded in size or infinite), force the array to be allocated in a finite region and allocate in the stack.

The escape analysis in the existing artiq compiler is both unsound (could be fixed) and restrictive. The restriction is due to requiring everything to be stack-allocated, data used by both the caller and callee must be allocated by the caller. Professor Lionel PARREAUX (@LPTK) proposed an alternative design by using region-based memory management. This issue would give an overview of the idea. Resources: - [Proposal](https://www.notion.so/Memory-management-for-NAC3-s-Real-Time-Enabled-Python-2482c203de8949a28d851de082a51c36) (Stack-Based Memory Management Through Regions) - [Paper: Region-Based Memory Management](https://www.sciencedirect.com/science/article/pii/S0890540196926139) - [Paper: A Retrospective on Region-Based Memory Management](https://link.springer.com/content/pdf/10.1023/B:LISP.0000029446.78563.a4.pdf) - Chapter 3 of Advanced Topics in Types and Programming Languages. --- Our current memory management method is to allocate in the current stackframe, and the stackframe together with all allocated values would be forgotten once the current function returns. Hence, one cannot allocate in the callee and return it back to the caller, as the allocated value would be dropped. Instead, we may have to allocate the buffer from the caller and copy the data to it. ```python def foo(): # not possible, the allocated list would be # dropped as soon as the function is returned return [1, 2, 3, 4] # we have to do this instead: def foo1(buf): for i in range(4): buf[i] = i + 1 ``` Region-based memory management is similar to existing stack allocation, the major difference is that, the allocation can choose which region to allocate. Consider the following pseudocode (the syntax is a bit different from the one in the notion notebook): ```python def foo(reg): # the list would be allocated in a region provided by the caller list: Stack[List[int], reg] = [1, 2, 3, 4] return list def bar(): # the region would be dropped as soon as the with block is ended with stackreg() as reg: print(foo(reg)) ``` The `foo` function would take a *region* as a parameter, allocate the list in the region and return the list back to the caller. The caller could pass the region into the `foo` function, and use the returned list without memory corruption as the value is not dropped until the region is dropped. Note that we expect the user to write in an implicit syntax, which should be similar (if not the same) as the current artiq syntax. The above explicit syntax is the one we would get after doing region inference. Each region is similar to a stackframe: 1) The region can only grow but not shrink, 2) Dropping the region would drop all values allocated in it, 3) Regions are dropped according to a stack discipline, the inner regions would be dropped before the outer region. To reduce memory overhead, one function can define multiple regions and drop some of them before the function is returned. However, note that as each region can *grow independently*, we cannot use a 1D stack to implement the region stack, otherwise an outer region can grow and overlap with an inner region. Skipping the implementation details for now, we should be able to allocate and drop variables and regions in constant time, *under the assumption that they are smaller than a certain fixed size*. This would be further explained at the end of this issue. To avoid implicit memory leak, our system would prevent dropping allocated values (allocated in an outer region) within a loop. This would be done by a separate static analysis phase rather than in the type system. ```python def foo(): with stackreg() as r: for i in range(10): # forbidden, as the list would be dropped in the next iteration list: Stack[List[int], r] = [] def bar(): for i in range(10): with stackreg() as r: # OK now list: Stack[List[int], r] = [] ``` Behavior of functions would be analyzed via effect system, which summerize the regions which the function would access/allocate/drop. For implementation of the region, we would implement each region as a list of fixed-size blocks, and allocate variables within the blocks. When a block is full, we would allocate a new block for the region. These can be done in constant time by having a static buffer of blocks linked as a linked list. However, one problem would be allocating long arrays which can be a lot larger than the size of the fixed-size blocks. There are currently a few ways to implement them: 1. A linked-list of segments. 2. Reference counted object. 3. Implement multiplicity analysis (whether the region is bounded in size or infinite), force the array to be allocated in a finite region and allocate in the stack.

LPTK commented

2021-06-21 18:33:38 +08:00

Thanks for the nice summary of the proposal, @pca006132.

Another approach to handling large arrays is to go through one layer of indirection. If the array size does not fit into a region, allocate all the elements of the array in separate regions, and represent the array as a table specifying in which of these regions each index range is allocated.

For instance, assume region blocks (or let's call them region pages) are 16KiB, or 16,384 bytes. If we have 1GiB of memory (which is apparently the case on the considered hardware), then there are at most 2^16 such pages to reference, so page addresses can be encoded on 16 bits, or two bytes.

Arrays up to about 16,378 bytes can be stored in a simple way, with each element after the other, except possibly for one discontinuity (if the array overlaps between two pages). Before the elements, the array representation includes both the array size and the number of elements of the array that are on the same page, as opposed to the next one. If we store pointers to the next page at the end of each region, we can easily access the following elements of the array which are located on the next page.

For bigger arrays, we could use an entire page as a table containing references to up to 8,192 other pages that hold the array elements. To know where the element at an index i is located, we divide i by 16,384 (using bit shift) and look up the page at this index in the table. We then use the remainder of the division to index the corresponding page. This method can accommodate arrays of up to 128MiB, which I assume would be enough for most applications.

Bigger arrays could probably be preallocated statically, and live for the duration of the program.

Thanks for the nice summary of the proposal, @pca006132. Another approach to handling large arrays is to go through one layer of indirection. If the array size does not fit into a region, allocate all the elements of the array in separate regions, and represent the array as a table specifying in which of these regions each index range is allocated. For instance, assume region blocks (or let's call them region pages) are 16KiB, or 16,384 bytes. If we have 1GiB of memory (which is apparently the case on the considered hardware), then there are at most 2^16 such pages to reference, so page addresses can be encoded on 16 bits, or two bytes. Arrays up to about 16,378 bytes can be stored in a simple way, with each element after the other, except possibly for one discontinuity (if the array overlaps between two pages). Before the elements, the array representation includes both the array size and the number of elements of the array that are on the same page, as opposed to the next one. If we store pointers to the next page at the end of each region, we can easily access the following elements of the array which are located on the next page. For bigger arrays, we could use an entire page as a table containing references to up to 8,192 other pages that hold the array elements. To know where the element at an index i is located, we divide i by 16,384 (using bit shift) and look up the page at this index in the table. We then use the remainder of the division to index the corresponding page. This method can accommodate arrays of up to 128MiB, which I assume would be enough for most applications. Bigger arrays could probably be preallocated statically, and live for the duration of the program.

sb10q commented

2021-06-21 19:17:19 +08:00

Another approach to handling large arrays is to go through one layer of indirection. If the array size does not fit into a region, allocate all the elements of the array in separate regions, and represent the array as a table specifying in which of these regions each index range is allocated.

We want compatibility with FORTRAN (to be able to use linear algebra libraries), so this does not sound like a possibility.

> Another approach to handling large arrays is to go through one layer of indirection. If the array size does not fit into a region, allocate all the elements of the array in separate regions, and represent the array as a table specifying in which of these regions each index range is allocated. We want compatibility with FORTRAN (to be able to use linear algebra libraries), so this does not sound like a possibility.

LPTK commented

2021-06-21 20:50:26 +08:00

Ah ok, I see.

In that case, the simplest solution is to force these arrays to be on the call stack (or on an auxiliary stack), as you say. This could be enforced in the language by having to declare and use them as with resources, and the type system could ensure that references to them would not leak.

Reference-counting would not work, unless we choose to do a lot more work upon region deallocation, scanning the deallocated region for references whose counts need decrementing. But this would make region deallocation not constant-time anymore, and memory management would then introduce hard-to-predict latencies.

Ah ok, I see. In that case, the simplest solution is to force these arrays to be on the call stack (or on an auxiliary stack), as you say. This could be enforced in the language by having to declare and use them as `with` resources, and the type system could ensure that references to them would not leak. Reference-counting would not work, unless we choose to do a lot more work upon region deallocation, scanning the deallocated region for references whose counts need decrementing. But this would make region deallocation not constant-time anymore, and memory management would then introduce hard-to-predict latencies.

Sign in to join this conversation.

No Label

No Milestone

No Assignees

3 Participants

Notifications

Due Date

The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: M-Labs/nac3-spec#18