Wednesday, June 15, 2011

Paging in Windows

Windows uses a demand paging algorithm to load pages into memory. It also uses clustering, so that some adjacent pages are loaded along with the faulting page (on the assumption that the process will request them soon).

Windows also has a prefetcher that preloads pages it expects to be used in the near future. The prefetcher makes this decision based on the usage history it has collected over time. It adjusts the priority of different pages to affect their loading behavior.

When the prefetcher initiates the loading of pages, the pages are loaded and added to the Standby list. Later, when a process or the system needs to reference those pages, they are transferred to the Working Set through an inexpensive soft page fault.

Page Priority

As with processes and threads, pages too have priorities assigned to them. The Standby list organizes its pages by priority, and the lowest-priority pages are used first for page replacement. A number of factors determine the priority assigned to a given page, including the priority of the process/thread accessing the page and the usage history of the page.
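
The replacement order described above can be sketched roughly as follows. This is only an illustration: the number of priority levels, the per-priority FIFO queues, and the class and method names are my assumptions, not the actual memory manager implementation.

```python
from collections import deque

NUM_PRIORITIES = 8  # assumed number of priority levels, 0 being the lowest


class StandbyLists:
    """Hypothetical model: one FIFO queue of page frame numbers per priority."""

    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIORITIES)]

    def add(self, pfn, priority):
        # A page leaving a working set is queued under its priority.
        self.queues[priority].append(pfn)

    def take_for_replacement(self):
        # Scan from the lowest priority upward and reuse the oldest page found.
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None  # no standby pages available


standby = StandbyLists()
standby.add(pfn=0x100, priority=5)
standby.add(pfn=0x200, priority=2)
standby.add(pfn=0x300, priority=2)
victim = standby.take_for_replacement()  # the oldest page at priority 2
```

Here the page at priority 5 survives until both priority 2 pages have been reused.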

Page Lists

The Memory Manager keeps pages on different lists depending on their state. This helps it manage the page frame resource for the system. As the need arises, the Memory Manager moves pages between these lists. Following is a short description of the page lists maintained in the system:

  • Working Set: Physical pages that are assigned to a process or system address space are maintained in this set.
  • Modified Page List: Pages that were written to and then removed from the Working Set are kept in this list until their contents are written back to the pagefile. Once a page has been written back, it is moved to the Standby list.
  • Standby list: Pages in this list have almost the same status as they would have in the Working Set. If the need arises, a page can be moved back to the Working Set quickly (soft page fault). It contains valid, current content, so no page-in I/O is required to reuse it. Pages in this list were taken out of a Working Set (sometimes via the Modified Page list), probably because the process is not using them at the moment.
  • Free Page List: Pages are available for use but contain stale data, so they need to be initialized before reuse.
  • Zero Page List: Pages are available for use and are initialized to zero. Used to satisfy demand-zero page requests.
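
The movement of pages between these lists can be pictured as a small state machine. The sketch below is a simplification under my own assumptions: the names are hypothetical, only the moves mentioned in the descriptions above are modelled, and the free-to-zero move (zeroing a free page) is an assumption on my part.

```python
from enum import Enum


class PageList(Enum):
    WORKING_SET = "working set"
    MODIFIED = "modified"
    STANDBY = "standby"
    FREE = "free"
    ZERO = "zero"


# Moves taken from the list descriptions above (simplified).
ALLOWED_MOVES = {
    (PageList.WORKING_SET, PageList.MODIFIED),  # removed from working set while dirty
    (PageList.WORKING_SET, PageList.STANDBY),   # removed from working set while clean
    (PageList.MODIFIED, PageList.STANDBY),      # contents written to the pagefile
    (PageList.STANDBY, PageList.WORKING_SET),   # soft page fault
    (PageList.FREE, PageList.ZERO),             # assumption: free page gets zeroed
    (PageList.ZERO, PageList.WORKING_SET),      # demand-zero request
}


def move(current, target):
    """Return the new list for a page, or raise if the move is not modelled."""
    if (current, target) not in ALLOWED_MOVES:
        raise ValueError(f"move {current.value} -> {target.value} is not modelled")
    return target


state = PageList.WORKING_SET
state = move(state, PageList.MODIFIED)     # a dirty page is trimmed
state = move(state, PageList.STANDBY)      # written back to the pagefile
state = move(state, PageList.WORKING_SET)  # referenced again: soft fault
```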

Page Frame Number (PFN) Database

Although each page frame can be reached directly using its frame number and the page size, there are also certain properties attached to each frame that are maintained separately in a database. The database has one record for each page frame. The layout of this record may vary per page depending on the state of that page, but some data members are common to all. Following is a description of some of the more relevant data members of the PFN data structure:

  • Backward/Forward Pointer: Points to the previous/next PFN record. Used to link pages together when they are on one of the page lists (e.g. Standby, Free, Zero, Modified Page lists, etc).
  • Page Priority
  • Reference/Share Count
  • PTE Address/PFN of PTE: Information used to point back to the PTE that refers to this page from the user/system address space.
  • Original PTE Content: Used for restoring the PTE value when the page is removed from the Working Set of the process/system.
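
To make the record layout more concrete, here is a rough sketch of a PFN database whose records are linked into a page list through their forward/backward members. The field names, types, and helper functions are hypothetical; the real record is a packed kernel structure whose layout depends on the page state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PfnEntry:
    forward: Optional[int] = None   # PFN of the next page on the same list
    backward: Optional[int] = None  # PFN of the previous page on the same list
    priority: int = 0
    share_count: int = 0
    pte_address: int = 0            # back pointer to the PTE mapping this page
    original_pte: int = 0           # PTE value to restore on removal


@dataclass
class ListHead:
    first: Optional[int] = None
    last: Optional[int] = None


def append(database, head, pfn):
    """Link the record for `pfn` at the tail of the list described by `head`."""
    entry = database[pfn]
    entry.forward = None
    entry.backward = head.last
    if head.last is None:
        head.first = pfn
    else:
        database[head.last].forward = pfn
    head.last = pfn


def walk(database, head):
    """Yield the PFNs on a list by following the forward pointers."""
    pfn = head.first
    while pfn is not None:
        yield pfn
        pfn = database[pfn].forward


database = [PfnEntry() for _ in range(8)]  # one record per page frame
standby = ListHead()
append(database, standby, 5)
append(database, standby, 2)
standby_pfns = list(walk(database, standby))
```

Because the links are frame numbers stored inside the records themselves, a page can sit on a list without any extra allocation.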


Monday, June 13, 2011

Prototype PTE and Shared Memory

In order to enable sharing memory among different processes, an additional layer is added to the virtual-to-physical address translation. For each memory section that is being shared, a Segment structure (courtesy of the Section Object) is created. It contains the complete list of Prototype PTEs pointing to the shared pages of that section.

A Prototype PTE is a special type of PTE. It forms the basic construct for supporting shared memory in Windows. Prototype PTEs have much the same format as regular PTEs; the Prototype bit is set in the invalid process PTEs that refer to them. A Prototype PTE contains enough information to access the desired physical memory, and it helps the memory manager bring a page into memory if it is not already there. For example, for pages backed by the pagefile or a mapped file, it contains information such as the page offset.

Prototype PTEs do not appear in the page tables and are not directly used for address translation. They are only present in the Segment structure. When a process opens the Section object for a shared memory region, the page table for that process is populated with PTEs that point to the Prototype PTEs in the corresponding Segment structure. When the process first references any of the shared memory mapped into its address space, the memory manager uses the information in the Prototype PTE to update the process PTE with the page frame number of the resident shared page.

If this is the first time any process has referenced that shared memory, the Prototype PTE is also updated to point directly to the resident page. During this, the corresponding PFN database entry is updated to indicate the number of processes sharing that page. The PFN database entry also contains a back pointer to the Prototype PTE, so that if the memory manager later decides to change the page state and move the page elsewhere, it can use this pointer to update the Prototype PTE accordingly.
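
The flow above can be modelled in a few lines. This is a toy model under stated assumptions: all class and function names are hypothetical, frame allocation is a simple counter, and real PTEs are packed bit fields rather than objects.

```python
class PrototypePte:
    """Lives in the Segment structure, one per shared page."""

    def __init__(self):
        self.valid = False
        self.pfn = None


class PfnDatabase:
    def __init__(self):
        self.share_count = {}      # pfn -> number of process PTEs mapping it
        self.prototype_of = {}     # pfn -> back pointer to the prototype PTE
        self.next_free_pfn = 0x100  # hypothetical first free frame

    def allocate(self, prototype_pte):
        pfn = self.next_free_pfn
        self.next_free_pfn += 1
        self.share_count[pfn] = 0
        self.prototype_of[pfn] = prototype_pte
        return pfn


class ProcessPte:
    """Starts out invalid and only refers to the prototype PTE."""

    def __init__(self, prototype_pte):
        self.valid = False
        self.pfn = None
        self.prototype_pte = prototype_pte


def fault_in(process_pte, pfn_db):
    """Resolve the first reference a process makes to a shared page."""
    if process_pte.valid:
        return process_pte.pfn
    proto = process_pte.prototype_pte
    if not proto.valid:
        # First reference by any process: make the page resident.
        proto.pfn = pfn_db.allocate(proto)
        proto.valid = True
    pfn_db.share_count[proto.pfn] += 1
    process_pte.pfn = proto.pfn
    process_pte.valid = True
    return process_pte.pfn


segment = [PrototypePte() for _ in range(4)]      # prototype PTEs of one section
pfn_db = PfnDatabase()
process_a = [ProcessPte(p) for p in segment]
process_b = [ProcessPte(p) for p in segment]
pfn_a = fault_in(process_a[0], pfn_db)
pfn_b = fault_in(process_b[0], pfn_db)            # same frame, no new allocation
```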



Sunday, June 12, 2011

Page States

If a given page is in the Working Set, the PTE that points to it has its Valid bit set to one. This means that the PTE points to a valid physical page, and the PTE contains the page frame number of that physical page.

Otherwise, if the Valid bit is zero (broadly indicating an invalid page), the page can be in one of several other page states. The actual state can be determined by looking at the remaining PTE fields. Following is a short description of some of these other page states:

 - Page is backed by the page file: The PTE contains the page file number and the page file offset.
 - Demand Zero Page: When the page is first referenced, the memory manager should allocate a zero-initialized page and assign it to the given PTE. A demand-zero PTE looks like a page file PTE, but with the page file number and the page file offset set to zero.
 - Transition: The page is resident but is not in the Working Set. It could be on the Standby list, the Modified Page list, etc. The PTE contains the page frame number of the resident page. The Transition bit is set to one and the Prototype bit is zero to indicate the transition state of this page. In order to use the page again, the memory manager has to bring it back into the Working Set.
 - Zero PTE: No page has been assigned to this PTE yet. When it is first referenced, the memory manager should check the VADs to determine the reserve/commit state of the virtual memory and act accordingly. If the virtual memory is not yet committed, the memory manager raises an access violation.
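
The decision process above can be sketched as a small classifier. The bit positions used here (Valid at bit 0, page file number in bits 1 to 4, Prototype at bit 10, Transition at bit 11, PFN or page file offset in bits 12 to 31) are an assumption based on the 32-bit non-PAE x86 layout, and the function name is hypothetical.

```python
VALID = 1 << 0
PROTOTYPE = 1 << 10
TRANSITION = 1 << 11


def classify_pte(pte):
    """Return a label for the state encoded in a 32-bit PTE value."""
    if pte & VALID:
        return "valid"
    if pte == 0:
        return "zero"  # consult the VADs to decide what to do
    if pte & PROTOTYPE:
        return "prototype"  # refers to a prototype PTE (shared memory)
    if pte & TRANSITION:
        return "transition"  # resident, but not in the working set
    page_file_number = (pte >> 1) & 0xF
    page_file_offset = pte >> 12
    if page_file_number == 0 and page_file_offset == 0:
        return "demand zero"
    return "page file"


states = [
    classify_pte(0x12345001),            # valid bit set
    classify_pte(0),                     # nothing assigned yet
    classify_pte((0x5 << 12) | TRANSITION),
    classify_pte(0x80),                  # only other (e.g. protection) bits set
    classify_pte((0x3 << 12) | (1 << 1)),  # page file 1, offset 3
]
```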

Thursday, June 2, 2011

Large and Small Pages

The system makes use of both large and small pages for their individual merits. Although hardware these days can support page sizes as large as 1GB, the system picks the sizes for large and small pages based on their performance results. For example, on an x86 Windows system, small pages are 4KB and large pages are 4MB (2MB on PAE systems).

Large pages give better performance because they make efficient use of the TLB. When a byte in a large page is referenced, the translation for the whole page is cached in the TLB. This cached entry then makes accesses to any other byte of that page efficient.
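
A quick back-of-the-envelope calculation shows why this matters. The amount of memory a TLB can cover at once is the number of entries times the page size; the 64-entry TLB below is a made-up figure purely for illustration.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size):
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_size


ENTRIES = 64  # hypothetical TLB size
small_reach = tlb_reach(ENTRIES, 4 * KB)  # 256 KB with 4KB pages
large_reach = tlb_reach(ENTRIES, 4 * MB)  # 256 MB with 4MB pages
```

With the same number of entries, 4MB pages cover 1024 times as much memory as 4KB pages.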

On the downside, memory protection is enforced at page granularity. So with large pages, read-only code and read/write data often end up mapped to the same page. The protection for that page must then be relaxed to read/write, and a faulty or malicious program can write to the read-only code mapped to this page without being detected.

Windows configures its page usage so that it can take advantage of both large and small pages. It maps core operating system images and data with large pages, and user programs and data with small pages. For debugging purposes, a developer can override this behavior by running Driver Verifier, which disables large pages.

Monday, March 21, 2011

Soft Page Fault

A soft page fault is a page fault that can be resolved without disk I/O because the page is already resident in memory. The same mechanism can be used to migrate an already resident page to another page frame. One application area is moving pages among NUMA nodes based on their affinity (in other words, moving pages to their ideal node).

Expanding Kernel Stack

Although there is a limit on the kernel stack size for each thread (12KB plus a 4KB guard page), Windows supports mechanisms for efficient expansion of kernel stacks. It allocates an additional 16KB when stack growth nears the guard page, and during unwinding it deallocates the additional 16KB extensions. Kernel drivers can use KeExpandKernelStackAndCallout for this. But I guess judicious use is warranted.

Monday, March 14, 2011

TLB: Translation Look-Aside Buffer

The TLB is a lookup table maintained by the system that maps a virtual page address to its page frame number (or physical page address). It is used to cache the translations of frequently referenced virtual addresses. Each time a process context switch happens, the entries in this table that were private to the outgoing process are invalidated. The rest (specifically those with the Global bit set), such as system space pages, remain.

If a virtual page is paged out or its PTE is changed, the memory manager explicitly invalidates its entry (if present) in the TLB.
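
The behavior described above can be mimicked with a small dictionary-based model. It is only a sketch: the class and method names are hypothetical, and a real TLB is a fixed-size hardware cache with its own replacement policy.

```python
class Tlb:
    """Toy TLB: maps a virtual page number to (pfn, is_global)."""

    def __init__(self):
        self.entries = {}

    def insert(self, vpn, pfn, is_global=False):
        self.entries[vpn] = (pfn, is_global)

    def lookup(self, vpn):
        entry = self.entries.get(vpn)
        return entry[0] if entry else None  # None models a TLB miss

    def context_switch(self):
        # Per-process entries are dropped; global entries survive.
        self.entries = {
            vpn: entry for vpn, entry in self.entries.items() if entry[1]
        }

    def invalidate(self, vpn):
        # Explicit invalidation when a page is paged out or its PTE changes.
        self.entries.pop(vpn, None)


tlb = Tlb()
tlb.insert(vpn=0x00401, pfn=0x1A2B)                  # user page, private
tlb.insert(vpn=0x80100, pfn=0x0042, is_global=True)  # system page, global
tlb.context_switch()
user_hit = tlb.lookup(0x00401)    # miss after the switch
system_hit = tlb.lookup(0x80100)  # still cached
```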


Page Table Entry (PTE) Flags

In a 32-bit Windows operating system, bits 0 to 11 of a PTE hold the memory management flags associated with the page referred to by that PTE. Following is a short description of some of the important ones:

 - Accessed: Page has been accessed (read or written).
 - Copy-On-Write: Usually for shared memory. When a process writes to such a page, a copy is made and the copy becomes private to that process.
 - Dirty: Page has been written to.
 - Global: Translation applies to all processes. A TLB flush on context switch does not affect this PTE.
 - Prototype: Software flag used as a construct for sharing memory.
 - Valid: Translation maps to valid physical memory.
 - Write: Indicates whether the page is writable.
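
A PTE value can be decoded into these flags with a few bit masks. The bit positions below follow the x86 hardware layout for Valid, Write, Accessed, Dirty, and Global; the positions of the software-defined Copy-On-Write and Prototype bits are my assumption, and the names are illustrative only.

```python
from enum import IntFlag


class PteFlags(IntFlag):
    VALID = 1 << 0
    WRITE = 1 << 1
    ACCESSED = 1 << 5
    DIRTY = 1 << 6
    GLOBAL = 1 << 8
    COPY_ON_WRITE = 1 << 9  # software-defined bit (assumed position)
    PROTOTYPE = 1 << 10     # software-defined bit (assumed position)


def decode_flags(pte):
    """Return the names of the flags set in bits 0 to 11 of a PTE."""
    bits = pte & 0xFFF
    return [flag.name for flag in PteFlags if bits & flag]


flags = decode_flags(0x00ABC063)  # low 12 bits are 0x063
```

For the value 0x00ABC063 this yields Valid, Write, Accessed, and Dirty; the upper 20 bits (0x00ABC) are the page frame number.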

System Space Mapping to Users Virtual Address Space

The system address space (and session space, if applicable) is mapped into the virtual address space of every user process. The system address space consists of shared memory that can be accessed by all processes. This memory is shared using Section objects, which use page tables that can be shared. How exactly this sharing is achieved can be discussed in another post sometime later.

At process initialization, when the Page Directory of the process is being set up, it is also populated with the PDEs that correspond to system space virtual addresses.

Page Directory

In the normal scheme of things, one Page Directory is associated with each process. Whenever a process context switch happens, the physical address of the Page Directory is loaded into a designated register (CR3 on x86). The per-process Page Directory physical address is maintained in the process block of that process.

In addition to this, the Page Directory is also mapped at a system-defined virtual address. This address typically remains the same for all processes, so the system does all the initialization necessary to set up this mapping.

Virtual to Physical Memory Address

Take the example of a 32-bit Windows operating system. There are basically three types of entities involved in address translation: the Page Directory, Page Tables, and physical pages. The Page Directory contains pointers (PDEs) to Page Tables. Page Tables contain pointers (PTEs) to data pages (the pages that contain the actual data referenced by the virtual address). Each Page Directory and each Page Table is kept in its own physical page.

A virtual address is 32 bits in size and contains the following members:

 - Page Directory Entry (PDE) Index: 10 Bits
 - Page Table Entry (PTE) Index: 10 Bits
 - Byte Index: 12 Bits

Since the page size is 4KB, a 12-bit Byte Index is sufficient to reference each individual byte in a page. Similarly, the 10-bit PDE and PTE Indexes are sufficient to reference each individual 4-byte entry in a page.
(Number of bytes in a page: 4096, Number of PDEs/PTEs in a page: 1024)
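
The 10/10/12 split can be written out with shifts and masks. The function name and the sample address are only for illustration.

```python
def split_virtual_address(va):
    """Split a 32-bit virtual address into (PDE index, PTE index, byte index)."""
    pde_index = (va >> 22) & 0x3FF  # top 10 bits
    pte_index = (va >> 12) & 0x3FF  # middle 10 bits
    byte_index = va & 0xFFF         # low 12 bits
    return pde_index, pte_index, byte_index


pde_index, pte_index, byte_index = split_virtual_address(0x7FFE0304)
# pde_index = 0x1FF, pte_index = 0x3E0, byte_index = 0x304
```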

PDEs and PTEs are nothing but pointers to actual physical pages in memory. They have a similar structure and the same size. In this case they are 4 bytes in size.

A PDE/PTE consists of two parts: the PFN (Page Frame Number) and the software/hardware flags associated with that page.
The flags help the memory manager manage the pages.

The PFN is nothing but the physical location of a page. The PFN is 20 bits long and the page size is 4KB, so it can be used to address 4GB of physical memory.
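
The arithmetic behind the 4GB figure, and the way a PFN combines with the byte index to form a physical address, can be sketched as follows. The function name and sample values are hypothetical.

```python
PAGE_SHIFT = 12  # 4KB pages
PFN_BITS = 20


def physical_address(pte, byte_index):
    """Combine the PFN held in the upper 20 bits of a PTE with a byte index."""
    pfn = pte >> PAGE_SHIFT  # drop the 12 flag bits
    return (pfn << PAGE_SHIFT) | byte_index


# 2^20 page frames of 2^12 bytes each gives 2^32 bytes, i.e. 4GB.
addressable_bytes = (1 << PFN_BITS) * (1 << PAGE_SHIFT)

address = physical_address(0x00ABC063, 0x304)  # PFN 0xABC, byte 0x304
```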

The scheme can be modified for different needs, such as addressing a larger address space or recognizing more than 4GB of physical memory. The modifications include adding more levels to the translation, increasing the PTE/PDE size, etc.

Wednesday, February 9, 2011

Stack Cookie and Encoded pointers

Stack cookies and encoded pointers are used for protection against malicious software. However, they can also be useful debugging tools.

Stack cookies are signatures that the compiler places at the start of each stack frame. The compiler also adds a special prologue and epilogue around the function body. The epilogue checks the cookie value while unwinding the stack frame, and if the check fails it calls an exception handler (a standard one provided by the framework, an extended one, or an entirely custom one provided by the developer).

Encoded pointers are basically encrypted pointer values that need to be decrypted before they can be used. An invalid value can raise an exception. It would be interesting to work out the last validation step.
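
To get a feel for the idea, here is a toy version that hides a pointer by XOR-ing it with a secret value. The XOR scheme, the 32-bit cookie, and all the names are my assumptions for illustration; this is not meant to reproduce the actual Windows algorithm.

```python
import secrets


class PointerEncoder:
    """Toy pointer encoder: XOR with a secret picked once per process."""

    def __init__(self, cookie=None):
        # The secret is random unless one is supplied (useful for testing).
        self.cookie = secrets.randbits(32) if cookie is None else cookie

    def encode(self, pointer):
        return pointer ^ self.cookie

    def decode(self, encoded):
        return encoded ^ self.cookie


encoder = PointerEncoder(cookie=0xDEADBEEF)
stored = encoder.encode(0x00401000)  # value kept in memory
usable = encoder.decode(stored)      # original pointer, recovered before use

# An attacker who overwrites the stored value without knowing the cookie
# ends up with an unpredictable address after decoding.
tampered = encoder.decode(0x41414141)
```

In this scheme, decoding a tampered value does not fail by itself; it simply produces an address the attacker could not choose, which is then likely to fault when used.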

There would be a performance penalty. Would it be possible to design sentinel-like objects (independent machinery) that do all this validation and send messages to interested entities?

Tuesday, February 8, 2011