(12) Patent Application Publication (10) Pub. No.: US 2005/0132250 A1
Hansen et al.                       (43) Pub. Date:     Jun. 16, 2005
`
`(54) PERSISTENT MEMORY DEVICE FOR
`BACKUP PROCESS CHECKPOINT STATES
`
`(75) Inventors: Roger Hansen, San Francisco, CA
`(US); Pankaj Mehra, San Jose, CA
`(US); Sam Fineberg, Palo Alto, CA
(US)
`Correspondence Address:
`HEWLETT PACKARD COMPANY
`P O BOX 272400, 3404 E. HARMONY ROAD
`INTELLECTUAL PROPERTY
`ADMINISTRATION
`FORT COLLINS, CO 80527-2400 (US)
`
(73) Assignee: Hewlett-Packard Development Company, L.P., Houston, TX
`
(21) Appl. No.: 10/737,374

(22) Filed: Dec. 16, 2003
`
Publication Classification
`
(51) Int. Cl.7 ................................................... G06F 11/00
(52) U.S. Cl. ............................................................ 714/5
`
`(57)
`
`ABSTRACT
`
A system is described that includes a network interface attached to a persistent memory unit. The persistent memory unit is configured to receive checkpoint data from a primary process, and to provide access to the checkpoint data for use in a backup process, which provides recovery capability in the event of a failure of the primary process. The network interface is configured to provide address translation information between virtual and physical addresses in the persistent memory unit. In other embodiments, the persistent memory unit is capable of storing multiple updates to the checkpoint state. The checkpoint state and the updates to the checkpoint state, if any, can be retrieved by the backup process periodically, or all at once upon failure of the primary process.
`
[Drawing sheets 1-5: FIG. 1A, block diagram of system 100 with processor nodes 104, 106 running primary process 116 and backup process 122, network interfaces 108, 110, 114, PMM 140, and NPMU 102 holding checkpoint state 120 and address translation and protection tables; FIG. 1B, write/read access to checkpoint state 120 in persistent memory 102; FIG. 1C, NPMU 102 holding checkpoint state 120 and checkpoint update areas 128, 130, 132; FIG. 1D, append/read access to checkpoint state 120; FIG. 2, NPMU with non-volatile memory 202, NI 114, and communication links 206, 210; FIG. 3, NPMU with volatile memory 302, battery 304, CPU 306, and non-volatile secondary storage 310; FIG. 4, mapping of PM virtual addresses 402-416 onto PM physical addresses 418-448; FIG. 5, illustrative computer system.]
`
`
`PERSISTENT MEMORY DEVICE FOR BACKUP
`PROCESS CHECKPOINT STATES
`
`BACKGROUND
`
[0001] Failure of a computer, as well as of application programs executing on a computer, can often result in the loss of significant amounts of data and intermediate calculations. The cause of failure can be either hardware or software related, but in either instance the consequences can be expensive, particularly when data manipulations are interrupted in mid-stream. In the case of large software applications, a failure might require an extensive effort to regenerate the application's state as it existed prior to the failure.
[0002] Generally, checkpoint and restoration techniques periodically save the process state during normal execution, and thereafter restore the saved state to a backup process following a failure. In this manner, the amount of lost work is limited to the progress made by the application process since the restored checkpoint.
[0003] Traditionally, computers have stored checkpoint data either in system memory coupled to the computer's processor, or on other input/output (I/O) storage devices such as magnetic tape or disk. I/O storage devices can be attached to a system through an I/O bus such as PCI (originally named Peripheral Component Interconnect), or through a network such as Fiber Channel, Infiniband, ServerNet, or Ethernet. I/O storage devices are typically slow, with access times of more than one millisecond. They utilize special I/O protocols such as the small computer systems interface (SCSI) protocol or the transmission control protocol/internet protocol (TCP/IP), and typically operate as block exchange devices (e.g., data is read or written in fixed-size blocks). A feature of these types of I/O storage devices is that they are persistent, such that when they lose power or are restarted they retain the information previously stored on them. In addition, I/O storage devices can be accessed from multiple processors through shared I/O networks, even after some processors have failed.
`
[0004] As used herein, the term "persistent" refers to a computer memory storage device that can withstand a power reset without loss of the contents in memory. Persistent memory devices have been used to store data for starting or restarting software applications. In simple systems, persistent memory devices are static and not modified as the software executes. The initial state of the software environment is stored in persistent memory. In the event of a power failure to the computer or some other failure, the software restarts its execution from the initial state. One problem with this approach is that all intermediate calculations will have to be recomputed. This can be particularly onerous if large amounts of user data must be reloaded during this process. If some or all of the user data is no longer available, it may not be possible to reconstruct the pre-failure state.
`
[0005] System memory is generally connected to a processor through a system bus, where such memory is relatively fast, with guaranteed access times measured in tens of nanoseconds. Moreover, system memory can be directly accessed with byte-level granularity. System memory, however, is normally volatile, such that its contents are lost if power is lost or if a system embodying such memory is restarted. Also, system memory is usually within the same fault domain as a processor, such that if a processor fails, the attached memory also fails and may no longer be accessed. Metadata, which describes the layout of memory, is also lost when power is lost or when the system embodying such memory is restarted.
`
[0006] Prior art systems have used battery-backed dynamic random access memory (BBDRAM), solid-state disks, and network-attached volatile memory. Prior BBDRAM, for example, may have some performance advantages over true persistent memory. It is not, however, globally accessible. Moreover, BBDRAM that lies within the same fault domain as an attached CPU will be rendered inaccessible in the event of a CPU failure or operating system crash. Accordingly, BBDRAM is often used in situations where all system memory is persistent so that the system may be restarted quickly after a power failure or reboot. BBDRAM is still volatile during long power outages, such that alternate means must be provided to store its contents before the batteries drain. Importantly, this use of BBDRAM is very restrictive and not amenable to use in network-attached persistent memory applications, for example.
[0007] Battery-backed solid-state disks (BBSSDs) have also been proposed for other implementations. These BBSSDs provide persistent memory, but functionally they emulate a disk drive. An important disadvantage of this approach is the additional latency associated with access to these devices through I/O adapters. This latency is inherent in the block-oriented and file-oriented storage models used by disks and, in turn, by BBSSDs, which do not bypass the host computer's operating system. While it is possible to modify solid-state disks to eliminate some shortcomings, the inherent latency cannot be eliminated because performance is limited by the I/O protocols and their associated device drivers. As with BBDRAM, additional technologies are required for providing the checkpoint state of an application program in a failed domain to a backup copy of the application program running in an operational domain.
`
`SUMMARY
`
[0008] In some embodiments, a system includes a network interface attached to a persistent memory unit. The persistent memory unit is configured to receive checkpoint data from a primary process, and to provide access to the checkpoint data for use in a backup process to support recovery capability in the event of a failure of the primary process. The network interface is configured to provide address translation information between virtual and physical addresses in the persistent memory unit. In other embodiments, the persistent memory unit is capable of storing multiple updates to the checkpoint state. The checkpoint state and the updates to the checkpoint state, if any, can be retrieved by the backup process periodically, or all at once upon failure of the primary process.
`
[0009] In yet other embodiments, a method for recovering the operational state of a primary process includes mapping virtual addresses of a persistent memory unit to physical addresses of the persistent memory unit, and receiving checkpoint data regarding the operational state of the primary process in the persistent memory unit. In some embodiments, the checkpoint data is provided to a backup process. In still other embodiments, context information regarding the addresses is provided to the primary process and the backup process.
`
`
`
`
[0010] In other embodiments, the persistent memory unit provides the checkpoint data to the backup process when the primary process fails. Alternatively, in still other embodiments, the persistent memory unit can be configured to store multiple sets of checkpoint data sent from the processor at successive time intervals, or to provide the multiple sets of checkpoint data to the backup process at one time.
`
[0011] These and other embodiments will be understood by one of ordinary skill in the art to which this disclosure pertains upon an understanding of the present disclosure.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
[0012] The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain its principles:
`
[0013] FIG. 1A is a block diagram of an embodiment of a system that includes a network attached persistent memory unit (NPMU) capable of storing checkpoint state information;

[0014] FIG. 1B is a diagram of an embodiment of one method for accessing checkpoint state information from the NPMU of FIG. 1A;

[0015] FIG. 1C is a block diagram of an embodiment of a system that includes a network attached persistent memory unit (NPMU) capable of storing multiple sets of checkpoint state information;

[0016] FIG. 1D is a diagram of an embodiment of another method for accessing checkpoint state information from the NPMU of FIG. 1C;

[0017] FIG. 2 is a block diagram of an embodiment of a network attached persistent memory unit (NPMU);

[0018] FIG. 3 is a block diagram of an embodiment of a network attached persistent memory unit (NPMU) using battery backup;

[0019] FIG. 4 is a block diagram illustrating mappings from a persistent memory virtual address space to a persistent memory physical address space; and

[0020] FIG. 5 is a block diagram of an illustrative computer system on which a network attached persistent memory unit (NPMU) can be implemented.
`
`DETAILED DESCRIPTION
`
[0021] Whereas prior art systems have used persistent memory only in the context of block-oriented and file-oriented I/O architectures with their relatively large latencies, the present teachings describe memory that is persistent like traditional I/O storage devices, but that can be accessed like system memory, with fine granularity and low latency. Systems according to the present teachings allow application programs to store one or more checkpoint states, which can be accessed by a backup copy of the application in the event of a hardware or software failure that prevents the primary application program from executing.
[0022] As shown in FIG. 1A, a system 100 using network-attached persistent memory includes a network-attached persistent memory unit (NPMU) 102 that can be accessed by one or more processor nodes 104, 106 through corresponding network interfaces (NI) 108, 110, and a system area network (SAN) 112, such as a Remote Direct Memory Access (RDMA)-enabled SAN. RDMA can be implemented as a feature of NIs 108, 110, and 114 to enable processor nodes 104, 106 to directly store and retrieve information in the memory of NPMU 102. Transferring data directly to or from NPMU 102 eliminates the need to copy data between memory in processor nodes 104, 106 and kernel I/O processes in operating systems 144, 146. RDMA capability thus reduces the number of context switches between primary process 116 and backup process 122 and operating systems 144, 146 while handling memory transfers via SAN 112.
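By way of a non-limiting illustration of the RDMA-style transfer described above, the following C sketch models the effect of a one-sided write into an NPMU memory region, with no kernel involvement on the target. The names npmu_region and rdma_write, and the in-process buffer standing in for NPMU memory, are assumptions made for illustration and are not part of this disclosure or of any particular SAN interface.

/*
 * Illustrative sketch only: models the effect of an RDMA-style one-sided
 * write, in which a primary process deposits checkpoint bytes directly into
 * NPMU memory at byte-level granularity. "npmu_region" and "rdma_write" are
 * hypothetical names used for illustration.
 */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* A registered region of NPMU persistent memory, as seen by an initiator. */
typedef struct {
    uint8_t *base;      /* mapped base of the remote region */
    size_t   length;    /* size of the region in bytes      */
} npmu_region;

/* One-sided write: copy directly into the remote region at a byte offset.
 * No kernel call or context switch on the target is modeled here.         */
static int rdma_write(npmu_region *r, size_t offset,
                      const void *src, size_t len)
{
    if (offset + len > r->length)
        return -1;                    /* would overrun the region */
    memcpy(r->base + offset, src, len);
    return 0;
}

int main(void)
{
    static uint8_t persistent_backing[4096];   /* stands in for NPMU memory */
    npmu_region region = { persistent_backing, sizeof persistent_backing };

    const char checkpoint[] = "primary-process state at t=42";
    if (rdma_write(&region, 0, checkpoint, sizeof checkpoint) == 0)
        printf("checkpoint of %zu bytes written at byte-level granularity\n",
               sizeof checkpoint);
    return 0;
}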
`
[0023] SAN 112 accesses NPMU 102 via network interface (NI) 114. NPMU 102 combines the durability and recoverability of storage I/O with the speed and fine-grained access of system memory. Like storage, the contents of NPMU 102 can survive the loss of power or a system restart. Like remote memory, NPMU 102 can be accessed across SAN 112. However, unlike directly-connected memory, NPMU 102 can continue to be accessed even after one or more processor nodes 104, 106 have failed.
`
[0024] Primary process 116 running on processor node 104 can initiate remote commands, for example, a write command to send data for checkpoint state 120 in NPMU 102. Primary process 116 can also provide data for checkpoint state 120 periodically. Backup process 122 running on processor node 106 is configured to perform the functions of primary process 116 in the event of a failure of primary process 116. Backup process 122 can also initiate remote read and write operations to NPMU 102, such as a read command to access checkpoint state 120 periodically and/or upon failure of primary process 116.
`
[0025] In a write operation initiated by processor node 104, for example, once data has been successfully stored in NPMU 102, the data is durable and will survive a power outage or a failure of processor node 104, 106. In particular, memory contents will be maintained as long as NPMU 102 continues to function correctly, even after the power has been disconnected for an extended period of time, or the operating system on processor node 104, 106 has been rebooted. In addition to data transfer operations, NPMU 102 can be configured to respond to various management commands.
`
[0026] In some embodiments, processor nodes 104, 106 are computer systems that include at least one central processing unit (CPU) and system memory, wherein the CPU is configured to run operating systems 144, 146. Processor nodes 104, 106 can additionally be configured to run one or more of any type of application program, such as primary process 116 and backup process 122. Although system 100 is shown with two processor nodes 104, 106, additional processor nodes (not shown) can communicate with SAN 112 as well as with processor nodes 104, 106 over a network (not shown) via network interfaces 108, 110, 114.
`
[0027] In some embodiments, SAN 112 is an RDMA-enabled network connecting multiple network interface units (NI), such as NIs 108, 110, and 114, that can perform byte-level memory operations between two processor nodes 104, 106, or between processor nodes 104, 106 and a device such as NPMU 102, without notifying operating systems 144, 146. In this case, SAN 112 is configured to perform virtual-to-physical address translation to map contiguous network virtual address spaces onto discontiguous physical address spaces. This type of address translation allows for dynamic management of NPMU 102. Commercially available SANs 112 with RDMA capability include, but are not limited to, ServerNet, GigaNet, Infiniband, and all Virtual Interface Architecture compliant SANs.
`
[0028] Processor nodes 104, 106 are generally attached to SAN 112 through respective NIs 108, 110; however, many variations are possible. More generally, a processor node need only be connected to an apparatus for communicating read and write operations. For example, in another implementation of this embodiment, processor nodes 104, 106 include various CPUs on a motherboard that utilize a data bus, for example a PCI bus, instead of SAN 112. It is noted that the present teachings can be scaled up or down to accommodate larger or smaller implementations as needed.
`
[0029] Network interfaces (NI) 108, 110, 114 are communicatively coupled to NPMU 102 to allow for access to the persistent memory contained within NPMU 102. Any suitable technology can be utilized for the various components of FIG. 1A, including the type of persistent memory. Accordingly, the embodiment of FIG. 1A is not limited to a specific technology for realizing the persistent memory. Indeed, multiple memory technologies, including magnetic random access memory (MRAM), magneto-resistive random access memory (MRRAM), polymer ferroelectric random access memory (PFRAM), ovonics unified memory (OUM), BBDRAM, and FLASH memories of all kinds, may be appropriate. System 100 can be configured to allow high-granularity memory access, including byte-level memory access, compared to BBSSDs, which transfer entire blocks of information.
`
[0030] Notably, memory access granularity can be adjusted as required in system 100. The access speed of memory in NPMU 102 should also be fast enough to support the transfer rates of the data communication scheme implemented for system 100.
`
[0031] It should be noted that information is persistent only to the extent that the persistent memory in use can hold data. For example, in many applications, persistent memory may be required to store data regardless of the amount of time power is lost, whereas in other applications, persistent memory may only be required to hold data for a few minutes or hours.
`
[0032] Memory management functionality can be provided in system 100 to create one or more independent, indirectly-addressed memory regions. Moreover, NPMU meta-data can be provided for memory recovery after loss of power or processor failure. Meta-data can include, for example, the contents and layout of the protected memory regions within NPMU 102. In this way, NPMU 102 stores the data as well as the manner of using the data. When the need arises, NPMU 102 can provide the meta-data to backup process 122 to allow system 100 to recover from a power or system failure associated with primary process 116.
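As a non-limiting illustration of the kind of self-describing meta-data discussed above, the following C sketch shows one possible record describing the layout of a protected region; all field names are assumptions for illustration only and are not defined by this disclosure.

/*
 * Minimal sketch of self-describing NPMU meta-data (field names assumed for
 * illustration): kept in the NPMU alongside the data it describes, so a
 * backup process can locate and interpret the checkpoint regions after a
 * power loss or processor failure.
 */
#include <stdint.h>

typedef struct {
    char     region_name[32];   /* e.g. "primary-116-checkpoint"           */
    uint64_t base_offset;       /* where the region starts within the NPMU */
    uint64_t length;            /* size of the region in bytes             */
    uint32_t update_count;      /* number of update areas written so far   */
    uint32_t layout_version;    /* lets readers interpret the data format  */
} npmu_region_metadata;

/* A fixed directory of such records, itself held in persistent memory,
 * would let the PMM keep the meta-data consistent with the stored data. */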
`
[0033] In the embodiment of system 100 shown in FIG. 1A, each update to checkpoint state 120 can overwrite some or all of the information currently stored for checkpoint state 120. Since there is only one copy of checkpoint state 120, backup process 122 can remain idle until primary process 116 fails, and then read checkpoint state 120 to continue the functions that were being performed by primary process 116. FIG. 1B is a diagram of an embodiment of a method for accessing checkpoint state 120 from NPMU 102. As shown, primary process 116 writes data to the beginning address of checkpoint state 120, and backup process 122 reads data from the beginning of checkpoint state 120. In such an embodiment, only one copy of checkpoint state 120 needs to be maintained.
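The single-copy scheme of FIG. 1B can be illustrated with the following hedged C sketch, in which the primary overwrites one checkpoint slot in place and the backup reads it from the beginning only when it takes over; the slot layout and function names are illustrative assumptions, not structures defined by this disclosure.

/*
 * Minimal sketch (assumed layout): a single checkpoint slot that the primary
 * overwrites in place and the backup reads from the beginning at takeover.
 */
#include <stdint.h>
#include <string.h>

enum { SLOT_SIZE = 4096 };

typedef struct {
    uint32_t length;                 /* bytes valid in data[]           */
    uint8_t  data[SLOT_SIZE];        /* most recent checkpoint image    */
} checkpoint_slot;                   /* lives in NPMU persistent memory */

/* Primary: each checkpoint overwrites the previous one at the same address. */
static void write_checkpoint(checkpoint_slot *slot,
                             const void *state, uint32_t len)
{
    memcpy(slot->data, state, len);
    slot->length = len;              /* publish the new length last */
}

/* Backup: on failure of the primary, read the one current image. */
static uint32_t read_checkpoint(const checkpoint_slot *slot,
                                void *out, uint32_t max)
{
    uint32_t n = slot->length < max ? slot->length : max;
    memcpy(out, slot->data, n);
    return n;
}

int main(void)
{
    static checkpoint_slot slot;     /* stands in for NPMU memory */
    char state[] = "state after step 7";
    write_checkpoint(&slot, state, sizeof state);

    char recovered[SLOT_SIZE];
    read_checkpoint(&slot, recovered, sizeof recovered);
    return 0;
}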
`
[0034] FIG. 1C shows another embodiment of NPMU 102 configured with multiple checkpoint update areas 128-132 associated with checkpoint state 120. Checkpoint state 120 can include a full backup state for primary process 116. Each update to checkpoint state 120 can be appended to the previously written information, thereby creating a series of updates to checkpoint state 120 in non-overlapping update areas 128-132 of NPMU 102. When checkpoint state 120 is relatively large, update areas 128-132 provide a benefit of eliminating the need for primary process 116 to write a complete checkpoint state 120 each time information is checkpointed.
[0035] For example, primary process 116 may read in large blocks of data during initialization, and update various segments of the data at different phases of operation. The initial checkpoint state 120 can include a backup of all the data, while update areas 128-132 can be used to store smaller segments of the data as the segments are updated. Backup process 122 can then initialize itself with checkpoint state 120, and apply data from the subsequent update areas 128-132 in the order they were written. Further, backup process 122 does not have to wait until primary process 116 fails to begin initializing itself with data from checkpoint state 120 and update areas 128-132. This is especially useful when there is potential to overflow the amount of storage space available for checkpoint state 120 and update areas 128-132, or when it would take longer than desired for backup process 122 to recreate the state of primary process 116 after primary process 116 fails.
`
[0036] FIG. 1D is a diagram of an embodiment of a method for accessing checkpoint state 120 from NPMU 102 of FIG. 1C. As shown, primary process 116 appends data to the address of checkpoint state 120 and update areas (not shown) following the address where data was last written to NPMU 102. Backup process 122 reads data from the beginning to the end of the last area that was written by primary process 116. In such embodiments, facilities are provided to supply backup process 122 with the starting and ending location of the most current update to checkpoint state 120, as further discussed herein.
`
[0037] Whether backup process 122 reads checkpoint state 120 and update areas 128-132 periodically, or when primary process 116 fails, backup process 122 can read any previously unread portion of checkpoint state 120 and update areas 128-132 before taking over for primary process 116.
`
[0038] Utilizing NPMU 102 allows primary process 116 to store checkpoint state 120 regardless of the identity, location, or operational state of backup process 122. Backup process 122 can be created in any remote system that has access to NPMU 102. Primary process 116 can write checkpoint state 120 and/or update areas 128-132 whenever required, without waiting for backup process 122 to acknowledge receipt of messages. Additionally, NPMU 102 allows efficient use of available information technology (IT) resources, since backup process 122 only needs to execute either (1) when primary process 116 fails, or (2) to periodically read information from checkpoint state 120 and/or update areas 128-132 to avoid overflowing NPMU 102. In contrast, some previously known checkpointing techniques utilize message passing between a primary process and a backup process to communicate checkpoint information. The primary process thus required information regarding the identity and location of the backup process. Additionally, in previously known systems the backup process had to be operational in order to synchronize with the primary process to receive the checkpoint message.
`
[0039] Further, NPMU 102 can be implemented in hardware, thereby providing fast access for read and write operations. Other previously known checkpointing techniques store checkpoint information on magnetic or optical media, which require much more time to access than NPMU 102.
`
[0040] FIG. 2 shows an embodiment of NPMU 102 that uses non-volatile memory 202 communicatively coupled to NI 114 via a communication link 206. Non-volatile memory 202 can be, for example, MRAM or Flash memory. NI 114 typically does not initiate its own requests; instead, NI 114 receives management commands from SAN 112 via communication link 210 and carries out the requested management operations. Specifically, NPMU 200 can translate incoming requests and then carry out the requested operation. Further details on command processing will be discussed below. Communication links 206, 210 can be configured for wired and/or wireless communication. SAN 112 can be any suitable communication and processing infrastructure between NI 114 and other nodes, such as processor nodes 104, 106 in FIG. 1A. For example, SAN 112 can be a local area network and/or a wide area network such as the Internet.
`
[0041] FIG. 3 shows another embodiment of NPMU 102 using a combination of volatile memory 302 with a battery 304 and a non-volatile secondary store 310. In this embodiment, when power fails, the data within volatile memory 302 is preserved using the power of battery 304 until such data can be saved to non-volatile secondary store 310. Non-volatile secondary store 310 can be, for example, a magnetic disk or slow FLASH memory. The transfer of data from volatile memory 302 to non-volatile secondary store 310 can occur without external intervention or any further power other than from battery 304. Accordingly, any required tasks are typically completed before battery 304 completely discharges. As shown, NPMU 102 includes an optional CPU 306 running an embedded operating system. Accordingly, the backup task (i.e., the data transfer from volatile memory 302 to non-volatile secondary store 310) can be performed by software running on CPU 306. NI 114 initiates requests under the control of software running on CPU 306. CPU 306 can receive management commands from the network and carry out the requested management operations.
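As a non-limiting illustration of the battery-backed variant of FIG. 3, the following C sketch models embedded software copying volatile memory contents to a non-volatile secondary store once external power is lost; the hook names and the simulated power event are assumptions for illustration and do not represent actual NPMU firmware.

/*
 * Illustrative sketch (hypothetical names): on loss of external power, the
 * embedded CPU copies battery-preserved volatile memory to a non-volatile
 * secondary store before the battery drains, with no host intervention.
 */
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <stdbool.h>

enum { VOLATILE_SIZE = 4096 };

static uint8_t volatile_memory[VOLATILE_SIZE];   /* battery-preserved DRAM  */
static uint8_t secondary_store[VOLATILE_SIZE];   /* e.g. disk or slow flash */

/* Hypothetical hook the embedded firmware would provide (simulated here). */
static bool external_power_lost(void) { return true; }

/* Runs on the NPMU's embedded CPU when a power event is signaled. */
static void on_power_event(void)
{
    if (!external_power_lost())
        return;
    /* Complete the copy while the battery still holds up the DRAM. */
    memcpy(secondary_store, volatile_memory, VOLATILE_SIZE);
    printf("volatile contents saved to non-volatile secondary store\n");
}

int main(void)
{
    memset(volatile_memory, 0xAB, VOLATILE_SIZE);  /* pretend checkpoint data */
    on_power_event();
    return 0;
}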
`
[0042] Various embodiments of NPMU 102 can be managed to facilitate resource allocation and sharing. In some embodiments, NPMU 102 is managed by a persistent memory manager (PMM) 140, as shown in FIG. 1A. PMM 140 can be located internal or external to NPMU 102. When PMM 140 is internal to NPMU 102, processor nodes 104, 106 can communicate with PMM 140 via SAN 112 and network interface (NI) 114 to perform requested management tasks, such as allocating or de-allocating regions of persistent memory of NPMU 102, or to use an existing region of persistent memory. When PMM 140 is external to NPMU 102, processor nodes 104, 106 can issue requests to NPMU 102, and NPMU 102 can interface with PMM 140 via NI 114, SAN 112, and NI 141 associated with PMM 140. As a further alternative, processor nodes 104, 106 can communicate directly with PMM 140 via NIs 108, 110, respectively, and SAN 112 and NI 141. PMM 140 can then issue the appropriate commands to NPMU 102 to perform requested management tasks.
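One possible shape of the management interface described above is sketched below in C; the command set, record layout, and names (pmm_allocate, pmm_release) are assumptions made for illustration and are not the interface defined by this application.

/*
 * Sketch of a persistent memory manager (PMM) allocating and releasing
 * regions of NPMU persistent memory on behalf of processor nodes.
 * All names and fields are hypothetical.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t region_id;      /* handle returned to the requesting node     */
    uint64_t base_vaddr;     /* PM virtual address of the contiguous block */
    uint64_t length;         /* size of the allocated region in bytes      */
    uint32_t owner_node;     /* processor node granted access              */
} pm_region;

enum { MAX_REGIONS = 16 };
static pm_region regions[MAX_REGIONS];
static uint32_t  n_regions;
static uint64_t  next_vaddr = 0x10000000ULL;     /* arbitrary PM virtual base */

/* Allocate a region of NPMU persistent memory for a processor node. */
static int pmm_allocate(uint32_t owner_node, uint64_t length, pm_region *out)
{
    if (n_regions == MAX_REGIONS)
        return -1;
    pm_region *r = &regions[n_regions];
    r->region_id  = n_regions;
    r->base_vaddr = next_vaddr;
    r->length     = length;
    r->owner_node = owner_node;
    next_vaddr   += length;
    n_regions++;
    *out = *r;
    return 0;
}

/* De-allocate by marking the region unowned (physical reuse not modeled). */
static int pmm_release(uint32_t region_id)
{
    if (region_id >= n_regions)
        return -1;
    regions[region_id].owner_node = UINT32_MAX;
    return 0;
}

int main(void)
{
    pm_region r;
    if (pmm_allocate(/*owner_node=*/104, /*length=*/1 << 20, &r) == 0)
        printf("region %u at PM virtual 0x%llx granted to node %u\n",
               r.region_id, (unsigned long long)r.base_vaddr, r.owner_node);
    pmm_release(r.region_id);
    return 0;
}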
`
[0043] Note that because NPMU 102 can be durable, and can maintain a self-describing body of persistent data, meta-data related to existing persistent memory regions can be stored on NPMU 102. PMM 140 can perform management tasks that will keep the meta-data on NPMU 102 consistent with the persistent data stored on NPMU 102. In this manner, the NPMU's stored data can always be interpreted using the NPMU's stored meta-data and thereby recovered after a possible system shutdown or failure. NPMU 102 thus maintains in a persistent manner not only the data being manipulated but also the state of the processing of such data. Upon a need for recovery, system 100 using an NPMU 102 is thus able to recover and continue operation from the memory state in which a power failure or operating system crash occurred.
`
[0044] As described with reference to FIG. 1A, SAN 112 provides basic memory management and virtual memory support. In such an implementation, PMM 140 can program the logic in NI 114 to enable remote read and write operations, while simultaneously protecting the persistent memory from unauthorized or inadvertent accesses by all except a select set of entities on SAN 112. Moreover, as shown in FIG. 4, NPMU 102 can support virtual-to-physical address translation. For example, a contiguous virtual address space, such as persistent memory (PM) virtual addresses 402-416, can be mapped or translated to discontiguous persistent memory physical addresses 418-448. PM virtual addresses can be referenced relative to a base address through N incremental addresses. Such PM virtual addresses, however, can also correspond to discontiguous PM physical addresses.
`
[0045] As shown, PM virtual address 402 can actually correspond to PM physical address 436, and so on. Accordingly, NPMU 102 can provide the appropriate translation from the PM virtual address space to the PM physical address space, and vice versa. In this way, the translation mechanism allows NPMU 102 to present contiguous virtual address ranges to processor nodes 104, 106, while still allowing dynamic management of the NPMU's physical memory. This can be important because of the persistent nature of the data on an NPMU 102. Due to configuration changes, the number of processes accessing a particular NPMU 102, or possibly the sizes of their respective allocations, may change over time. The address translation mechanism allows NPMU 102 to readily accommodate such changes without loss of data. The address translation mechanism further allows easy and efficient use of persistent memory capacity by neither forcing the processor nodes 104, 106 to anticipate future memory needs in advance of allocation nor forcing the processor nodes 104, 106 to waste persistent memory capacity through pessimistic allocation.
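The page-granular mapping of FIG. 4 can be illustrated with the following C sketch, in which a contiguous block of PM virtual addresses is translated, page by page, onto physical pages that need not be contiguous; the table layout, page size, and sample frame numbers are illustrative assumptions only.

/*
 * Sketch of page-granular PM virtual-to-physical translation: a contiguous
 * virtual range maps onto scattered physical page frames. Values are
 * illustrative, not taken from the specification.
 */
#include <stdint.h>
#include <stdio.h>

enum { PAGE_SIZE = 4096, N_PAGES = 8 };

/* One translation entry per virtual page of the allocated region. */
static uint64_t page_table[N_PAGES] = {
    /* virtual pages 0..7 -> scattered physical page frames */
    9, 2, 14, 5, 11, 0, 7, 3
};

static uint64_t pm_virtual_base = 0x10000000ULL;   /* base of the region */

/* Translate a PM virtual address inside the region to a PM physical address. */
static int translate(uint64_t vaddr, uint64_t *paddr)
{
    uint64_t offset = vaddr - pm_virtual_base;
    uint64_t vpage  = offset / PAGE_SIZE;
    if (vaddr < pm_virtual_base || vpage >= N_PAGES)
        return -1;                                  /* outside the allocation */
    *paddr = page_table[vpage] * PAGE_SIZE + offset % PAGE_SIZE;
    return 0;
}

int main(void)
{
    uint64_t paddr;
    if (translate(pm_virtual_base + 2 * PAGE_SIZE + 0x40, &paddr) == 0)
        printf("contiguous virtual offset maps to physical 0x%llx\n",
               (unsigned long long)paddr);
    return 0;
}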
[0046] With reference again to FIG. 1A, a ServerNet SAN operating in its native access validation and translation/block transfer engine (AVT/BTE) mode is an example of a single-address-space SAN 112. Each target on such a SAN presents the same, flat network virtual address space to all components that issue requests to SAN 112, such as processor nodes 104, 106. Network virtual address ranges can be mapped by the target from PM virtual address ranges to PM physical address ranges with page granularity. Network PM virtual address ranges can be exclusively allocated to a single initiator (e.g., processor node 104), and multiple PM virtual addresses can point to the same physical page.
[0047] When processor node 104 requests PMM 140 to open (i.e., allocate and then begin to use) a region of persistent memory in NPMU 102, the NPMU's NI 114 can be programmed by PMM 140 to allow processor node 104 to access the appropriate region. This programming allocates a block of network virtual addresses and maps (i.e., translates) them to a set of physical pages in physical memory. The range of PM virtual addresses can be contiguous regardless of how many pages of PM physical address space are to be accessed. The physical pages can, however, be anywhere within the PM physical memory. Upon successful set-up of the translation, NPMU 102 can notify the requesting processor node 104 of the PM virtual address of the contiguous block. Once open, processor node 104 can access NPMU memory pages by issuing read or write operations to NPMU 102. NPMU 102 can also notify subseque