tarfile - Read and write tar archive files

Source code: Lib/tarfile.py


The tarfile module makes it possible to read and write tar archives, including those using gzip, bz2 and lzma compression. Use the zipfile module to read or write .zip files, or the higher-level functions in shutil.

Some facts and figures:

  • reads and writes gzip, bz2 and lzma compressed archives if the respective modules are available.

  • read/write support for the POSIX.1-1988 (ustar) format.

  • read/write support for the GNU tar format including longname and longlink extensions, read-only support for all variants of the sparse extension including restoration of sparse files.

  • read/write support for the POSIX.1-2001 (pax) format.

  • handles directories, regular files, hardlinks, symbolic links, fifos, character devices and block devices and is able to acquire and restore file information like timestamp, access permissions and owner.

Changed in version 3.3: Added support for lzma compression.

  • tarfile.open(name=None, mode='r', fileobj=None, bufsize=10240, **kwargs)
  • Return a TarFile object for the pathname name. For detailed information on TarFile objects and the keyword arguments that are allowed, see TarFile Objects.

mode has to be a string of the form 'filemode[:compression]'; it defaults to 'r'. Here is a full list of mode combinations:

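mode            action
'r' or 'r:*'    Open for reading with transparent compression (recommended).
'r:'            Open for reading exclusively without compression.
'r:gz'          Open for reading with gzip compression.
'r:bz2'         Open for reading with bzip2 compression.
'r:xz'          Open for reading with lzma compression.
'x' or 'x:'     Create a tarfile exclusively without compression. Raise a FileExistsError exception if the file already exists.
'x:gz'          Create a tarfile with gzip compression. Raise a FileExistsError exception if the file already exists.
'x:bz2'         Create a tarfile with bzip2 compression. Raise a FileExistsError exception if the file already exists.
'x:xz'          Create a tarfile with lzma compression. Raise a FileExistsError exception if the file already exists.
'a' or 'a:'     Open for appending with no compression. The file is created if it does not exist.
'w' or 'w:'     Open for uncompressed writing.
'w:gz'          Open for gzip compressed writing.
'w:bz2'         Open for bzip2 compressed writing.
'w:xz'          Open for lzma compressed writing.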

Note that 'a:gz', 'a:bz2' or 'a:xz' is not possible. If mode is not suitable to open a certain (compressed) file for reading, ReadError is raised. Use mode 'r' to avoid this. If a compression method is not supported, CompressionError is raised.

If fileobj is specified, it is used as an alternative to a file object opened in binary mode for name. It is supposed to be at position 0.

For modes 'w:gz', 'r:gz', 'w:bz2', 'r:bz2', 'x:gz', 'x:bz2', tarfile.open() accepts the keyword argument compresslevel (default 9) to specify the compression level of the file.
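As a minimal sketch of how mode, fileobj and compresslevel fit together, the following writes a gzip-compressed archive into an in-memory buffer and reads it back with transparent compression detection. The member name hello.txt and the payload are assumptions made up for the illustration.

```python
import io
import tarfile

payload = b"hello tar\n"

# Write a gzip-compressed archive into an in-memory buffer.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz", compresslevel=6) as tar:
    info = tarfile.TarInfo(name="hello.txt")  # hypothetical member name
    info.size = len(payload)                  # size must match the data
    tar.addfile(info, io.BytesIO(payload))

# Rewind: tarfile.open() expects fileobj to be at position 0.
buf.seek(0)

# mode="r" (equivalent to "r:*") detects the gzip compression itself.
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()
    data = tar.extractfile("hello.txt").read()

print(names, data)
```

Opening with 'r:' instead of 'r' would be expected to fail with ReadError here, because the buffer holds gzip data rather than a plain tar stream.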

For special purposes, there is a second format for mode: 'filemode|[compression]'. tarfile.open() will return a TarFile object that processes its data as a stream of blocks. No random seeking will be done on the file. If given, fileobj may be any object that has a read() or write() method (depending on the mode). bufsize specifies the blocksize and defaults to 20 * 512 bytes. Use this variant in combination with e.g. sys.stdin, a socket file object or a tape device. However, such a TarFile object is limited in that it does not allow random access, see Examples. The currently possible modes:

mode      action
'r|*'     Open a stream of tar blocks for reading with transparent compression.
'r|'      Open a stream of uncompressed tar blocks for reading.
'r|gz'    Open a gzip compressed stream for reading.
'r|bz2'   Open a bzip2 compressed stream for reading.
'r|xz'    Open an lzma compressed stream for reading.
'w|'      Open an uncompressed stream for writing.
'w|gz'    Open a gzip compressed stream for writing.
'w|bz2'   Open a bzip2 compressed stream for writing.
'w|xz'    Open an lzma compressed stream for writing.

Changed in version 3.5: The 'x' (exclusive creation) mode was added.

Changed in version 3.6: The name parameter accepts a path-like object.
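The stream modes can be tried without a real pipe or socket. The sketch below uses io.BytesIO objects as stand-ins for a stream (an assumption for the demonstration; the file names and contents are made up as well) and processes the members strictly in order, which is all a stream-mode TarFile allows.

```python
import io
import tarfile

files = {"a.txt": b"alpha", "b.txt": b"beta"}  # hypothetical sample data

# "w|gz" writes a gzip-compressed stream; the target only needs write().
sink = io.BytesIO()
with tarfile.open(fileobj=sink, mode="w|gz") as tar:
    for name, content in files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))

# "r|*" reads the stream front to back with transparent decompression.
source = io.BytesIO(sink.getvalue())
seen = {}
with tarfile.open(fileobj=source, mode="r|*") as tar:
    for member in tar:
        # Consume each member before moving on: there is no seeking back.
        seen[member.name] = tar.extractfile(member).read()

print(seen)
```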

  • class tarfile.TarFile
  • Class for reading and writing tar archives. Do not use this class directly: use tarfile.open() instead. See TarFile Objects.
  • tarfile.is_tarfile(name)
  • Return True if name is a tar archive file that the tarfile module can read.
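A short sketch of is_tarfile() in use, assuming two throwaway files in a temporary directory (the file names are hypothetical): one real archive and one plain text file.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "notes.txt")    # hypothetical file names
    tar_path = os.path.join(tmp, "sample.tar")

    with open(txt_path, "w") as fh:
        fh.write("plain text, not an archive\n")

    # Build a small archive that contains the text file.
    with tarfile.open(tar_path, "w") as tar:
        tar.add(txt_path, arcname="notes.txt")

    tar_ok = tarfile.is_tarfile(tar_path)
    txt_ok = tarfile.is_tarfile(txt_path)

print(tar_ok, txt_ok)
```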

The tarfile module defines the following exceptions:

  • exception tarfile.TarError
  • Base class for all tarfile exceptions.
  • exception tarfile.ReadError
  • Is raised when a tar archive is opened that either cannot be handled by the tarfile module or is somehow invalid.
  • exception tarfile.CompressionError
  • Is raised when a compression method is not supported or when the data cannot be decoded properly.
  • exception tarfile.StreamError
  • Is raised for the limitations that are typical for stream-like TarFile objects.
  • exception tarfile.ExtractError
  • Is raised for non-fatal errors when using TarFile.extract(), but only if TarFile.errorlevel == 2.
  • exception tarfile.HeaderError
  • Is raised by TarInfo.frombuf() if the buffer it gets is invalid.
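The sketch below shows how two of these exceptions might be caught in practice. The input is deliberately invalid data held in memory, and "zip" is used as an example of a compression name that tarfile does not know; both are assumptions chosen for the demonstration.

```python
import io
import tarfile

garbage = io.BytesIO(b"this is not a tar archive" * 100)

# Invalid data: opening for reading raises ReadError.
try:
    tarfile.open(fileobj=garbage, mode="r")
    read_outcome = "opened"
except tarfile.ReadError:
    # ReadError derives from TarError, so "except tarfile.TarError"
    # would catch it as well.
    read_outcome = "ReadError"

# Unknown compression method: raises CompressionError.
garbage.seek(0)
try:
    tarfile.open(fileobj=garbage, mode="r:zip")
    comp_outcome = "opened"
except tarfile.CompressionError:
    comp_outcome = "CompressionError"

print(read_outcome, comp_outcome)
```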

The following constants are available at the module level:

Each of the following constants defines a tar archive format that the tarfile module is able to create. See section Supported tar formats for details.

  • tarfile.USTAR_FORMAT
  • POSIX.1-1988 (ustar) format.
  • tarfile.GNU_FORMAT
  • GNU tar format.
  • tarfile.PAX_FORMAT
  • POSIX.1-2001 (pax) format.
  • tarfile.DEFAULT_FORMAT
  • The default format for creating archives. This is currently GNU_FORMAT.

TarFile Objects

The TarFile object provides an interface to a tar archive. A tar archive is a sequence of blocks. An archive member (a stored file) is made up of a header block followed by data blocks. It is possible to store a file in a tar archive several times. Each archive member is represented by a TarInfo object, see TarInfo Objects for details.

A TarFile object can be used as a context manager in a with statement. It will automatically be closed when the block is completed. Please note that in the event of an exception an archive opened for writing will not be finalized; only the internally used file object will be closed. See the Examples section for a use case.

New in version 3.2: Added support for the context management protocol.

  • class tarfile.TarFile(name=None, mode='r', fileobj=None, format=DEFAULT_FORMAT, tarinfo=TarInfo, dereference=False, ignore_zeros=False, encoding=ENCODING, errors='surrogateescape', pax_headers=None, debug=0, errorlevel=0)
  • All following arguments are optional and can be accessed as instance attributes as well.

name is the pathname of the archive. name may be a path-like object. It can be omitted if fileobj is given. In this case, the file object's name attribute is used if it exists.

mode is either 'r' to read from an existing archive, 'a' to append data to an existing file, 'w' to create a new file overwriting an existing one, or 'x' to create a new file only if it does not already exist.

If fileobj is given, it is used for reading or writing data. If it can be determined, mode is overridden by fileobj's mode. fileobj will be used from position 0.

Note

fileobj is not closed when TarFile is closed.

format controls the archive format. It must be one of the constants USTAR_FORMAT, GNU_FORMAT or PAX_FORMAT that are defined at module level.

The tarinfo argument can be used to replace the default TarInfo class with a different one.

If dereference is False, add symbolic and hard links to the archive. If it is True, add the content of the target files to the archive. This has no effect on systems that do not support symbolic links.

If ignore_zeros is False, treat an empty block as the end of the archive. If it is True, skip empty (and invalid) blocks and try to get as many members as possible. This is only useful for reading concatenated or damaged archives.

debug can be set from 0 (no debug messages) up to 3 (all debug messages). The messages are written to sys.stderr.

If errorlevel is 0, all errors are ignored when using TarFile.extract(). Nevertheless, they appear as error messages in the debug output when debugging is enabled. If 1, all fatal errors are raised as OSError exceptions. If 2, all non-fatal errors are raised as TarError exceptions as well.

The encoding and errors arguments define the character encoding to be used for reading or writing the archive and how conversion errors are going to be handled. The default settings will work for most users. See section Unicode issues for in-depth information.

The pax_headers argument is an optional dictionary of strings which will be added as a pax global header if format is PAX_FORMAT.

Changed in version 3.2: Use 'surrogateescape' as the default for the errors argument.

Changed in version 3.5: The 'x' (exclusive creation) mode was added.

Changed in version 3.6: The name parameter accepts a path-like object.
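A minimal sketch of the format and pax_headers arguments, passed through tarfile.open() as keyword arguments. The header key "comment", its value and the member name are assumptions chosen for the illustration.

```python
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(
    fileobj=buf,
    mode="w",
    format=tarfile.PAX_FORMAT,
    pax_headers={"comment": "nightly backup"},  # hypothetical header content
) as tar:
    data = b"payload"
    info = tarfile.TarInfo("data.bin")          # hypothetical member name
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Read the archive back and look at the global pax headers it carries.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    global_headers = dict(tar.pax_headers)

print(global_headers)
```

With USTAR_FORMAT or GNU_FORMAT the pax_headers argument has no effect, since a global header is only written for pax archives.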

  • classmethod TarFile.open()
  • Alternative constructor. The tarfile.open() function is actually a shortcut to this classmethod.
  • TarFile.getmember(name)
  • Return a TarInfo object for member name. If name cannot be found in the archive, KeyError is raised.

Note

If a member occurs more than once in the archive, its last occurrence is assumed to be the most up-to-date version.

  • TarFile.getmembers()
  • Return the members of the archive as a list of TarInfo objects. The list has the same order as the members in the archive.
  • TarFile.getnames()
  • Return the members as a list of their names. It has the same order as the list returned by getmembers().
  • TarFile.list(verbose=True, *, members=None)
  • Print a table of contents to sys.stdout. If verbose is False, only the names of the members are printed. If it is True, output similar to that of ls -l is produced. If optional members is given, it must be a subset of the list returned by getmembers().

Changed in version 3.5: Added the members parameter.
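The lookup methods above can be sketched on a small in-memory archive in which one name is stored twice. The helper add_bytes() and all member names are hypothetical and exist only for this illustration.

```python
import io
import tarfile

def add_bytes(tar, name, data):
    """Hypothetical helper: store data under name in an open archive."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    add_bytes(tar, "config.ini", b"version=1")
    add_bytes(tar, "readme.txt", b"hello")
    add_bytes(tar, "config.ini", b"version=2")  # same name stored twice

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()  # archive order, duplicates included
    # getmember() resolves a duplicated name to its last occurrence.
    latest = tar.extractfile(tar.getmember("config.ini")).read()
    try:
        tar.getmember("missing.txt")
        missing = False
    except KeyError:
        missing = True

print(names, latest, missing)
```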

  • TarFile.next()
  • Return the next member of the archive as a TarInfo object, when TarFile is opened for reading. Return None if no more members are available.
  • TarFile.extractall(path=".", members=None, *, numeric_owner=False)
  • Extract all members from the archive to the current working directory or directory path. If optional members is given, it must be a subset of the list returned by getmembers(). Directory information like owner, modification time and permissions are set after all members have been extracted. This is done to work around two problems: A directory's modification time is reset each time a file is created in it. And, if a directory's permissions do not allow writing, extracting files to it will fail.

If numeric_owner is True, the uid and gid numbers from the tarfile are used to set the owner/group for the extracted files. Otherwise, the named values from the tarfile are used.

Warning

Never extract archives from untrusted sources without prior inspection. It is possible that files are created outside of path, e.g. members that have absolute filenames starting with "/" or filenames with two dots "..".

Changed in version 3.5: Added the numeric_owner parameter.

Changed in version 3.6: The path parameter accepts a path-like object.
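One possible form of the "prior inspection" mentioned in the warning is sketched below: a generator that passes on only regular files and directories whose resolved target stays inside the destination directory. The helper safe_members(), the directory name extract-here and the member names are assumptions for the example. This is not a complete defence and is not part of the tarfile API, and it assumes POSIX-style paths.

```python
import io
import os
import tarfile

def safe_members(tar, dest):
    """Yield only members that would land inside dest (hypothetical helper)."""
    root = os.path.realpath(dest)
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(root, member.name))
        inside = os.path.commonpath([root, target]) == root
        # Conservative choice: also skip links, FIFOs and device nodes.
        if inside and (member.isreg() or member.isdir()):
            yield member

# Build a test archive with one harmless and two suspicious names.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name in ["ok.txt", "../escape.txt", "/abs/evil.txt"]:
        tar.addfile(tarfile.TarInfo(name))  # empty members of size 0

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    allowed = [m.name for m in safe_members(tar, "extract-here")]
    # The filtered members could then be passed on, for example:
    # tar.extractall("extract-here", members=safe_members(tar, "extract-here"))

print(allowed)
```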

  • TarFile.extract(member, path="", set_attrs=True, *, numeric_owner=False)
  • Extract a member from the archive to the current working directory, using its full name. Its file information is extracted as accurately as possible. member may be a filename or a TarInfo object. You can specify a different directory using path. path may be a path-like object. File attributes (owner, mtime, mode) are set unless set_attrs is false.

If numeric_owner is True, the uid and gid numbers from the tarfile are used to set the owner/group for the extracted files. Otherwise, the named values from the tarfile are used.

Note

The extract() method does not take care of several extraction issues. In most cases you should consider using the extractall() method.

Warning

See the warning for extractall().

Changed in version 3.2: Added the set_attrs parameter.

Changed in version 3.5: Added the numeric_owner parameter.

Changed in version 3.6: The path parameter accepts a path-like object.

  • TarFile.extractfile(member)
  • Extract a member from the archive as a file object. member may be a filename or a TarInfo object. If member is a regular file or a link, an io.BufferedReader object is returned. Otherwise, None is returned.

Changed in version 3.3: Return an io.BufferedReader object.
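A brief sketch of extractfile(): it reads a member's content without writing anything to disk, and yields None for a directory entry. The member names are assumptions for the demonstration.

```python
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    folder = tarfile.TarInfo("docs")            # hypothetical directory entry
    folder.type = tarfile.DIRTYPE
    tar.addfile(folder)

    text = b"first line\nsecond line\n"
    info = tarfile.TarInfo("docs/notes.txt")    # hypothetical regular file
    info.size = len(text)
    tar.addfile(info, io.BytesIO(text))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    dir_result = tar.extractfile("docs")        # a directory has no data
    with tar.extractfile("docs/notes.txt") as fh:
        first_line = fh.readline()              # behaves like a binary file

print(dir_result, first_line)
```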

  • TarFile.add(name, arcname=None, recursive=True, *, filter=None)
  • Add the file name to the archive. name may be any type of file (directory, fifo, symbolic link, etc.). If given, arcname specifies an alternative name for the file in the archive. Directories are added recursively by default. This can be avoided by setting recursive to False. Recursion adds entries in sorted order. If filter is given, it should be a function that takes a TarInfo object argument and returns the changed TarInfo object. If it instead returns None, the TarInfo object will be excluded from the archive. See Examples for an example.

Changed in version 3.2: Added the filter parameter.

Changed in version 3.7: Recursion adds entries in sorted order.
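Besides changing attributes, a filter can drop members by returning None. The sketch below adds a temporary directory recursively and leaves out files ending in .pyc. The filter skip_bytecode(), the file names and the arcname "project" are assumptions made up for the illustration.

```python
import io
import os
import tarfile
import tempfile

def skip_bytecode(tarinfo):
    """Exclude *.pyc files; pass everything else through unchanged."""
    if tarinfo.name.endswith(".pyc"):
        return None
    return tarinfo

buf = io.BytesIO()
with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical project layout used only for this demonstration.
    for filename in ["main.py", "main.pyc", "README"]:
        with open(os.path.join(tmp, filename), "w") as fh:
            fh.write("placeholder\n")

    with tarfile.open(fileobj=buf, mode="w") as tar:
        # arcname replaces the temporary directory path inside the archive.
        tar.add(tmp, arcname="project", filter=skip_bytecode)

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()

print(names)
```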

  • TarFile.addfile(tarinfo, fileobj=None)
  • Add the TarInfo object tarinfo to the archive. If fileobj is given, it should be a binary file, and tarinfo.size bytes are read from it and added to the archive. You can create TarInfo objects directly, or by using gettarinfo().
  • TarFile.gettarinfo(name=None, arcname=None, fileobj=None)
  • Create a TarInfo object from the result of os.stat() or equivalent on an existing file. The file is either named by name, or specified as a file object fileobj with a file descriptor. name may be a path-like object. If given, arcname specifies an alternative name for the file in the archive, otherwise, the name is taken from fileobj's name attribute, or the name argument. The name should be a text string.

You can modify some of the TarInfo's attributes before you add it using addfile(). If the file object is not an ordinary file object positioned at the beginning of the file, attributes such as size may need modifying. This is the case for objects such as GzipFile. The name may also be modified, in which case arcname could be a dummy string.

Changed in version 3.6: The name parameter accepts a path-like object.
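The interplay of gettarinfo() and addfile() might look like the following sketch: the header is built from an existing file, adjusted, and then added together with the file's data. The file name, the arcname and the owner names are assumptions for the demonstration.

```python
import io
import os
import tarfile
import tempfile

content = b"quarterly numbers\n"

buf = io.BytesIO()
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "report.txt")      # hypothetical file
    with open(path, "wb") as fh:
        fh.write(content)

    with tarfile.open(fileobj=buf, mode="w") as tar:
        # Build the header from os.stat() data, then adjust it before adding.
        info = tar.gettarinfo(path, arcname="reports/q1.txt")
        info.mtime = 0                          # normalise the timestamp
        info.uname = info.gname = "nobody"      # hypothetical owner names
        with open(path, "rb") as fh:
            tar.addfile(info, fh)               # reads info.size bytes

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    member = tar.getmember("reports/q1.txt")
    stored = (member.size, member.mtime, member.uname)

print(stored)
```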

  • TarFile.close()
  • Close the TarFile. In write mode, two finishing zero blocks are appended to the archive.
  • TarFile.pax_headers
  • A dictionary containing key-value pairs of pax global headers.

TarInfo Objects

A TarInfo object represents one member in a TarFile. Aside from storing all required attributes of a file (like file type, size, time, permissions, owner etc.), it provides some useful methods to determine its type. It does not contain the file's data itself.

TarInfo objects are returned by TarFile's methods getmember(), getmembers() and gettarinfo().

  • class tarfile.TarInfo(name="")
  • Create a TarInfo object.
  • classmethod TarInfo.frombuf(buf, encoding, errors)
  • Create and return a TarInfo object from string buffer buf.

Raises HeaderError if the buffer is invalid.

  • classmethod TarInfo.fromtarfile(tarfile)
  • Read the next member from the TarFile object tarfile and return it as a TarInfo object.
  • TarInfo.tobuf(format=DEFAULT_FORMAT, encoding=ENCODING, errors='surrogateescape')
  • Create a string buffer from a TarInfo object. For information on the arguments see the constructor of the TarFile class.

Changed in version 3.2: Use 'surrogateescape' as the default for the errors argument.

A TarInfo object has the following public data attributes:

  • TarInfo.name
  • Name of the archive member.
  • TarInfo.size
  • Size in bytes.
  • TarInfo.mtime
  • Time of last modification.
  • TarInfo.mode
  • Permission bits.
  • TarInfo.type
  • File type. type is usually one of these constants: REGTYPE, AREGTYPE, LNKTYPE, SYMTYPE, DIRTYPE, FIFOTYPE, CONTTYPE, CHRTYPE, BLKTYPE, GNUTYPE_SPARSE. To determine the type of a TarInfo object more conveniently, use the is*() methods below.
  • TarInfo.linkname
  • Name of the target file name, which is only present in TarInfo objects of type LNKTYPE and SYMTYPE.
  • TarInfo.uid
  • User ID of the user who originally stored this member.
  • TarInfo.gid
  • Group ID of the user who originally stored this member.
  • TarInfo.uname
  • User name.
  • TarInfo.gname
  • Group name.
  • TarInfo.pax_headers
  • A dictionary containing key-value pairs of an associated pax extended header.

A TarInfo object also provides some convenient query methods:

  • TarInfo.isfile()
  • Return True if the TarInfo object is a regular file.
  • TarInfo.isdir()
  • Return True if it is a directory.
  • TarInfo.issym()
  • Return True if it is a symbolic link.
  • TarInfo.islnk()
  • Return True if it is a hard link.
  • TarInfo.ischr()
  • Return True if it is a character device.
  • TarInfo.isblk()
  • Return True if it is a block device.
  • TarInfo.isfifo()
  • Return True if it is a FIFO.
  • TarInfo.isdev()
  • Return True if it is one of character device, block device or FIFO.
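The query methods can be combined into a small classifier, sketched here on an in-memory archive whose members are created directly as TarInfo objects. The function describe() and all member names are hypothetical.

```python
import io
import tarfile

def describe(member):
    """Map a TarInfo object to a short label using its query methods."""
    if member.isfile():
        return "file"
    if member.isdir():
        return "directory"
    if member.issym():
        return "symlink"
    if member.islnk():
        return "hardlink"
    if member.isdev():
        return "device or FIFO"
    return "other"

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    tar.addfile(tarfile.TarInfo("plain.txt"))   # REGTYPE is the default type

    folder = tarfile.TarInfo("folder")
    folder.type = tarfile.DIRTYPE
    tar.addfile(folder)

    link = tarfile.TarInfo("shortcut")
    link.type = tarfile.SYMTYPE
    link.linkname = "plain.txt"                 # target of the symbolic link
    tar.addfile(link)

    pipe = tarfile.TarInfo("queue")
    pipe.type = tarfile.FIFOTYPE
    tar.addfile(pipe)

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    kinds = {m.name: describe(m) for m in tar}

print(kinds)
```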

Command-Line Interface

New in version 3.4.

The tarfile module provides a simple command-line interface to interact with tar archives.

If you want to create a new tar archive, specify its name after the -c option and then list the filename(s) that should be included:

    $ python -m tarfile -c monty.tar spam.txt eggs.txt

Passing a directory is also acceptable:

    $ python -m tarfile -c monty.tar life-of-brian_1979/

If you want to extract a tar archive into the current directory, use the -e option:

    $ python -m tarfile -e monty.tar

You can also extract a tar archive into a different directory by passing the directory's name:

    $ python -m tarfile -e monty.tar other-dir/

For a list of the files in a tar archive, use the -l option:

    $ python -m tarfile -l monty.tar

Command-line options

  • -l <tarfile>
  • --list <tarfile>
  • List files in a tarfile.
  • -c <tarfile> <source1> … <sourceN>
  • --create <tarfile> <source1> … <sourceN>
  • Create tarfile from source files.
  • -e <tarfile> [<output_dir>]
  • --extract <tarfile> [<output_dir>]
  • Extract tarfile into the current directory if output_dir is not specified.
  • -t <tarfile>
  • --test <tarfile>
  • Test whether the tarfile is valid or not.
  • -v, --verbose
  • Verbose output.

Examples

How to extract an entire tar archive to the current working directory:

    import tarfile

    tar = tarfile.open("sample.tar.gz")
    tar.extractall()
    tar.close()

How to extract a subset of a tar archive with TarFile.extractall() using a generator function instead of a list:

    import os
    import tarfile

    def py_files(members):
        # Yield only the members whose name ends in ".py".
        for tarinfo in members:
            if os.path.splitext(tarinfo.name)[1] == ".py":
                yield tarinfo

    tar = tarfile.open("sample.tar.gz")
    tar.extractall(members=py_files(tar))
    tar.close()

How to create an uncompressed tar archive from a list of filenames:

    import tarfile

    tar = tarfile.open("sample.tar", "w")
    for name in ["foo", "bar", "quux"]:
        tar.add(name)
    tar.close()

The same example using the with statement:

    import tarfile

    with tarfile.open("sample.tar", "w") as tar:
        for name in ["foo", "bar", "quux"]:
            tar.add(name)

How to read a gzip compressed tar archive and display some member information:

    import tarfile

    tar = tarfile.open("sample.tar.gz", "r:gz")
    for tarinfo in tar:
        # end=" " keeps the sentence on one line with a space before the type.
        print(tarinfo.name, "is", tarinfo.size, "bytes in size and is", end=" ")
        if tarinfo.isreg():
            print("a regular file.")
        elif tarinfo.isdir():
            print("a directory.")
        else:
            print("something else.")
    tar.close()

How to create an archive and reset the user information using the filter parameter in TarFile.add():

    import tarfile

    def reset(tarinfo):
        # Store every member as owned by root.
        tarinfo.uid = tarinfo.gid = 0
        tarinfo.uname = tarinfo.gname = "root"
        return tarinfo

    tar = tarfile.open("sample.tar.gz", "w:gz")
    tar.add("foo", filter=reset)
    tar.close()

Supported tar formats

There are three tar formats that can be created with the tarfile module:

  • The POSIX.1-1988 ustar format (USTAR_FORMAT). It supports filenames up to a length of at best 256 characters and linknames up to 100 characters. The maximum file size is 8 GiB. This is an old and limited but widely supported format.

  • The GNU tar format (GNU_FORMAT). It supports long filenames and linknames, files bigger than 8 GiB and sparse files. It is the de facto standard on GNU/Linux systems. tarfile fully supports the GNU tar extensions for long names; sparse file support is read-only.

  • The POSIX.1-2001 pax format (PAX_FORMAT). It is the most flexible format with virtually no limits. It supports long filenames and linknames, large files and stores pathnames in a portable way. However, not all tar implementations today are able to handle pax archives properly.

The pax format is an extension to the existing ustar format. It uses extra headers for information that cannot be stored otherwise. There are two flavours of pax headers: Extended headers only affect the subsequent file header, global headers are valid for the complete archive and affect all following files. All the data in a pax header is encoded in UTF-8 for portability reasons.

There are some more variants of the tar format which can be read, but not created:

  • The ancient V7 format. This is the first tar format from Unix Seventh Edition, storing only regular files and directories. Names must not be longer than 100 characters, there is no user/group name information. Some archives have miscalculated header checksums in case of fields with non-ASCII characters.

  • The SunOS tar extended format. This format is a variant of the POSIX.1-2001 pax format, but is not compatible.
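The filename limit of the ustar format can be observed with a short experiment. The sketch below tries to store one 300-character name (an arbitrary length chosen for the test, with no "/" on which the name could be split) in each of the three writable formats. The helper try_format() is hypothetical.

```python
import io
import tarfile

long_name = "x" * 300  # too long for ustar, and not splittable at a "/"

def try_format(fmt):
    """Return the stored member name, or the error text for this format."""
    buf = io.BytesIO()
    try:
        with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
            tar.addfile(tarfile.TarInfo(long_name))
    except ValueError as exc:
        return "ValueError: %s" % exc
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        return tar.getnames()[0]

ustar_result = try_format(tarfile.USTAR_FORMAT)
gnu_result = try_format(tarfile.GNU_FORMAT)
pax_result = try_format(tarfile.PAX_FORMAT)

print(ustar_result)
print(gnu_result == long_name, pax_result == long_name)
```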

Unicode issues

The tar format was originally conceived to make backups on tape drives with the main focus on preserving file system information. Nowadays tar archives are commonly used for file distribution and exchanging archives over networks. One problem of the original format (which is the basis of all other formats) is that there is no concept of supporting different character encodings. For example, an ordinary tar archive created on a UTF-8 system cannot be read correctly on a Latin-1 system if it contains non-ASCII characters. Textual metadata (like filenames, linknames, user/group names) will appear damaged. Unfortunately, there is no way to autodetect the encoding of an archive. The pax format was designed to solve this problem. It stores non-ASCII metadata using the universal character encoding UTF-8.

The details of character conversion in tarfile are controlled by the encoding and errors keyword arguments of the TarFile class.

encoding defines the character encoding to use for the metadata in the archive. The default value is sys.getfilesystemencoding() or 'ascii' as a fallback. Depending on whether the archive is read or written, the metadata must be either decoded or encoded. If encoding is not set appropriately, this conversion may fail.

The errors argument defines how characters are treated that cannot be converted. Possible values are listed in section Error Handlers. The default scheme is 'surrogateescape' which Python also uses for its file system calls, see File Names, Command Line Arguments, and Environment Variables.

In case of PAX_FORMAT archives, encoding is generally not needed because all the metadata is stored using UTF-8. encoding is only used in the rare cases when binary pax headers are decoded or when strings with surrogate characters are stored.
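The effect of a mismatched encoding can be reproduced in memory. The sketch below writes one member with a non-ASCII name and reads it back under a different encoding, once as a GNU archive and once as a pax archive. The helper roundtrip() and the file name are assumptions for the demonstration.

```python
import io
import tarfile

name = "caf\u00e9.txt"  # "café.txt", written with an escape for portability

def roundtrip(fmt, write_encoding, read_encoding):
    """Write one empty member, then read its name back (hypothetical helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt,
                      encoding=write_encoding) as tar:
        tar.addfile(tarfile.TarInfo(name))
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r", encoding=read_encoding) as tar:
        return tar.getnames()[0]

gnu_same = roundtrip(tarfile.GNU_FORMAT, "utf-8", "utf-8")
gnu_mixed = roundtrip(tarfile.GNU_FORMAT, "utf-8", "iso8859-1")
pax_mixed = roundtrip(tarfile.PAX_FORMAT, "utf-8", "iso8859-1")

# The GNU archive read as Latin-1 shows a damaged name; pax is unaffected.
print(gnu_same == name, gnu_mixed == name, pax_mixed == name)
```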