Inside TensorFlow: TF Filesystems

  • MIHAI MARUSEAC: I am Mihai.

  • I've been working recently on file systems for TensorFlow.

  • So I'm going to talk about what TensorFlow

  • has in file system support and what

  • new changes are coming in.

  • First, the TensorFlow file system API can

  • be used in Python like this.

  • So we can create directories,

  • we can create files, read or write to them,

  • and we can see what's in a directory.

  • And you might say, isn't this similar to Python?

  • If you compare it with Python's own file API,

  • it mostly looks the same--

  • just some names are changed.

  • There is one difference in mkdir.

  • In Python, the directory must not already exist,

  • but there is a flag that changes this.

  • Even with that said, they still look similar.
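
For illustration, here is roughly how the same operations look through the C++ Env API that the rest of this talk builds up to (a minimal sketch; paths and file names here are made up):

```cpp
#include <string>
#include <vector>

#include "tensorflow/core/platform/env.h"

void Demo() {
  tensorflow::Env* env = tensorflow::Env::Default();

  // Create a directory (fails if it already exists).
  TF_CHECK_OK(env->CreateDir("/tmp/demo"));

  // Write a string to a file, then read it back.
  TF_CHECK_OK(tensorflow::WriteStringToFile(env, "/tmp/demo/hello.txt", "hello"));
  std::string contents;
  TF_CHECK_OK(tensorflow::ReadFileToString(env, "/tmp/demo/hello.txt", &contents));

  // See what's in the directory.
  std::vector<std::string> children;
  TF_CHECK_OK(env->GetChildren("/tmp/demo", &children));
}
```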

  • You might ask yourself, why does TensorFlow need its own file

  • system implementation?

  • And one of the main reasons comes from the file systems

  • that TensorFlow supports, and from the formats

  • of the filename parameter.

  • In TensorFlow, you can pass a normal path to a file,

  • or you can pass something that looks like a URL--

  • usually called a URI,

  • a uniform resource identifier.

  • And they are divided into three parts--

  • there's the scheme part, like https, gs, s3, and so on;

  • the host if it's on a remote host;

  • and then the path, which is like a normal file

  • path on that host.
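
As a sketch of how TensorFlow splits such a path internally, there is a ParseURI helper (in tensorflow/core/platform/path.h, if memory serves) that produces exactly these three parts:

```cpp
#include "tensorflow/core/platform/path.h"
#include "tensorflow/core/platform/stringpiece.h"

void ParseDemo() {
  tensorflow::StringPiece scheme, host, path;

  // "gs://bucket/dir/object" -> scheme="gs", host="bucket", path="/dir/object".
  tensorflow::io::ParseURI("gs://bucket/dir/object", &scheme, &host, &path);

  // A plain path has no scheme and no host:
  // "/tmp/file" -> scheme="", host="", path="/tmp/file".
  tensorflow::io::ParseURI("/tmp/file", &scheme, &host, &path);
}
```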

  • For TensorFlow, we support multiple schemes.

  • And what we have is a mapping between schemes and a file

  • system implementation.

  • For example, for file, we have a LocalFileSystem.

  • For GS, we have a GoogleCloudFileSystem.

  • For viewfs, the Hadoop file system, and so on.

  • Keep in mind this mapping because this

  • is the core aspect for why TensorFlow needs to have

  • its own file system implementation.

  • However, this is not the only reason.

  • We have a lot of use cases in TensorFlow code.

  • So besides the basic file I/O that I showed

  • in the first example, we also need to save or load models,

  • we need to checkpoint, we need to dump tensors

  • to files for debugging, we need to parse images

  • and other inputs, and tf.data datasets

  • also touch the file systems.

  • All of this can be implemented in classes,

  • but the best way to implement them

  • would be in a layered approach.

  • You have the base layer where you

  • have the mapping between the scheme and the file system

  • implementation.

  • And then at the top, you have implementation

  • for each of these use cases.

  • And in this talk, I'm going to present all of these layers.

  • But keep in mind that these layers are only

  • for the purposes of this presentation.

  • The code grew organically;

  • it is not layered exactly this way in practice.

  • So let's start with the high level API, which

  • is mostly what the users see.

  • It's what they want to do.

  • So when the user wants to load a saved model,

  • the user wants to open a file that

  • contains the saved model, read from the file, load

  • inputs, tensors, and everything.

  • The user would just call a single function.

  • That's why I'm calling this high level API.

  • And in this case, I am going to present some examples of this.

  • For example, API generation.

  • Whenever you are building TensorFlow

  • while building the pip package,

  • we are creating several protocol buffer

  • files that contain the API symbols that TensorFlow

  • exports.

  • In order to create those files, we

  • are basically calling this function CreateApiDefs,

  • which needs to dump everything into those files.

  • Another example of high level API

  • is DebugFileIO where you can dump tensors into a directory

  • and then you can later review them

  • and debug your application.

  • And there are a few more, like loading saved models.

  • You see that loading a saved model needs an export

  • directory to read from.

  • Or many others-- checkpointing, including checkpointing

  • of sliced variables across distributed replicas,

  • tf.data datasets, and so on.

  • The question is not how many APIs

  • are available at the high level, but what

  • do they have in common?

  • In principle, we need to write to files, read from files,

  • get statistics about them, create and remove directories,

  • and so on.

  • But we also need to support compression.

  • We also need to support buffered I/O,

  • like reading only a part of the file,

  • and then later read another part of the file instead

  • of fitting everything in memory, and so on.

  • Most of these implementations come

  • from the next layer, which I am going

  • to call the convenience API.

  • It's similar to the middleware layer in a web application.

  • So it's mostly transforming from the bytes that

  • are on the disk to the information

  • that the user would use in the high level API.

  • Basically, 99% of the use cases are just

  • calling these six functions--

  • reading or writing a string to a file,

  • or reading or writing a proto-- either text or binary.
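
A sketch of those six calls, as they appear in tensorflow/core/platform/env.h (the proto argument stands in for any protocol buffer message):

```cpp
#include <string>

#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/platform/protobuf.h"

void ConvenienceDemo(tensorflow::Env* env, tensorflow::protobuf::Message* proto) {
  std::string data;
  TF_CHECK_OK(tensorflow::ReadFileToString(env, "f.txt", &data));   // 1. string in
  TF_CHECK_OK(tensorflow::WriteStringToFile(env, "f.txt", data));   // 2. string out
  TF_CHECK_OK(tensorflow::ReadBinaryProto(env, "f.pb", proto));     // 3. binary proto in
  TF_CHECK_OK(tensorflow::WriteBinaryProto(env, "f.pb", *proto));   // 4. binary proto out
  TF_CHECK_OK(tensorflow::ReadTextProto(env, "f.pbtxt", proto));    // 5. text proto in
  TF_CHECK_OK(tensorflow::WriteTextProto(env, "f.pbtxt", *proto));  // 6. text proto out
}
```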

  • However, there are other use cases,

  • like I mentioned before, for streaming

  • and for buffered and compressed I/O, where we have

  • this InputStreamInterface class whose implementations

  • provide streaming or compression.

  • So the SnappyInputBuffer and ZlibInputBuffer

  • read from compressed data, and the MemoryInputStream

  • and BufferedInputStream read in a streamed fashion,

  • and so on.

  • The InputBuffer class here allows

  • you to read just a single int from a file,

  • then another int, then a string,

  • and so on.

  • Like, you read chunks of data.
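
A minimal sketch of that kind of chunked reading with InputBuffer (the file name is made up):

```cpp
#include <memory>
#include <string>

#include "tensorflow/core/lib/io/inputbuffer.h"
#include "tensorflow/core/platform/env.h"

void ChunkedReadDemo(tensorflow::Env* env) {
  std::unique_ptr<tensorflow::RandomAccessFile> file;
  TF_CHECK_OK(env->NewRandomAccessFile("/tmp/demo/data.bin", &file));

  // Reads go through an 8 KB buffer instead of loading the whole file.
  tensorflow::io::InputBuffer in(file.get(), 8 << 10);

  std::string chunk;
  TF_CHECK_OK(in.ReadNBytes(4, &chunk));   // e.g. a 4-byte integer
  TF_CHECK_OK(in.ReadNBytes(16, &chunk));  // then a 16-byte record, and so on
}
```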

  • All of these APIs at the convenience level

  • are all implemented in the same way in the next layer, which

  • is the low level API.

  • And that's the one that we are mostly interested in.

  • Basically, this level is the one that needs to support multiple

  • platforms, supports all the URI schemes that we currently

  • support, has to support the directory I/O-- so far,

  • I haven't talked about directory operations in the higher level

  • APIs--

  • and also lets users get into this level

  • and create their own implementations in case

  • they need something that is not implemented so far.

  • If you remember from the beginning,

  • we had this file system registry where

  • you had the scheme of the URI and a mapping

  • between that scheme and the file system implementation.

  • This is implemented as a FileSystemRegistry

  • class, which is basically a dictionary.

  • You can add a value to the dictionary,

  • you can see what value is at a specific key,

  • or you can list all the keys that are in that dictionary.

  • That's all this class does.
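
Conceptually, the registry is just this (a hand-written sketch, not the real class, which is an abstract interface in env.h):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

class FileSystem;  // the implementation class described later in the talk

// Sketch: a dictionary from URI scheme to file system implementation.
class FileSystemRegistrySketch {
 public:
  // Add a value at a key.
  void Register(const std::string& scheme, FileSystem* fs) {
    registry_[scheme] = fs;
  }

  // See what value is at a specific key.
  FileSystem* Lookup(const std::string& scheme) const {
    auto it = registry_.find(scheme);
    return it == registry_.end() ? nullptr : it->second;
  }

  // See all the keys in the dictionary.
  std::vector<std::string> GetRegisteredSchemes() const {
    std::vector<std::string> schemes;
    for (const auto& entry : registry_) schemes.push_back(entry.first);
    return schemes;
  }

 private:
  std::unordered_map<std::string, FileSystem*> registry_;
};
```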

  • And it is used in the next class,

  • the environment class--

  • Env-- which provides the cross-platform support.

  • So we have a WindowsEnv and a PosixEnv.

  • WindowsEnv when you are compiling

  • and using TensorFlow on Windows,

  • PosixEnv when you are using it on the other platforms.

  • And then there are some other Env classes for testing,

  • but let's ignore them for the rest of the talk.

  • The purpose of the Env class is to provide every low level

  • API that the user needs.

  • So, for example, we have the registration-- in this case,

  • like getting the file system for a file,

  • getting all the schemes that are supported,

  • or registering a new file system.

  • And of particular note is the static member Default,

  • which allows a developer to write Env::Default()

  • anywhere in the C++ code and get access to this class.

  • Basically, it's like a singleton pattern.

  • So if you need to register a file system somewhere

  • in your function and it's not registered,

  • you just call Env::Default()->RegisterFileSystem.
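
A minimal sketch of that kind of access (the gs lookup assumes the GCS file system is linked in; otherwise it returns an error):

```cpp
#include <string>
#include <vector>

#include "tensorflow/core/platform/env.h"

void RegistryDemo() {
  tensorflow::Env* env = tensorflow::Env::Default();  // singleton-style access

  // Which URI schemes have registered file systems?
  std::vector<std::string> schemes;
  TF_CHECK_OK(env->GetRegisteredFileSystemSchemes(&schemes));

  // Which FileSystem implementation handles this path?
  tensorflow::FileSystem* fs = nullptr;
  TF_CHECK_OK(env->GetFileSystemForFile("gs://bucket/object", &fs));
}
```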

  • Other functionality in Env is the actual file system

  • implementation, so creating files.

  • You see there are three types of files.

  • So random access files, writable files,

  • and read-only memory regions.

  • The read-only memory regions are files

  • that are mapped in memory on a memory page,

  • and then you can just read directly from memory.

  • There are two ways to write to files.

  • Either you overwrite the entire contents,

  • or you append at the end.

  • So that's why you have two constructors

  • for writable files-- the NewWritableFile

  • and the NewAppendableFile.
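
A minimal sketch of the three file types in use (paths made up):

```cpp
#include <memory>

#include "tensorflow/core/platform/env.h"

void FileTypesDemo(tensorflow::Env* env) {
  // Writable file: overwrites the entire contents.
  std::unique_ptr<tensorflow::WritableFile> wf;
  TF_CHECK_OK(env->NewWritableFile("/tmp/demo/log", &wf));
  TF_CHECK_OK(wf->Append("first line\n"));
  TF_CHECK_OK(wf->Close());

  // Appendable file: adds at the end of the existing contents.
  std::unique_ptr<tensorflow::WritableFile> af;
  TF_CHECK_OK(env->NewAppendableFile("/tmp/demo/log", &af));
  TF_CHECK_OK(af->Append("second line\n"));
  TF_CHECK_OK(af->Close());

  // Random access file: read n bytes starting at a given offset.
  std::unique_ptr<tensorflow::RandomAccessFile> raf;
  TF_CHECK_OK(env->NewRandomAccessFile("/tmp/demo/log", &raf));
  char scratch[16];
  tensorflow::StringPiece result;
  TF_CHECK_OK(raf->Read(/*offset=*/0, /*n=*/5, &result, scratch));

  // Read-only memory region: the file mapped into memory.
  std::unique_ptr<tensorflow::ReadOnlyMemoryRegion> region;
  TF_CHECK_OK(env->NewReadOnlyMemoryRegionFromFile("/tmp/demo/log", &region));
  // region->data() and region->length() expose the mapped bytes.
}
```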

  • More functionality in Env covers creating or removing

  • directories, moving files around-- basically everything

  • that is a directory operation.

  • Furthermore, the next ones are determining the files

  • existing in your directory, determining all the files that

  • match a specific pattern, or getting information

  • about a specific path entry--

  • if it exists, if it is a directory, what is its size,

  • and so on.
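
Sketch of those directory-level calls (paths made up):

```cpp
#include <string>
#include <vector>

#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/platform/file_statistics.h"

void DirOpsDemo(tensorflow::Env* env) {
  // Create, move, remove.
  TF_CHECK_OK(env->CreateDir("/tmp/demo/sub"));
  TF_CHECK_OK(env->RenameFile("/tmp/demo/a.txt", "/tmp/demo/sub/a.txt"));

  // All files matching a pattern.
  std::vector<std::string> matches;
  TF_CHECK_OK(env->GetMatchingPaths("/tmp/demo/sub/*.txt", &matches));

  // Information about a path entry: existence, kind, size.
  TF_CHECK_OK(env->FileExists("/tmp/demo/sub/a.txt"));
  tensorflow::FileStatistics stat;
  TF_CHECK_OK(env->Stat("/tmp/demo/sub/a.txt", &stat));
  // stat.is_directory, stat.length, stat.mtime_nsec are now filled in.

  TF_CHECK_OK(env->DeleteFile("/tmp/demo/sub/a.txt"));
  TF_CHECK_OK(env->DeleteDir("/tmp/demo/sub"));
}
```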

  • All of these are implemented by each individual file system,

  • but I'm going to get to that soon.

  • There is other functionality that Env contains,

  • but it is out of scope for this talk.

  • So Env also supports threading, supports an API for a clock,

  • getting information about your time, loading shared libraries,

  • and so on.

  • We are not concerned about this, but I just

  • wanted to mention them for completeness.

  • As I mentioned, there are three different types

  • that we support--

  • the random access file, the writable file,

  • and the read-only memory region, and this is the current API

  • that they support.

  • The first two files have a name, and then they have operations,

  • like read/write.

  • And the memory region, since it's already mapped in memory,

  • you don't need a name for it.

  • You only need to see how long it is

  • and what is the data in there.

  • I'm going to come back to these three file types later.

  • That's why I wanted to mention them here.

  • Finally, the last important class

  • is the FileSystem class, which actually

  • does the real implementation.

  • So this is where we implement how

  • to create a file, how to read from a file,

  • and everything else.
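
Roughly, implementing a file system means subclassing and overriding these operations (a partial sketch of the interface around the time of this talk; later versions add parameters):

```cpp
#include <memory>
#include <string>
#include <vector>

#include "tensorflow/core/platform/file_system.h"

// Hypothetical backend: override how each operation is actually done.
class MyFileSystem : public tensorflow::FileSystem {
 public:
  tensorflow::Status NewRandomAccessFile(
      const std::string& fname,
      std::unique_ptr<tensorflow::RandomAccessFile>* result) override;
  tensorflow::Status NewWritableFile(
      const std::string& fname,
      std::unique_ptr<tensorflow::WritableFile>* result) override;
  tensorflow::Status FileExists(const std::string& fname) override;
  tensorflow::Status GetChildren(const std::string& dir,
                                 std::vector<std::string>* result) override;
  tensorflow::Status Stat(const std::string& fname,
                          tensorflow::FileStatistics* stat) override;
  tensorflow::Status CreateDir(const std::string& dirname) override;
  tensorflow::Status DeleteFile(const std::string& fname) override;
  // ... plus the remaining operations discussed above.
};
```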

  • In order to support TensorFlow in other languages,

  • we also need to provide the C API interface that language

  • bindings can link against.

  • And this is very simple at the moment.

  • It's just providing the same functions

  • that we already saw, except they use C types

  • and they use some other markers in the signature

  • to mark that they are symbols exported in a shared library.
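
For flavor, here is what calling that C layer looks like (TF_CreateDir is from tensorflow/c/env.h, if I recall the header correctly):

```cpp
#include "tensorflow/c/c_api.h"  // TF_Status helpers
#include "tensorflow/c/env.h"    // C wrappers over the file system

void CApiDemo() {
  TF_Status* status = TF_NewStatus();
  TF_CreateDir("/tmp/demo_c", status);
  if (TF_GetCode(status) != TF_OK) {
    // Handle the error message in TF_Message(status).
  }
  TF_DeleteStatus(status);
}
```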

  • This C API interface is not complete.

  • For example, it doesn't support random access files.

  • So it doesn't let you read from files

  • from other languages, except if that language binds

  • directly over the FileSystem class

  • that I showed you in a previous slide.

  • OK.

  • That is all about the file system support

  • as it exists in the current implementation.

  • However, there is now work to modernize the TensorFlow file

  • system support in order to reduce our complexity.

  • And when I'm speaking about complexity,

  • I am thinking about this diagram where

  • you have the FileSystem class that I showed you,

  • and all of these implementations of it.

  • So we have support for POSIX, support for Windows,

  • support for Hadoop, S3, GCS, and many others,

  • and then a lot of test file systems.

  • Each one of them is implemented in a different way.

  • Some of them are not compatible.

  • So some of them follow some guidelines, others don't.

  • But the main thing is, whenever you build a package,

  • you need to compile all of this into the binary.

  • Whenever you compile something that needs access

  • to the file system, you need to compile all of this.

  • That's something that we want to avoid in the future,

  • and this is what the modular

  • TensorFlow approach tries to address.

  • So indeed, this is the diagram of the world

  • we want to live in.

  • We want to have Core TensorFlow--

  • which has a plugin registry, which

  • is basically the file system registry that I showed before--

  • and we want to also have plugins that implement

  • file system functionality.

  • If you don't need to support Hadoop file system,

  • you won't need to compile the Hadoop file system

  • plugin and so on.

  • The plugin registry, as I said, is similar to the file system

  • registry.

  • So you have the name of the plugin or the scheme

  • that you want to implement mapped to the plugin that

  • implements the file system.

  • So to summarize, the modular file system

  • goals are to reduce the compile time since you no longer need

  • to provide support for all the file systems

  • that we are currently supporting,

  • you only compile the ones that you need;

  • we also want to provide a full, complete C API

  • interface instead of the partial one that is provided at the moment.

  • We also want to provide an extensive test

  • suite for all the file systems.

  • As soon as somebody develops a new file system,

  • they can run our test suite and see where they fail.

  • So we have a lot of preconditions

  • and postconditions that the file system operations need

  • to satisfy implemented in this test suite,

  • and whenever somebody implements a new file system,

  • they just test against that.

  • Furthermore, because each developer

  • can create their own file system and that

  • is no longer going to be compiled by TensorFlow,

  • we also need to provide some version guarantees.

  • When we change our TensorFlow version,

  • we cannot ask every file system developer to also recompile

  • their code.

  • That's why we also need to provide these guarantees.

  • OK, so let's now see how a plugin is going

  • to be loaded into TensorFlow.

  • As I said before, Env has an API that loads

  • a shared library from disk.

  • It can either load all the shared objects that

  • are in some specific directory at TensorFlow runtime startup,

  • or a user can request TensorFlow to load the shared object

  • from a specific path.

  • In both cases, as soon as the shared object is loaded,

  • TensorFlow Core is going to look for the tf_InitPlugin symbol.

  • That's a symbol that the file system

  • plugin needs to implement because TensorFlow

  • is going to call it.

  • This symbol is where the plugin

  • registers all of its implementations

  • for the file system and hands them to TensorFlow.

  • We provide an API interface for plugins

  • that they need to follow.

  • And this interface has structures

  • with function pointers for every piece of functionality

  • that we currently provide.

  • So, since we have three file types,

  • we're going to have three function

  • tables for their operations.

  • And since we have one FileSystem class,

  • the interface is going to have one more function

  • table for the operations that the FileSystem

  • class must support.

  • All of these functions here are documented.

  • They list the preconditions

  • and the postconditions, and everything here is just C.
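
As a sketch, one of those function tables looks like this (modeled on what eventually became filesystem_interface.h; exact field names may differ):

```cpp
#include <stddef.h>
#include <stdint.h>

typedef struct TF_Status TF_Status;  // opaque status type from the C API

// Opaque handle: the plugin stores its per-file state behind a void*.
typedef struct TF_RandomAccessFile {
  void* plugin_file;
} TF_RandomAccessFile;

// Plain C table of function pointers for random access file operations.
typedef struct TF_RandomAccessFileOps {
  // Releases the plugin's state for this file.
  void (*cleanup)(TF_RandomAccessFile* file);
  // Reads up to `n` bytes starting at `offset` into `buffer`; returns
  // the number of bytes read and reports errors through `status`.
  int64_t (*read)(const TF_RandomAccessFile* file, uint64_t offset, size_t n,
                  char* buffer, TF_Status* status);
} TF_RandomAccessFileOps;
```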

  • Then the next section that we have in our interface--

  • which the plugins must implement--

  • is the metadata for versioning and compatibility.

  • For every structure-- so those three structures for the files

  • and one structure for the file system--

  • we have three numbers.

  • One is the API number, and the other one is the ABI number--

  • application binary interface-- and then the third one is

  • the total size of the function table--

  • the total number of function pointers that we

  • provide for that structure.
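
Continuing the sketch above, the metadata for one structure looks like this (constant names approximate):

```cpp
#include <stddef.h>

// Three numbers per operations table: the API version, the ABI version,
// and the total size of the function table the plugin was compiled with.
constexpr int TF_RANDOM_ACCESS_FILE_OPS_API = 0;
constexpr int TF_RANDOM_ACCESS_FILE_OPS_ABI = 0;
constexpr size_t TF_RANDOM_ACCESS_FILE_OPS_SIZE = sizeof(TF_RandomAccessFileOps);
```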

  • We have two numbers-- API and ABI-- in order

  • to cover cases where we accidentally

  • break compatibility.

  • For example-- going back to the read method--

  • if you reorder the offset and the n parameters,

  • that is an ABI compatibility break,

  • because any code that calls that function

  • expects the offset and the number of bytes to read

  • to sit at specific positions in the signature.

  • If you swap them, the code is not going to behave properly.

  • This is breaking the binary compatibility.

  • For API compatibility: if you add a new method

  • to the random access file, that changes the API number.

  • OK.

  • And after the plugins fill in the data structures

  • with their own implementations, they

  • are going to call this function RegisterFilesystemPlugin.

  • However, you see that RegisterFilesystemPlugin

  • has a lot of parameters.

  • It has three metadata values

  • for each of four structures, so that's 12 parameters in the beginning,

  • and then the structures with the operations.

  • Because we don't want plugin authors to manually fill

  • in all of these parameters, we provide

  • this TF_REGISTER_FILESYSTEM_PLUGIN

  • macro to which you only pass the structures

  • that you care about--

  • the structures that implement the file system.
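
Continuing the same sketch, a plugin's entry point then looks roughly like this (RegisterFilesystemPluginSketch is a stand-in; the real macro hides the 12 version parameters):

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for the real registration call, shown with the
// three version parameters for just one of the four structures.
static void RegisterFilesystemPluginSketch(const char* scheme, int api, int abi,
                                           size_t size,
                                           const TF_RandomAccessFileOps* ops) {}

static void MyCleanup(TF_RandomAccessFile* file) { /* free plugin state */ }
static int64_t MyRead(const TF_RandomAccessFile* file, uint64_t offset, size_t n,
                      char* buffer, TF_Status* status) {
  return 0;  // a real plugin fills `buffer` and reports errors via `status`
}

// The symbol TensorFlow Core looks up after loading the shared object.
extern "C" void tf_InitPlugin() {
  static TF_RandomAccessFileOps raf_ops = {MyCleanup, MyRead};
  // In real code, TF_REGISTER_FILESYSTEM_PLUGIN("myfs", ...) fills in the
  // API/ABI/size triples for all four structures automatically.
  RegisterFilesystemPluginSketch("myfs", TF_RANDOM_ACCESS_FILE_OPS_API,
                                 TF_RANDOM_ACCESS_FILE_OPS_ABI,
                                 TF_RANDOM_ACCESS_FILE_OPS_SIZE, &raf_ops);
}
```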

  • When TensorFlow Core receives this function call,

  • it does the following steps.

  • It checks that the scheme argument is a valid string--

  • it must not be empty or a null pointer.

  • It checks that the ABI numbers that the plugin says

  • it was compiled against match the ABI numbers that TensorFlow

  • Core was compiled against.

  • If there is a mismatch between the two,

  • we cannot load the plugin because ABI compatibility is

  • broken.

  • Then, TensorFlow Core checks the API numbers.

  • If there is a mismatch between API number

  • that the plugin says it was compiled against

  • and the API number that the TensorFlow Core was compiled

  • against, we can still load the plugin,

  • but we give a warning to the user

  • because some functionality might be missing.

  • We can safely load the plugin because the required methods

  • are already included in the API at the moment.

  • Then, the next step is to validate

  • that the plugin provided all the required methods.

  • For example, if you provide support

  • for creating random access files,

  • you also need to provide support for reading from them.

  • Finally, if all those validations

  • pass, we copy all the function tables

  • that the plugin provided into Core TensorFlow,

  • so we don't always need to go via the interface

  • to the library.

  • And then we initialize and register the file system

  • in TensorFlow, and then everyone else can use it.
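
In sketch form, the sequence Core runs looks like this (helper and parameter names here are made up for illustration):

```cpp
// Hypothetical condensation of the checks described above.
bool ValidatePluginSketch(const char* scheme, int plugin_abi, int core_abi,
                          int plugin_api, int core_api,
                          bool has_new_random_access_file, bool has_read) {
  // 1. The scheme must be a valid string: not null, not empty.
  if (scheme == nullptr || scheme[0] == '\0') return false;

  // 2. ABI numbers must match exactly, or the plugin cannot be loaded.
  if (plugin_abi != core_abi) return false;

  // 3. API numbers may differ: the plugin still loads, but we warn
  //    the user that some functionality might be missing.
  if (plugin_api != core_api) { /* emit a warning */ }

  // 4. Required methods must come in consistent groups: a plugin that
  //    can create random access files must also be able to read them.
  if (has_new_random_access_file && !has_read) return false;

  // 5. On success, Core copies the function tables and registers the
  //    file system so that everyone else can use it.
  return true;
}
```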

  • All the middle level and high level APIs

  • can still function transparently with these changes.

  • They don't need to change at all to convert

  • to the modular TensorFlow world.

  • As I mentioned, we also provide an extensive testing suite

  • where we create a structure--

  • a layout in the directory that we are testing--

  • and then we run an operation.

  • For example, in this test, we are creating a directory.

  • And then we try to determine the file size of that directory.

  • Of course, this should fail because the file size should

  • only be returned if the path that you asked for is a file.

  • So that's why we expect this test to fail.

  • If a file system doesn't support directories,

  • this test should fail earlier, with creating

  • the directory being reported as not supported.
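
In gtest style, such a test looks roughly like this (a sketch in the spirit of the real test suite, not a verbatim copy):

```cpp
#include <string>

#include "tensorflow/core/lib/core/status_test_util.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/platform/test.h"

TEST(FileSystemSketchTest, GetFileSizeFailsOnDirectory) {
  tensorflow::Env* env = tensorflow::Env::Default();
  const std::string dir = "/tmp/fs_test_dir";

  // Layout: a single directory. A file system without directory
  // support would already fail here with "not supported".
  TF_ASSERT_OK(env->CreateDir(dir));

  // File sizes are only defined for files, so asking for the size
  // of a directory must fail.
  tensorflow::uint64 size = 0;
  EXPECT_FALSE(env->GetFileSize(dir, &size).ok());
}
```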

  • We have around 2,000 lines of testing code, which

  • I think are testing all the corner cases

  • that file systems can get into.

  • Of course, when you add the new API,

  • then that would require adding more tests.

  • As a result of testing the POSIX file system

  • implementation, we identified 23 bugs

  • where, for example, you can create a file

  • to read from where the path is actually a directory.

  • The creation of the file succeeds,

  • but then when you try to read from it, the reading fails.

  • Or you can create a directory from a file.

  • As long as you don't add new files to that directory

  • or read from them, the Python API would say, yeah, sure,

  • you can create it.

  • When you try to create something else in that directory,

  • that's when it will fail.

  • Also, we have this FileExists API,

  • which doesn't differentiate at the moment between files

  • and directories.

  • So your path can be a path to a directory

  • and FileExists will still say, yes, that exists and is a file.

  • So in a lot of places, after FileExists,

  • we added an IsDirectory check

  • to make sure that the path that you are expecting to be

  • a directory is indeed a directory.
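
A minimal sketch of that pattern:

```cpp
#include <string>

#include "tensorflow/core/platform/env.h"

// FileExists succeeds for directories too, so when a directory is
// expected, pair it with an IsDirectory check.
bool DirectoryReallyExists(tensorflow::Env* env, const std::string& path) {
  return env->FileExists(path).ok() && env->IsDirectory(path).ok();
}
```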

  • And implementing all of this interface and testing

  • was a good C++ learning experience.

  • The status at the moment of the modular file system world

  • is we have the POSIX support complete and tested,

  • and I started working on Windows support,

  • hoping to finish it by the end of the year.

  • And then the other file systems that we support

  • will be handled in cooperation with SIG

  • IO, the special interest group, as they

  • will be offloaded to their repository

  • and will no longer be in the TensorFlow repository.

  • Once the Windows support is finalized,

  • I will send an email with instructions

  • to developers at TensorFlow on how to test the modular file

  • system support.

  • Once all file systems that are listed here are converted,

  • I'm going to make a change that flips

  • a flag, and all the file system support

  • that TensorFlow provides will be converted to the modular world.

  • Besides this, there are some more plans for future features.

  • So for example, there are some corner cases

  • in the glue implementation for Python

  • where the file systems are not consistent with the normal

  • Python file API.

  • There are also APIs in C++, high level APIs in C++ which

  • reimplement some low level API.

  • For example, for dumping tensors to a file

  • for debugging them later, the creator of that API

  • needed a way to recursively create directories.

  • At the time, the file system support in TensorFlow

  • didn't provide that functionality,

  • so the creator just implemented their own recursive

  • directory creation inside the high level API.

  • What we need to do in the future is

  • to clean up this to only have the layered approach

  • as was presented here.
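
That low level functionality exists today as Env::RecursivelyCreateDir, so the cleanup is mostly about routing callers to it (minimal sketch):

```cpp
#include "tensorflow/core/platform/env.h"

void EnsureDebugDir(tensorflow::Env* env) {
  // Creates every missing directory along the path, like `mkdir -p`.
  TF_CHECK_OK(env->RecursivelyCreateDir("/tmp/debug/tensors/run1"));
}
```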

  • Finally, we want to deprecate the Env class to separate

  • the FileSystem implementation from any other concerns

  • that that class has.

  • And at the end, after flipping to the modular TensorFlow

  • world, we want to deprecate all the APIs that

  • are not using this framework.

  • And I think this is all that I wanted to talk about.

  • So now it's open for questions.

  • Also, this is slide 42.

  • Like, the answer to everything.

  • [APPLAUSE]

  • AUDIENCE: So am I missing something in terms of, like,

  • at the beginning, why do we need to have our own file system

  • versus all users using the normal Python or C++ code to read

  • or write and do all these things?

  • Can you explain?

  • MIHAI MARUSEAC: Yes.

  • So we have all of these requirements for TensorFlow use

  • cases.

  • So reading models, writing them, and so on.

  • We don't want users to always open their files by themselves,

  • write into the files and so on.

  • We want to provide some API for that.

  • We can subclass the Python classes for file operations

  • to create these APIs, but we also

  • need to subclass those classes for every URI scheme

  • that we will support.

  • So-- let's say we want to read images

  • from several locations.

  • If we want to subclass the Python file class to read

  • images with just a single API, we need one subclass

  • for reading images from local disk,

  • one for reading images from a cloud file system,

  • one for reading images from the Hadoop file system, and so on.

  • And whenever you want to add support for your file system,

  • you have to go into all of these classes

  • and add a subclass for every new file system

  • that you want to support.

  • AUDIENCE: I see.

  • So this API is-- oh, if you just use that, that way it

  • will automatically work on all these different platforms.

  • Is this what--

  • MIHAI MARUSEAC: Yeah.

  • Basically, it's a different--

  • AUDIENCE: And for other languages that--

  • MIHAI MARUSEAC: And it's also for other languages, yes,

  • because we are doing it in C++, in the C layer.

  • And whenever you create a new language binding,

  • it's already in there.

  • [MUSIC PLAYING]
