MIHAI MARUSEAC: I am Mihai.
I've been working recently on file systems for TensorFlow.
So I'm going to talk about what TensorFlow
has in terms of file system support and what
new changes are coming in.
First, TensorFlow file system can
be used in Python like this.
So we can create directories,
we can create files, read or write to them,
and we can see what's in a directory.
And you can say, isn't this similar to Python?
And if you compare it with Python,
it mostly looks the same--
just some names are changed.
There is one difference in mkdir:
in Python, the directory must not already exist,
but there is a flag that changes this.
Even with that said, they still look similar.
You might ask yourself, why does TensorFlow need its own file
system implementation?
And one of the main reasons comes from the file systems
that TensorFlow supports and from the format of the file path
parameter.
In TensorFlow, you can pass a normal path to a file,
or you can pass something that looks like a URL--
these are usually called URIs,
uniform resource identifiers.
And they are divided into three parts:
the scheme part, like HTTPS, GS, S3, and so on;
the host, if it's on a remote host;
and then the path, which is like a normal file
path on that host.
For TensorFlow, we support multiple schemes.
And what we have is a mapping between schemes and a file
system implementation.
For example, for file, we have a LocalFileSystem.
For GS, we have a GoogleCloudFileSystem.
For viewfs, we have the HadoopFileSystem, and so on.
Keep in mind this mapping because this
is the core aspect of why TensorFlow needs to have
its own file system implementation.
However, this is not the only case.
We have a lot of use cases in TensorFlow code.
So besides the basic file I/O that I showed
in the first example, we also need to save or load models,
we need to checkpoint, we need to dump tensors
to files for debugging, we need to parse images
and other inputs, and tf.data datasets
also touch the file systems.
All of this can be implemented in classes,
but the best way to implement them
would be in a layered approach.
You have the base layer where you
have the mapping between the scheme and the file system
implementation.
And then at the top, you have implementations
for each of these use cases.
And in this talk, I'm going to present all of these layers.
But keep in mind that these layers are only
for the purposes of this presentation.
They grew organically.
They are not layered the same way in code.
So let's start with the high level API, which
is mostly what the users see.
It's what they want to do.
So when the user wants to load a saved model,
the user wants to open the file that
contains the saved model, read from it, load
the tensors, and everything.
The user would just call a single function.
That's why I'm calling this high level API.
And in this case, I am going to present some examples of this.
For example, API generation.
Whenever you are building TensorFlow--
while building the pip package--
we are creating several protocol buffer
files that contain the API symbols that TensorFlow
exports.
In order to create those files, we
are basically calling this function CreateApiDefs,
which needs to dump everything into those files.
Another example of high level API
is DebugFileIO where you can dump tensors into a directory
and then you can later review them
and debug your application.
And there are a few more, like loading saved models.
You see that loading saved model needs an export
directory to read from.
And many others-- checkpointing, checkpointing
of sliced variables across distributed replicas,
tf.data datasets, and so on.
The question is not how many APIs
are available at the high level, but what
do they have in common?
In principle, we need to write to files, read from files,
get statistics about them, create and remove directories,
and so on.
But we also need to support compression.
We also need to support buffered I/O,
like reading only a part of the file,
and then later reading another part of the file instead
of fitting everything in memory, and so on.
Most of these implementations come
from the next layer, which I am going
to call the convenience API.
It's similar to a middleware layer in a web application.
So it's mostly transforming from the bytes that
are on the disk to the information
that the user would use in the high level API.
Basically, 99% of the use cases are just
calling these six functions--
reading or writing a string to a file,
or reading or writing a proto-- either text or binary.
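As a rough idea of what those six helpers look like in C++ (a minimal sketch, assuming the declarations in tensorflow/core/platform/env.h; the paths here are purely illustrative):

```cpp
#include "tensorflow/core/framework/graph.pb.h"
#include "tensorflow/core/platform/env.h"

void ConvenienceApiSketch() {
  tensorflow::Env* env = tensorflow::Env::Default();

  // Read a whole file into a string, then write it back somewhere else.
  // Any registered scheme works for the path, not just local files.
  std::string contents;
  TF_CHECK_OK(tensorflow::ReadFileToString(env, "gs://some-bucket/config.txt",
                                           &contents));
  TF_CHECK_OK(tensorflow::WriteStringToFile(env, "/tmp/config_copy.txt",
                                            contents));

  // Read and write protos, either binary or text.
  tensorflow::GraphDef graph;
  TF_CHECK_OK(tensorflow::ReadBinaryProto(env, "/tmp/model.pb", &graph));
  TF_CHECK_OK(tensorflow::WriteTextProto(env, "/tmp/model.pbtxt", graph));
}
```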
However, there are other use cases,
like I mentioned before, for streaming
and for buffered and compressed I/O, where we have this
InputStreamInterface class that implements streaming
or compression.
So we have the SnappyInputBuffer and ZlibInputBuffer,
which read from compressed data; MemoryInputStream
and BufferedInputStream, which read in a streamed fashion;
and so on.
The InputBuffer class here allows
you to read just a single int from a file,
and then you can read another int, and then a string,
and so on.
Like, you read chunks of data.
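For instance, a minimal sketch of that chunked reading (assuming the io::InputBuffer API from tensorflow/core/lib/io/inputbuffer.h; exact signatures may differ across versions):

```cpp
#include "tensorflow/core/lib/io/inputbuffer.h"
#include "tensorflow/core/platform/env.h"

void ReadLineByLine(const std::string& path) {
  std::unique_ptr<tensorflow::RandomAccessFile> file;
  TF_CHECK_OK(tensorflow::Env::Default()->NewRandomAccessFile(path, &file));

  // Buffer 64 KiB at a time instead of loading the whole file in memory.
  tensorflow::io::InputBuffer in(file.get(), 64 << 10);
  std::string line;
  while (in.ReadLine(&line).ok()) {
    // Process one chunk (here: one line) at a time.
  }
}
```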
All of these APIs at the convenience level
are implemented in the same way in the next layer, which
is the low level API.
And that's the one that we are mostly interested in.
Basically, this level is the one that needs to support multiple
platforms, supports all the URI schemes that we currently
support, has to support directory I/O-- so far,
I never talked about directory operations in the higher level
APIs--
and also supports users who can get into this level
and create their own implementations in case
they need something that is not implemented so far.
If you remember from the beginning,
we had this file system registry where
you had the scheme of the URI and a mapping
between that scheme and the file system implementation.
This is implemented as a FileSystem registry
class, which is basically a dictionary.
You can add a value to the dictionary,
you can see what value is at a specific key,
or you can see all the keys that are in that dictionary.
That's all this class does.
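Roughly, the shape of that dictionary class (paraphrased from tensorflow/core/platform/file_system.h; the real interface may differ slightly between versions):

```cpp
// A scheme -> file system factory dictionary, paraphrased.
class FileSystemRegistry {
 public:
  typedef std::function<FileSystem*()> Factory;

  // Add a value for a key: register a factory under a URI scheme.
  virtual Status Register(const std::string& scheme, Factory factory) = 0;
  // Look up the value at a key: which implementation handles this scheme?
  virtual FileSystem* Lookup(const std::string& scheme) = 0;
  // Enumerate all keys: every scheme currently registered.
  virtual Status GetRegisteredFileSystemSchemes(
      std::vector<std::string>* schemes) = 0;

  virtual ~FileSystemRegistry() {}
};
```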
And it is used in the next class,
the environment class--
Env-- which provides the cross-platform support.
So we have a WindowsEnv and a PosixEnv:
WindowsEnv when you are compiling
and using TensorFlow on Windows,
PosixEnv when you are using it on the other platforms.
And then there are some other Env classes for testing,
but let's ignore them for the rest of the talk.
The purpose of the Env class is to provide every low level
API that the user needs.
So, for example, we have the registration part-- in this case,
getting the file system for a file,
getting all the schemes that are supported,
registering a file system.
And of particular note is the static member Default,
which allows a developer to write Env::Default() anywhere
in the C++ code and get access to this class.
Basically, it's like a singleton pattern.
So if you need to register a file system somewhere
in your function and it's not registered,
you just call Env::Default() and register the file system.
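A minimal sketch of that pattern (assuming the Env methods from tensorflow/core/platform/env.h; the commented-out registration and MyFileSystem are illustrative, not a real class):

```cpp
#include "tensorflow/core/platform/env.h"

void EnvSingletonSketch() {
  tensorflow::Env* env = tensorflow::Env::Default();  // the singleton

  // Which URI schemes are currently registered?
  std::vector<std::string> schemes;
  TF_CHECK_OK(env->GetRegisteredFileSystemSchemes(&schemes));

  // Which file system implementation will handle this particular path?
  tensorflow::FileSystem* fs = nullptr;
  TF_CHECK_OK(env->GetFileSystemForFile("gs://some-bucket/file.txt", &fs));

  // Registering a new file system from anywhere in the code:
  // env->RegisterFileSystem("myfs", []() { return new MyFileSystem; });
}
```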
Other functionality in Env is the actual file system
operations, like creating files.
You see there are three types of files.
So random access files, writable files,
and read-only memory regions.
The read-only memory regions are files
that are mapped in memory on a memory page,
and then you can just read directly from memory.
There are two ways to write to files.
Either you overwrite the entire contents,
or you append at the end.
So that's why you have two constructors
for writable files-- the NewWritableFile
and the NewAppendableFile.
More functionalities in Env are creating or removing
directories, moving files around, basically everything
that is directory operations.
Furthermore, the next ones are determining the files
that exist in a directory, determining all the files that
match a specific pattern, or getting information
about a specific path entry--
if it exists, if it is a directory, what is its size,
and so on.
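Put together, a small sketch of these Env-level operations (method names taken from tensorflow/core/platform/env.h; paths and details are illustrative):

```cpp
#include "tensorflow/core/platform/env.h"

void EnvOperationsSketch() {
  tensorflow::Env* env = tensorflow::Env::Default();

  // Directory creation (recursively, like mkdir -p).
  TF_CHECK_OK(env->RecursivelyCreateDir("/tmp/demo"));

  // Two ways to write: NewWritableFile overwrites, NewAppendableFile appends.
  std::unique_ptr<tensorflow::WritableFile> out;
  TF_CHECK_OK(env->NewWritableFile("/tmp/demo/log.txt", &out));
  TF_CHECK_OK(out->Append("hello\n"));
  TF_CHECK_OK(out->Close());

  // Listing a directory, matching a pattern, and stat-ing a path.
  std::vector<std::string> children, matches;
  TF_CHECK_OK(env->GetChildren("/tmp/demo", &children));
  TF_CHECK_OK(env->GetMatchingPaths("/tmp/demo/*.txt", &matches));
  tensorflow::FileStatistics stat;
  TF_CHECK_OK(env->Stat("/tmp/demo/log.txt", &stat));  // length, mtime, is_directory
}
```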
All of these are implemented by each individual file system,
but I'm going to get to that soon.
There is other functionality that Env contains,
but it is out of scope for this talk.
So Env also supports threading, supports an API for a clock,
getting information about your time, loading shared libraries,
and so on.
We are not concerned about this, but I just
wanted to mention them for completeness.
As I mentioned, there are three different types
that we support--
the random access file, the writable file,
and the read-only memory region, and this is the current API
that they support.
The first two files have a name, and then they have operations,
like read/write.
And the memory region, since it's already mapped in memory,
you don't need a name for it.
You only need to see how long it is
and what is the data in there.
These three files, I'm going to come back to them later.
That's why I wanted to mention them here.
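Paraphrasing their current shape (simplified from tensorflow/core/platform/file_system.h; the real classes have a few more members, and signatures vary between versions):

```cpp
class RandomAccessFile {
 public:
  // Read up to `n` bytes starting at `offset`; `scratch` is caller-provided
  // storage and `result` ends up pointing at the bytes that were read.
  virtual Status Read(uint64 offset, size_t n, StringPiece* result,
                      char* scratch) const = 0;
};

class WritableFile {
 public:
  virtual Status Append(StringPiece data) = 0;
  virtual Status Flush() = 0;
  virtual Status Sync() = 0;
  virtual Status Close() = 0;
};

class ReadOnlyMemoryRegion {
 public:
  // Already mapped in memory, so no name or offsets: just the data and length.
  virtual const void* data() = 0;
  virtual uint64 length() = 0;
};
```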
Finally, the last important class
is the FileSystem class, which actually
does the real implementation.
So this is where we implement how
to create a file, how to read from a file,
and everything else.
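As a sketch, a concrete implementation is a subclass that overrides these operations (MyToyFileSystem is a made-up name; the real FileSystem base class has more methods than shown here):

```cpp
class MyToyFileSystem : public FileSystem {  // hypothetical example
 public:
  Status NewRandomAccessFile(
      const std::string& fname,
      std::unique_ptr<RandomAccessFile>* result) override;
  Status NewWritableFile(const std::string& fname,
                         std::unique_ptr<WritableFile>* result) override;
  Status FileExists(const std::string& fname) override;
  Status GetChildren(const std::string& dir,
                     std::vector<std::string>* result) override;
  Status CreateDir(const std::string& dirname) override;
  Status Stat(const std::string& fname, FileStatistics* stats) override;
  // ...plus DeleteFile, DeleteDir, GetFileSize, RenameFile, and so on.
};
```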
In order to support TensorFlow in other languages,
we also need to provide the C API interface that language
bindings can link against.
And this is very simple at the moment.
It's just providing the same functions
that we already saw, except they use C types
and they use some other markers in the signature
to mark that they are symbols exported in a shared library.
This C API interface is not complete.
So for example, it doesn't support random access files.
So it doesn't support reading from files
from other languages, except if that language binding goes
directly over the FileSystem class
that I showed you in a previous slide.
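To give a feel for it, the C side looks roughly like this (paraphrased from tensorflow/c/env.h as I understand it; exact names and signatures may differ, and note there is no random access file here):

```cpp
// Exported with C linkage so that language bindings can link against it.
TF_CAPI_EXPORT extern void TF_CreateDir(const char* dirname, TF_Status* status);
TF_CAPI_EXPORT extern void TF_DeleteFile(const char* filename, TF_Status* status);

// Writable files get an opaque handle instead of a C++ object.
TF_CAPI_EXPORT extern void TF_NewWritableFile(const char* filename,
                                              TF_WritableFileHandle** handle,
                                              TF_Status* status);
TF_CAPI_EXPORT extern void TF_AppendWritableFile(TF_WritableFileHandle* handle,
                                                 const char* data, size_t length,
                                                 TF_Status* status);
TF_CAPI_EXPORT extern void TF_CloseWritableFile(TF_WritableFileHandle* handle,
                                                TF_Status* status);
```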
OK.
This is all about the file system support
that exists in the current implementation.
However, there is now work to modernize the TensorFlow file
system support in order to reduce our complexity.
And when I'm speaking about complexity,
I am thinking about this diagram where
you have the FileSystem class that I showed you,
and all of these implementations of it.
So we have support for POSIX, support for Windows,
support for Hadoop, S3, GCS, and many others,
and then a lot of test file systems.
Each one of them is implemented in a different way.
Some of them are not compatible.
So some of them follow some guidelines, others don't.
But the main thing is, whenever you build a package,
you need to compile all of this into the binary.
Whenever you compile something that
needs access to the file system,
you need to compile all of this.
That's something that we want to avoid in the future,
and this is what we try to solve with the modular
TensorFlow approach.
So indeed, this is the diagram of the world
we want to live in.
We want to have Core TensorFlow--
which has a plugin registry, which
is basically the file system registry that I showed before--
and we want to also have plugins that implement
file system functionality.
If you don't need to support Hadoop file system,
you won't need to compile the Hadoop file system
plugin and so on.
The plugin registry, as I said, is similar to the file system
registry.
So you have the name of the plug-in or the scheme
that you want to implement mapped to the plugin that
implements the file system.
So to summarize, the modular file system
goals are to reduce the compile time, since you no longer need
to provide support for all the file systems
that we are currently supporting--
you only compile the ones that you need;
we also want to provide a full complete C API
interface instead of the one that is provided at the moment.
We also want to provide an extensive test
suite for all the file systems.
As soon as somebody develops a new file system,
they can run our test suite and see where they fail.
So we have a lot of preconditions
and postconditions that the file system operations need
to satisfy, implemented in this test suite,
and whenever somebody implements a new file system,
they just test against that.
Furthermore, because each developer
can create their own file system and that
is no longer going to be compiled by TensorFlow,
we also need to provide some version guarantees.
When we change our TensorFlow version,
we cannot ask every file system developer to also recompile
their code.
That's why we also need to provide these guarantees.
OK, so let's now see how a plugin is going
to be loaded into TensorFlow.
As I said before, Env has an API that loads
a shared library from disk.
It can either load all the shared objects that
are in some specific directory at TensorFlow runtime startup,
or a user can request TensorFlow to load the shared object
from a specific path.
In both cases, as soon as the shared object is loaded,
TensorFlow Core is going to look for the TF_InitPlugin symbol.
That's a symbol that the file system
plugin needs to implement, because TensorFlow
is going to call it.
This function is where the plugins
register all of their implementations
for the file system and send them to TensorFlow.
We provide an API interface for plugins
that they need to follow.
And this interface has structures
with function pointers for every piece of functionality
that we currently provide.
So we have three file types,
so we're going to have three function
tables for their operations.
And since we have one FileSystem class,
this interface is going to have one more function
table for the operations that the FileSystem
class must support.
All of these functions here are documented.
They list the preconditions
and the postconditions, and everything here is just C.
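For example, one of those function tables looks roughly like this (field names paraphrased from the proposed plugin interface; treat them as illustrative rather than the exact header):

```cpp
// Operations that a plugin provides for its random access files.
typedef struct TF_RandomAccessFileOps {
  // Releases any resources the plugin holds for this file.
  void (*cleanup)(TF_RandomAccessFile* file);
  // Reads up to `n` bytes starting at `offset` into `buffer`; returns the
  // number of bytes actually read, reporting errors through `status`.
  int64_t (*read)(const TF_RandomAccessFile* file, uint64_t offset, size_t n,
                  char* buffer, TF_Status* status);
} TF_RandomAccessFileOps;
```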
Then the next section that we have in our API--
in the interface that the plugins must implement--
is the metadata for versioning and compatibility.
For every structure-- so those three structures for the files
and one structure for the file system--
we have three numbers.
One is the API number, and the other one is the ABI number--
application binary interface-- and then the third one is
the total size of the function table--
the total number of function pointers that we
provide for that structure.
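So, schematically, next to every table there is something like this (the constant names here are mine, purely for illustration):

```cpp
// Three numbers per operations table, used for compatibility checks.
constexpr int kRandomAccessFileOpsAbi = 0;   // bumped on binary-layout breaks
constexpr int kRandomAccessFileOpsApi = 0;   // bumped when functionality is added
constexpr size_t kRandomAccessFileOpsSize = sizeof(TF_RandomAccessFileOps);
```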
We have two numbers-- the API and the ABI-- in order
to cover cases where we accidentally
break ABI compatibility.
For example-- if I go back-- if you reorder
the offset and the n parameters in the read method,
that is an ABI-breaking change,
because any code that is going to call that function
would expect the offset to be the second parameter
of the function and the number of bytes to read to be the first one.
If you swap them, the code is not going to behave properly.
This is breaking the binary compatibility.
For API compatibility, if you add a new method
to the random access file, that changes the API number.
OK.
And after the plugins fill in the data structures
with their own implementations, they
are going to call this function RegisterFilesystemPlugin.
However, you see that RegisterFilesystemPlugin
has a lot of parameters.
It has three metadata values
for each of the four structures, so that's 12 parameters at the beginning,
and then the structures with the operations.
Because we don't want the plugin authors to manually fill
in all of these parameters, we provide
this TF_REGISTER_FILESYSTEM_PLUGIN
macro to which you only pass the structures
that you care about--
the structures that implement the file system.
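Putting the pieces together, a plugin skeleton would look something like this (a hypothetical sketch: the scheme name, function names, struct names, and the exact macro arguments are illustrative; only TF_InitPlugin and TF_REGISTER_FILESYSTEM_PLUGIN are the names from the talk):

```cpp
// Plugin-side implementations, filled into the function tables below.
static void MyCleanup(TF_RandomAccessFile* file) { /* free plugin state */ }
static int64_t MyRead(const TF_RandomAccessFile* file, uint64_t offset,
                      size_t n, char* buffer, TF_Status* status) {
  /* ...read n bytes at offset into buffer... */
  return -1;  // stub
}

// Called by TensorFlow Core right after the shared object is loaded.
void TF_InitPlugin() {
  static TF_RandomAccessFileOps random_access_file_ops = {MyCleanup, MyRead};
  static TF_WritableFileOps writable_file_ops = {/* append, flush, close... */};
  static TF_ReadOnlyMemoryRegionOps memory_region_ops = {/* data, length... */};
  static TF_FilesystemOps filesystem_ops = {/* new files, create_dir... */};

  // The macro fills in the ABI/API numbers and table sizes before calling
  // RegisterFilesystemPlugin with all twelve of them plus the tables.
  TF_REGISTER_FILESYSTEM_PLUGIN("myfs", &filesystem_ops, &random_access_file_ops,
                                &writable_file_ops, &memory_region_ops);
}
```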
When TensorFlow Core receives this function call,
it does the following steps.
It checks that the scheme argument is a valid string,
so it must not be empty or null pointer.
It checks that the ABI numbers that the plugin says
it was compiled against match the ABI numbers that TensorFlow
Core was compiled against.
If there is a mismatch between the two,
we cannot load the plugin because ABI compatibility is
broken.
Then, TensorFlow Core checks the API numbers.
If there is a mismatch between the API number
that the plugin says it was compiled against
and the API number that TensorFlow Core was compiled
against, we can still load the plugin,
but we give a warning to the user
because some functionality might be missing.
We can safely load the plugin because the required methods
are already included in the API at the moment.
Then, the next step is to validate
that the plugin provided all the required methods.
For example, if you provide support
for creating random access files,
you also need to provide support for reading from them.
Finally, if all those validations
pass, we copy all the function tables
that the plugin provided into Core TensorFlow,
so we don't always need to go via the interface
to the library.
And then we initialize and register the file system
in TensorFlow, and then everyone else can use it.
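In pseudo-C++, the checks amount to something like this (illustrative only; this is not the actual Core code and the names are made up):

```cpp
#include <cstddef>
#include <iostream>
#include <string>

struct PluginMetadata {
  int api_number;
  int abi_number;
  std::size_t ops_size;
};

bool ValidateAndRegister(const std::string& scheme,
                         const PluginMetadata& plugin,
                         const PluginMetadata& core) {
  // 1. The scheme must be a valid, non-empty string.
  if (scheme.empty()) return false;

  // 2. ABI numbers must match exactly; otherwise loading is unsafe.
  if (plugin.abi_number != core.abi_number) return false;

  // 3. API numbers may differ: we still load, but warn the user because
  //    some functionality might be missing.
  if (plugin.api_number != core.api_number) {
    std::cerr << "Warning: API mismatch for scheme '" << scheme << "'\n";
  }

  // 4. Validate required methods (e.g., if random access files can be
  //    created, reading from them must be provided), then copy the function
  //    tables into Core and register the file system under `scheme`.
  return true;
}
```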
All the middle level and high level APIs
can still function transparently with these changes.
They don't need to change at all to convert
to the modular TensorFlow world.
As I mentioned, we also provide an extensive testing suite
where we are creating a structure--
a layout in the directory that we are testing--
and then we run an operation.
For example, in this test, we are creating a directory.
And then we try to determine the file size of that directory.
Of course, this should fail because the file size should
only be returned if the path that you asked for is a file.
So that's why we expect this test to fail.
If a file system doesn't support directories,
this test should fail earlier, with CreateDir being
not supported.
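That test looks roughly like this (paraphrased from the modular file system test suite; the fixture and helper names are simplified):

```cpp
TEST_P(ModularFileSystemTest, GetFileSizeFailsOnDirectory) {
  const std::string dirpath = GetURIForPath("a_dir");
  Status status = env_->CreateDir(dirpath);
  if (!status.ok()) GTEST_SKIP() << "CreateDir() not supported: " << status;

  uint64 file_size;
  status = env_->GetFileSize(dirpath, &file_size);
  // Asking for the file size of a directory must not succeed.
  EXPECT_FALSE(status.ok());
}
```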
We have around 2,000 lines of testing code, which
I think are testing all the corner cases
that file systems can get into.
Of course, when you add a new API,
that will require adding more tests.
As a result of testing the POSIX file system
implementation, we identified 23 bugs
where, for example, you can create a file
to read from where the path is actually a directory.
The creation of the file succeeds,
but then when you try to read from it, the reading fails.
Or you can create a directory over an existing file.
As long as you don't add new files to that directory
or read from them, the Python API would say, yeah, sure,
you can create it.
When you try to create something else in that directory,
that's when it will fail.
Also, we have this FileExists API,
which doesn't differentiate at the moment between files
and directories.
So your path can be a path to a directory
and FileExists will still say, yes, that exists and is a file.
So in a lot of places, after FileExists,
we added a check
to make sure that the path that you are expecting to be
a directory really is a directory.
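The workaround pattern, as a minimal sketch (I am assuming Env::FileExists and Env::IsDirectory here; the path is illustrative):

```cpp
#include "tensorflow/core/platform/env.h"

bool LooksLikeUsableDirectory(const std::string& path) {
  tensorflow::Env* env = tensorflow::Env::Default();
  // FileExists alone is not enough: it also returns OK for directories,
  // and for files when we actually wanted a directory.
  return env->FileExists(path).ok() && env->IsDirectory(path).ok();
}
```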
And implementing all of this interface and the testing was
a good way of learning C++.
The status at the moment of the modular file system world
is we have the POSIX support complete and tested,
and I started working on Windows support,
hoping to finish it by the end of the year.
And then the other file systems that we support
will be handled in cooperation with the SIG
IO, the special interest group, as they
will be offloaded to their repository
and will no longer be in the TensorFlow repository.
Once the Windows support is finalized,
I will send an email with instructions
to developers at TensorFlow on how to test the modular file
system support.
Once all file systems that are listed here are converted,
I'm going to make a change that flips
a flag, and everything-- all the file system support
that TensorFlow provides-- will be converted to the modular world.
Besides this, there are some more plans for future work.
So for example, there are some corner cases
in the glue implementation for Python
where the file systems are not consistent with the normal
Python file API.
There are also high level APIs in C++ which
reimplement some of the low level API.
For example, for dumping tensors to a file
for debugging them later, the creator of that API
needed a way to recursively create directories.
At the time, the file system support in TensorFlow
didn't provide that functionality,
so the creator just implemented his own recursive directory
creation inside the high level API.
What we need to do in the future is
to clean this up to only have the layered approach
as presented here.
Finally, we want to deprecate the Env class to separate
the FileSystem implementation from any other concerns
that that class has.
And at the end, after flipping to the modular TensorFlow
world, we want to deprecate all the APIs that
are not using this framework.
And I think this is all that I wanted to talk about.
So now it's open for questions.
Also, this is slide 42.
Like, the answer to everything.
[APPLAUSE]
AUDIENCE: So am I missing something in terms of, like,
at the beginning, why do we need to have our own file system
versus all users using the normal Python or C++ code to read
or write and do all these things?
Can you explain?
MIHAI MARUSEAC: Yes.
So we have all of these requirements for TensorFlow use
cases.
So reading models, writing, and so on.
We don't want users to always open their files by themselves,
write into the files and so on.
We want to provide some API for that.
We can subclass the Python classes for file operations
to create these APIs, but we also
need to subclass those classes for every URI scheme
that we want to support.
So we need to have--
let's say we want to read images from several locations.
So if we want to subclass the Python file class to read
images with just a single API, we need one subclass
for reading images from local disk,
one for reading images from a cloud file system,
one for reading images from a Hadoop file system, and so on.
And whenever you want to add support for your file system,
you have to go into all of these classes
and add a subclass for every new file system
that you want to support.
AUDIENCE: I see.
So this API is-- oh, if you just use that, that way it
will automatically work across all these different platforms.
Is this what--
MIHAI MARUSEAC: Yeah.
Basically, it's a different--
AUDIENCE: And for other languages that--
MIHAI MARUSEAC: And it's also for other languages, yes,
because we are doing it in C++ in the C layer.
And whenever you create a new language binding,
it's already in there.
[MUSIC PLAYING]