

  • I'm glad you came to hear about the new parser coming to Node.

  • About who I am, I'm Fedor Indutny.

  • Here's my Twitter and GitHub handles which are the same.

  • And I write code at PayPal.

  • You might know me by this dark avatar I use on GitHub and Twitter and practically everywhere

  • else.

  • For today's talk, the slides and the presentation are already up online.

  • If you would like, you can scan the QR code, open it in your browser, and follow along as

  • I present the topic.

  • Which is llhttp.

  • The new HTTP protocol parser for Node.js.

  • This has deep roots in the history of Node.js.

  • And it would be hard, if not impossible, to describe it without mentioning that history.

  • Of course, Node.js is used a lot for front-end tooling nowadays.

  • But originally historically it started as the backend platform.

  • It was exclusively about building asynchronous HTTP servers.

  • And the HTTP parser was written by the creator of Node; it is what Node started with.

  • There have been presentations on the importance, role, and structure of libuv.

  • It powers the Node.js event loop and has been key ever since the early days of Node.

  • And you might remember it as the oldest dependency; however, it is not.

  • The HTTP parser is the oldest dependency that Node has ever had.

  • It predates all other dependencies in Node.

  • Initially it was inspired by Mongrel, a Ruby web server with its own parser, created by

  • Zed Shaw.

  • And later, parts of the Nginx code were introduced into http_parser, the

  • original one.

  • And, of course, the code itself.

  • The parser has been with us since 2009.

  • It's been more than ten years.

  • And in fact, Node itself was introduced at this very conference ten years ago by

  • Ryan.

  • So, it's kind of a jubilee for both of those projects.

  • And I wrote another HTTP parser to replace the original parser.

  • So, why would anyone want to get rid of such a library?

  • Of course, it's a fantastic library.

  • It has been with us for ten years.

  • And it has worked fine.

  • And it has many users inside of the Node.js community as well.

  • For example, the Envoy proxy.

  • They use it for parsing HTTP requests and it's quite a popular project as well.

  • What makes this parser great and why did it stay?

  • First, good enough performance.

  • It takes quite a bit of time to invoke a C function from JavaScript, and the parser is written in C. Parsing

  • the request itself takes way less time than that call overhead.

  • I'm going to elaborate on it a bit later in this presentation.

  • But for Node.js purposes, it's very good performance.

  • Couldn't be better.

  • It also supports a lot of clients and servers that violate the HTTP specification.

  • There are way too many of them out on the Internet.

  • You would be surprised how many.

  • And, of course, it was very important for the early adoption of Node.js.

  • Because in 2009 there were even more such clients and servers out there.

  • Another point is that the original parser has a lot of tests.

  • So, over ten years Ryan, Ben, and other maintainers of the project, including myself, have written

  • quite a comprehensive test suite that covers almost every aspect of the HTTP specification.

  • So, the parser is well-tested.

  • So, those were the good points of the original parser.

  • Now we come to the not-so-good points.

  • Unfortunately, the code of the HTTP library became quite rigid.

  • It became impossible to move things around, to make significant changes to it.

  • And as a consequence of this, it became impossible to maintain this library efficiently.

  • Furthermore, as one of the maintainers of the project, I had to constantly relearn

  • and get familiar with parts of the codebase that I had been familiar with before.

  • And I did this on every pull request, and still I wasn't sure if the code was going to run

  • the way I expected it to run.

  • So, it could introduce some unexpected behavior.

  • Or maybe a security vulnerability as well.

  • Which is obviously bad.

  • It doesn't help either that most Node.js users and developers are familiar with JavaScript

  • and are more comfortable with it than they are with C. So, there were not many people

  • interested in working on this HTTP parser.

  • With all this in mind, several years ago I set out on a quest to make the library better

  • and maybe a bit faster in the process.

  • First attempts were quite conservative.

  • So, I tried to stay with existing code as much as possible.

  • And some of them were successful.

  • Like replacing the parser state transitions with a macro; using it consistently not only

  • improved the code, but also made it faster.

  • Which was nice.

  • Other attempts were a complete disaster.

  • I tried to move the states into separate functions and just call them from the loop,

  • so each state function would return the next state that was supposed to be executed, and then

  • the loop would call the function for that state.

  • This completely ruined the performance, and I never contributed it or sent it upstream.

  • Whether these attempts were successes or failures, the conclusion was clear.

  • It was hard to make an improvement while staying with the existing codebase.

  • Once there was no requirement to reuse as much code as possible, it was no longer required to

  • write it in the same programming language as before.

  • Why not develop it in JavaScript or maybe TypeScript instead?

  • And, of course, JavaScript performance is quite great.

  • But you wouldn't be surprised that it's slower than C, and it takes a lot of effort to reach

  • performance comparable to C libraries when you write programs in JavaScript.

  • Takes a lot of effort.

  • But it's possible.

  • So, I wanted to take this performance concern out of consideration.

  • And so, I decided to define the parser in TypeScript and still compile it down to

  • a C library.

  • So, the end result would be a C library that could be used in Node and other projects.

  • Which was great, because this way existing users of http_parser would be able to transition

  • their code to the new parser, and hopefully the performance would not

  • degrade much, because in the end the output is the same programming language.

  • It would have good chances of being the same speed.

  • So, llhttp is the next major version of HTTP parser.

  • It has the same core principles and similar API, which is almost identical.

  • And the way they work is that they scan one character at a time.

  • And during the process they change their internal state, and they can emit header fields,

  • header values, and later on the body.
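A minimal sketch, not llhttp's actual code, of what such a character-at-a-time parser looks like: it keeps its state across chunk boundaries, so it can accept input as soon as any amount of data arrives. (For clarity this toy version copies characters into strings, which the real parser avoids.)

```typescript
// Toy header parser: scans one character at a time and keeps state
// across chunks, so partial input is fine.

type State = "header_name" | "header_value";

class TinyHeaderParser {
  private state: State = "header_name";
  private name = "";
  private value = "";
  readonly headers: Record<string, string> = {};

  // Feed any amount of data, even one byte at a time.
  execute(chunk: string): void {
    for (const ch of chunk) {
      if (this.state === "header_name") {
        if (ch === ":") this.state = "header_value";
        else this.name += ch;
      } else {
        if (ch === "\n") {
          this.headers[this.name] = this.value.trim();
          this.name = "";
          this.value = "";
          this.state = "header_name";
        } else {
          this.value += ch;
        }
      }
    }
  }
}
```

Feeding "Conn" and then "ection: keep-alive\n" produces the same result as feeding the whole line at once.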

  • I'm not sure if we're going to wait for this.

  • It's quite slow.

  • Okay.

  • Yeah.

  • I think you probably understand what it means now.

  • At least to some extent.

  • So, by virtue of this single scan, the parser can work without buffering anything

  • at all.

  • So, it doesn't allocate memory itself.

  • And this is great especially for request bodies, because it can just emit slices of the original

  • buffers that came from the network instead of allocating and copying data.

  • So, the core principle of the HTTP parser is: no copying.

  • That's important.

  • And as soon as any amount of data arrives, even a single byte of a request,

  • the HTTP parser is ready to process it.

  • And the result will possibly be partial, because it may be half of a request.

  • Like the header names we are just seeing in this animation.

  • In the original version of the parser, this scan was quite naturally implemented as a for-loop

  • over the input.

  • Just going over the input byte by byte and doing some things.

  • What it did was described by a huge switch statement over all possible parsing states.

  • Whether it's a header name, a header value, or the value of the Content-Length header,

  • it was all described by the switch statement, and the cases represented different states of the

  • state machine.

  • All of this lived in a single function of 1,000 lines of code, which is quite a terrible

  • idea.
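A sketch of that shape, assuming nothing beyond what the talk describes: one for-loop over the input with a single switch statement over parsing states. (The real switch covered every state of HTTP parsing in roughly a thousand lines of C.)

```typescript
// Loop-plus-giant-switch style, shrunk to just the request line.

enum RLState { Method, Url, Version }

function parseRequestLine(line: string): { method: string; url: string } {
  let state = RLState.Method;
  let method = "";
  let url = "";
  for (let i = 0; i < line.length; i++) {
    const c = line[i];
    switch (state) {
      case RLState.Method:
        if (c === " ") state = RLState.Url;
        else method += c;
        break;
      case RLState.Url:
        if (c === " ") state = RLState.Version;
        else url += c;
        break;
      case RLState.Version:
        // HTTP version handling omitted in this sketch.
        break;
    }
  }
  return { method, url };
}
```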

  • So, an obvious improvement would be to break this switch into pieces and make it such that

  • each piece has one precise action.

  • It's sort of the Unix philosophy, I guess.

  • So, it would be exactly about doing one small thing at a time.

  • Goto statements would be used to jump between states.

  • There would not be as much need for the for-loop anymore, at least not as much.

  • With all this in mind, how do I approach this process?

  • I developed a domain-specific language, a DSL, and created a compiler around it, llparse,

  • so, again, double L. And this DSL is used to describe the parsing states in terms

  • of the actions that they perform.

  • So, each state would have several actions assigned to them and they would perform them

  • and move on to the next state.

  • Because of this, llparse is quite a general compiler; it can be used for other protocols

  • as well.

  • It works better for textual protocols, but I think it's useful.

  • Original parser suffered from a surplus of handwritten code.

  • I selected a few actions that were repeated most in the original library and built the DSL around

  • them.

  • The idea here is that I wanted the description of the new parser to be concise.

  • So, I wanted to write the code with as few lines and symbols as possible.

  • I wanted to move the most common operations inside of the compiler so that I wouldn't

  • have to do the work all over again as in the original parser.

  • Here are a few of the actions that the compiler supports.

  • One is match.

  • It takes a sequence of bytes, or a single character, and tries to match it against the input.

  • For example, it could be keep-alive, which is a value of the header named Connection.

  • It's quite a common header and very important.

  • So, it could match this sequence.

  • And when it does, it moves to the next state, given by a reference to that state.
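A toy stand-in for the match action (not llparse's real API): try to match a literal sequence at the head of the input and, on success, consume it so the next state sees the rest of the stream.

```typescript
// Toy "match" action: literal sequence against the head of the input.

function matchLiteral(
  literal: string,
  input: string,
): { matched: boolean; rest: string } {
  if (input.startsWith(literal)) {
    // Consume the matched sequence; control would pass to the next state.
    return { matched: true, rest: input.slice(literal.length) };
  }
  return { matched: false, rest: input };
}

// Matching the common Connection header value:
const matched = matchLiteral("keep-alive", "keep-alive\r\n");
```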

  • And other times the parser needs to check the next character in the incoming data without

  • actually consuming it.

  • The peek action can be used for this.

  • It takes a single character, matches it, and moves on without advancing the input stream.

  • And speaking of headers, headers like Content-Length have integer values, which are naturally described

  • by strings, since HTTP is a textual protocol.

  • So, the new parser has to be able to parse integers out of strings.

  • And the way I decided to do it was to implement a select method in the DSL, which takes a map as

  • an input.

  • And this map has byte sequences or characters as keys,

  • and integers as values.

  • So, it tries to match a sequence, maps it to its integer, and just passes the value along to the next state.
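A sketch of the select idea under a toy API: a map with byte sequences as keys and integers as values; on a match, the key is consumed and the integer is passed along. The numeric values below are illustrative, not necessarily llhttp's real method encoding.

```typescript
// Toy "select" action: map matched sequences to integer values.

const METHOD_MAP: Record<string, number> = {
  DELETE: 0,
  GET: 1,
  HEAD: 2,
  POST: 3,
  PUT: 4,
};

function selectMethod(
  input: string,
): { value: number; rest: string } | null {
  for (const [seq, value] of Object.entries(METHOD_MAP)) {
    if (input.startsWith(seq)) {
      // Consume the key and pass its integer to the next state.
      return { value, rest: input.slice(seq.length) };
    }
  }
  return null; // no key matched; a real parser would go to `otherwise`
}
```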

  • The next state could store the integer inside some property of the parser structure, or maybe invoke

  • a user callback with it.

  • Speaking of callbacks, there is one special type of callback.

  • It is very important in the life span of both original and the new HTTP parser.

  • During their execution they need to report ranges of data.

  • For example, header names and header values are reported this way.

  • And given that we have a stream of data that comes to the parser, we have to be able to

  • mark some certain place inside of this stream as the beginning of the range.

  • And then at some other point later on we want to set it as an ending of the range.

  • So, we have a beginning and an ending of the range.

  • Between them, the callback is going to be invoked for every byte.

  • And it's really, really useful for header names, header values, bodies, and other

  • things that span ranges of the input.
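A sketch of span callbacks under an assumed toy API: mark the start offset of a range, later mark the end, and hand the callback a view into the original buffer, so nothing is copied.

```typescript
// Toy span: callbacks receive offsets into the original buffer.

type SpanCallback = (buf: Uint8Array, start: number, end: number) => void;

class Span {
  private start = -1;
  constructor(private readonly cb: SpanCallback) {}

  begin(offset: number): void {
    this.start = offset;
  }

  end(buf: Uint8Array, offset: number): void {
    // `subarray` is a view over the same memory, not a copy.
    this.cb(buf, this.start, offset);
    this.start = -1;
  }
}

// Reporting the header name inside "Host: example.com":
const data = new TextEncoder().encode("Host: example.com");
let headerName = "";
const span = new Span((buf, s, e) => {
  headerName = new TextDecoder().decode(buf.subarray(s, e));
});
span.begin(0);
span.end(data, 4);
```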

  • Of course, there are a couple of important actions that I have not mentioned on previous

  • slides but that are actually mandatory to have in each state.

  • They are called otherwise and skipTo.

  • And those specify which states of the parser should be reached next if nothing else matches

  • inside of the current state.

  • So, in this example, if the input were A, the parser would move to the A state;

  • for B, it would move to the B state; and for C or D or E or whatever else,

  • it would move to some other state.

  • Skip to is quite similar.

  • It's the same thing but consumes the character from the input stream.

  • Otherwise, it does not change the input stream at all.

  • It just moves on.
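The two fallbacks can be sketched with a toy dispatch function: matched edges consume the current character; the fallback consumes it only for skipTo, while otherwise leaves the input position untouched.

```typescript
// Toy dispatch: `consume: true` models skipTo, `false` models otherwise.

interface Step {
  next: string;
  pos: number;
}

function dispatch(
  edges: Record<string, string>,
  fallback: { next: string; consume: boolean },
  input: string,
  pos: number,
): Step {
  const ch = input[pos];
  if (ch !== undefined && ch in edges) {
    return { next: edges[ch], pos: pos + 1 }; // matched edges consume
  }
  return { next: fallback.next, pos: fallback.consume ? pos + 1 : pos };
}
```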

  • So, that was a bit of a description, maybe too concise to be useful, of the DSL.

  • And with this DSL in mind, llhttp becomes a TypeScript program.

  • This program uses the DSL to describe the parsing states and actions, as said before.

  • Because it's a TypeScript program, or a JavaScript program, really, I could split it into several

  • submodules and reuse them efficiently.

  • And each submodule could contain a sub-parser.

  • This is what I do in llhttp.

  • I have a separate sub-parser inside of it that can be run separately, and it could be

  • used separately as a library as well if anyone wants it.

  • llparse compiles this TypeScript program down to C. And that's its main function.

  • Note that because it is an embedded DSL, llparse doesn't need

  • to do any parsing of the DSL itself.

  • It's done automatically by the JS engine.

  • So, V8 does it for us.

  • V8 does this itself internally.

  • llparse builds a graph of states, which I will try to show you.

  • It looks kind of terrible.

  • But I probably can zoom in.

  • Yeah.

  • So, yeah, here is how it looks in practice.

  • I can probably actually show you something more useful.

  • So, here on the right you see ACL, BIND, CHECKOUT, and so on.

  • These are the method names supported by the parser.

  • When a name is matched at the start of the input,

  • it will store the integer encoding of the method inside the internal properties of

  • the parser.

  • It works this way more or less.

  • I guess what that means is that this kind of graph looks awesome.

  • So, that's one of the reasons to have it.

  • And another reason to have it is that llparse can do static analysis on this graph.

  • Before, in the original parser, there was no way to reason about the states automatically.

  • So, it was all described manually inside of the parser.

  • There was just a lot of C code.

  • And there was no way to analyze it and do any optimizations or checks.

  • What I can do in llparse is check the code for the absence of infinite loops, which are possible

  • because otherwise states can be chained together,

  • not making any progress over the input and just looping indefinitely.

  • It's important to check this statically because there might be an array of conditions that

  • could trigger it.

  • It might not be immediately verifiable with the test suite.
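The check can be sketched as a graph problem: otherwise edges consume no input, so a cycle made only of such edges would loop forever without progress, and a depth-first search over just the non-consuming edges detects it. (This is an illustration of the idea, not llparse's actual implementation.)

```typescript
// Detect a cycle among non-consuming (`otherwise`) edges via DFS coloring.

type OtherwiseGraph = Record<string, string[]>; // state -> otherwise targets

function hasNonConsumingCycle(graph: OtherwiseGraph): boolean {
  const GRAY = 1; // on the current DFS path
  const BLACK = 2; // fully explored
  const color: Record<string, number> = {};

  const visit = (node: string): boolean => {
    color[node] = GRAY;
    for (const next of graph[node] ?? []) {
      if (color[next] === GRAY) return true; // back edge: cycle found
      if (color[next] === undefined && visit(next)) return true;
    }
    color[node] = BLACK;
    return false;
  };

  return Object.keys(graph).some(
    (node) => color[node] === undefined && visit(node),
  );
}
```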

  • Speaking of optimizations, the compiler can do peephole optimizations.

  • That's a fancy name for just combining similar states together into one.

  • This way the amount of code is reduced, and also the programmer has a

  • lot more freedom when developing the parser, because they don't have to think about

  • doing these optimizations manually in their code.
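A sketch of the peephole idea: states whose outgoing transitions are identical can share one implementation. This toy version merges states with byte-for-byte identical transition tables; a real optimizer would also have to account for attached actions and successor equivalence.

```typescript
// Merge states with identical transition tables into one representative.

type Transitions = Record<string, string>; // input char -> next state

function mergeEquivalentStates(
  states: Record<string, Transitions>,
): Map<string, string> {
  const canonical = new Map<string, string>(); // state -> representative
  const bySignature = new Map<string, string>();
  for (const [name, trans] of Object.entries(states)) {
    const signature = JSON.stringify(
      Object.entries(trans).sort(([a], [b]) => (a < b ? -1 : 1)),
    );
    const representative = bySignature.get(signature);
    if (representative !== undefined) {
      canonical.set(name, representative); // duplicate: reuse earlier state
    } else {
      bySignature.set(signature, name);
      canonical.set(name, name);
    }
  }
  return canonical;
}
```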

  • So, as I said, the DSL is quite stable in llparse.

  • So, the same program can be compiled to different outputs.

  • At this moment, C and LLVM bitcode are well supported.

  • And bitcode was actually supported before C. So, that's the reason why both

  • projects have a double L in their names.

  • Kind of a historical reason.

  • But in the end, when I ran benchmarks, the C compiler's output worked better than the bitcode,

  • which I had spent several weeks on.

  • It was surprising to me, maybe shocking.

  • I had a few months to think about it.

  • Yeah.

  • And speaking of performance, you might be wondering how fast this new parser is given

  • that it is not handwritten at all.

  • But, of course, despite the maintenance issues, as I said before, the original parser has

  • quite good performance.

  • It's good enough, still very good.

  • It would have been very unreasonable to replace it with something that performed significantly

  • worse.

  • So, llhttp is not handwritten, and it's not hand-optimized.

  • And when I ran benchmarks, I discovered that it runs twice as fast as the original parser.

  • [ Applause ] Thank you.

  • And here are the actual benchmark numbers.

  • So, as you can see, both parsers did quite well.

  • The chart presents the number of requests per second that each parser can handle.

  • Either way, both parsers run in microseconds per request.

  • You wouldn't ever see this in the profiler logs.

  • It's just impossible to see.

  • And, of course, it's important to note as well that llhttp is more than two times faster

  • than the original parser.

  • But hardly anyone cares about that.

  • This llhttp parser is the default in Node version 12.

  • If you are using the latest Node, and I hope you do, you are already running it.

  • And even more, you have a reason to blame me for any kinds of HTTP problems that you

  • have.

  • So, feel free to do so and it would be rightful.

  • Please open an issue on GitHub.

  • Don't just brag on Twitter how bad I am.

  • And tag me on GitHub.

  • I will be happy to look into it and fix it as soon as possible.

  • As I said before, over ten years the original parser has accumulated a comprehensive

  • test suite.

  • It would be really unwise to get rid of it and just start over.

  • So, I ported all the original tests to markdown.

  • And added new ones.

  • And here they are, described by markdown files.

  • So, they are complete, and I encourage you to take a look at them if you want.

  • Because I find them the most amusing part of the project that I worked on.

  • Yes.

  • They're really, really fun.

  • I'm going to give this a bit more time, because I see that some people have scanned

  • the QR code.

  • Okay.

  • So, that should do it.

  • And each markdown file contains several tests.

  • They are separated by the headings.

  • So, it's quite easy to read.

  • And even easier to contribute.

  • In fact, we had one contributor that submitted a test without asking a single question about how to do

  • so.

  • And I don't know that many projects that use a test system that's this easy to contribute to.

  • And furthermore, each test has an actual description inside of it.

  • They have a code chunk for the input inside the markdown, a code chunk for the expected output, and between

  • those you can put just any text you want.

  • You can put hyperlinks, you can put images.

  • You can put GIFs of pets doing stuff.

  • So, it's quite a nice way of writing tests.

  • And it especially works well for HTTP because it's a textual protocol.

  • And you can see what's actually inside the test suite.
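The layout described above is roughly the following; the chunk tags and expected-output format here are illustrative, not the exact llhttp conventions.

````markdown
## Sample request test

Free-form prose, links, or even images can go here, between the chunks.

```http
GET /demo HTTP/1.1
Host: example.com


```

```log
(expected parser events, one per line)
```
````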

  • Yeah.

  • So, that has been the present of the parser and its history.

  • So, naturally I would like to talk about what's coming next.

  • And there is room for future improvements with regard to doing more static checks on

  • the graph I showed you, and doing more performance optimizations.

  • There are forks of the original parser that use CPU-specific vector instructions, which are

  • really, really fast.

  • They perform better than both parsers.

  • There's not much use for this speed bump in Node.

  • But nevertheless, there's no reason not to explore it.

  • I probably should do it, or I would love for someone to do it for me.

  • Quite shamefully, the project is not well documented.

  • It has quite a bit of information describing the API of llparse, but it's scattered through

  • several dependencies and could be made more accessible.

  • I could use help with this and I'm sure some people might find this quite entertaining

  • as well because there's a lot of interesting moments in this parser, in this code.

  • Finally, it would be fascinating to see different types of parsers implemented on top of llparse.

  • Two examples that I have in mind are SMTP and similar protocols; they are textual and could be

  • implemented with llparse.

  • I really look forward to either working on this or helping someone with this.

  • And, yeah.

  • Here is a link to the llhttp repo.

  • It's quite easy to find on GitHub.

  • Just under the Node.js organization.

  • github.com/nodejs/llhttp.

  • So, I guess that's it.

  • Thank you so much for listening.



llhttp - new HTTP 1.1 parser for Node.js by Fedor Indutny | JSConf EU 2019
