So, today what I'd like to do is to relate the
importance of big data in healthcare to what we're
talking about today, identify some of the
critical steps to make data useful so when you think of
electronic health record data or secondary use of
existing data, there is a lot that has to be done to
make it useable for purposes of research.
Look at some of the principles of big data
analytics and then talk about some examples of some
of the science, and you'll hear a lot more about that
during the week in terms of more in depth on that.
So, when we think about big data science, it's really
the application of mathematical algorithms to
large data sets to infer probabilities
for prediction.
That's the very simple definition.
You'll hear a number of other definitions as you go
through the week as well.
And the purpose is really to find novel patterns in data
to enable data driven decisions.
I think as we continue to progress with big data
science, we won't only find novel patterns but in fact
we'll be able to do much more in terms of
testing hypotheses.
One of my students was at a big data conference that
Mayo University in Minnesota was putting on, and one of
the things that they're starting to do now is to
replicate clinical trials using big data, and they're
in some cases able to come up with results that are 95
percent similar to having done the clinical
trials themselves.
So we're going to be seeing a real shift in the use of
big data in the future.
So when I think about big data analytics, what this
picture's really portraying is big data analytics exists
on a continuum for clinical translational science from
T1 to T4 where there's foundational types of work
that need to be done but we actually need to apply the
results in clinical practice and to learn from clinical
practice that it then informs foundational
science again.
When you look at the middle of this picture, what this
is really showing is that this is really what nursing
is about.
If you look at the ANA's scope and standards of
practice on the social policy statements, nursing
is really about protecting, promoting health and then to
alleviate suffering.
So when we focus on -- when we think about big data
science in nursing, that's really kind of our area
of expertise.
And what you see on the bottom of this graph is it's
really about when we move from data, you know, we
don't lack data.
We lack information and knowledge and so it's really
about how we transform data into information into
knowledge, and then the wise use of that information
within practice itself.
This was, I -- we were doing a conference back in
Minnesota on big data and I happened to run into this
graphic that just, you know, it's like how fast is data
growing nowadays?
And so what you can see is data flows so fast that the
total accumulation in the past two years is a zettabyte.
And I'm like, "Well, what is a zettabyte?"
A zettabyte is a one with 21 zeroes after it.
And that what you can see is the amount of data that
we've accumulated in the last two years equals all
the total information in the last century.
So the rate of growth of data is getting to be huge.
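To make the "one with 21 zeroes" figure concrete, here is a small arithmetic sketch. It assumes the decimal (SI) definition of the prefixes; the variable names are only illustrative.

```python
# One zettabyte in bytes under the SI definition: a 1 followed by 21 zeros.
ZETTABYTE = 10 ** 21
TERABYTE = 10 ** 12

# Count the zeros by looking at the decimal digits of the number.
zeros = len(str(ZETTABYTE)) - 1

# How many terabytes (a common disk-sized unit) fit in one zettabyte.
terabytes_per_zettabyte = ZETTABYTE // TERABYTE

print(zeros)
print(f"{terabytes_per_zettabyte:,}")
```

In other words, a zettabyte is a billion terabytes.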
Data by itself though, isn't sufficient.
It really needs to be able to be transferred or
transformed into information and knowledge.
Well, when we think about healthcare, what we can see
is that the definition is that it's a large volume,
but it might not be large volume.
So when you think about genomics sometimes it's not
a large volume, but it's very complex data, and that
as we think about getting beyond genomics and we think
about where we're at, it's really looking at where are
all the variety of data sources and, it's the
integration of multiple datasets that we're really
running into now.
And it's data that accumulates over time, so
it's ever changing and the speed of it is
ever changing.
What you can see in the right-hand corner here is
that there -- as we think about the new health
sciences and data sources, genomics is a really
critical piece, but the electronic health record,
patient portals, social media, the drug research
test results, all the monitoring and sensing
technology and more recently adding in geocoding.
So as we think about geocoding, it's really the
ability to pinpoint the latitude and longitude of
where patients exist.
It's a more precise way of looking at the geographical
setting in which patients exist, and that there's a
lot of secondary data then around geocodes that can
give us background information about
neighborhoods that include such things as, you know,
looking at financial class, education.
Now it doesn't mean that it always applies to me,
because I might be an odd person in a neighborhood,
but it gives us more background information that
we may not be able to get from other resources.
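As a rough sketch of how a geocode can pull in neighborhood background, the example below looks up a latitude and longitude in a small table of areas. Everything in it is an assumption for illustration: the bounding boxes, the income and education values, and the function name are all made up, and real work would use proper census geography rather than rectangles.

```python
from dataclasses import dataclass


@dataclass
class Neighborhood:
    name: str
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float
    median_income: int   # hypothetical value
    pct_college: float   # hypothetical value


# Hypothetical areas described by simple bounding boxes.
NEIGHBORHOODS = [
    Neighborhood("Area A", 44.90, 44.95, -93.30, -93.20, 52000, 0.31),
    Neighborhood("Area B", 44.95, 45.00, -93.30, -93.20, 78000, 0.55),
]


def neighborhood_context(lat, lon):
    """Return neighborhood-level attributes for a geocoded point, or None."""
    for n in NEIGHBORHOODS:
        if n.min_lat <= lat < n.max_lat and n.min_lon <= lon < n.max_lon:
            return n
    return None


# A made-up patient location; the result describes the area, not the person.
context = neighborhood_context(44.97, -93.25)
print(context.name, context.median_income, context.pct_college)
```

The attributes returned describe the neighborhood as a whole, so they are background context and may not hold for any one individual living there.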
So, big data is really about volume, velocity, veracity
as Dr. Grady pointed out earlier today.
Now as we think about big data, 10 years ago when I
went to the University of Minnesota and my Dean,
Connie Delaney, had talked about doing data
mining and I thought, "Oh, that sounds
really interesting."
Because I was in the software business before and
our whole goal was to collect data in a
standardized way that can be reused for purposes of
research and quality improvement.
I just didn't know what to do with it once I got it.
And so I've had the fortune to work with data miners.
We have a large computer science department that is
internationally known for its data mining, and a lot
of that work was funded primarily by the National
Science Foundation at that time because it was really
about methodologies.
Well, now we're starting to see big data science being
funded much more in the mainstream. In addition, NIH, CTSA,
et cetera, are all working on how do we fund the
knowledge, the new methodologies that we need
in terms of big data science?
So, an example of some of the big data science that
really is funded already today is that if we look at
our CTSAs.
So, there are 61-plus CTSAs, Clinical and Translational
Science Awards, across the country and the goal is to
be able to share methodologies, to have
clinical data repositories and clinical data
warehouses, and then to begin to start to say, "How
do we do some research that goes across these CTSAs?
How do we collaborate together?"
Or as we look at PCORnet.
PCORnet is another example.
So as we think about, there are 11 clinical data
research networks -- this may have increased by now --
as well as 18 patient powered research networks.
We happen to participate in one that has 10 different
academic healthcare systems working together,
and it means that for our data warehouse we have to
have a common data model with common data standards
with common data queries in order to be able to look at
research such as we're looking at ALS, obesity, and
breast cancer.
And wouldn't it be nice if we could look at some of the
signs and symptoms that nurses are interested in, in
addition to looking at specific kinds of diseases?
When we look at some of the work that Optum Health as
well as other insurance companies are doing, they're really
beginning to take a look at amassing large datasets.
So Optum Labs happens to have 140 million lives from
claims data, and they're adding in 40 million lives
from electronic health records, so that provides
really large data sets for us to be able to ask some
questions in ways that we haven't been able to do.
I'm excited about reuse of existing data, and so
hopefully some of that enthusiasm will rub off on
you today because it's really a great opportunity.
Now, in order to use large data sources, what that
means is that we need a common data model.
We need standardized coding of data and we need
standardized queries.
What I mean by that is that if we don't ask about the
same variables and we don't collect the data or code the
data in the same ways, it makes it hard for us to be
able to do comparisons then across software vendors or
health systems or academic institutions.
And with the PCORI grant for instance, we're actually
looking at how do we do common queries so that if
we've got the common models, we can write a query and
share the queries with others to be able to pull
data out from multiple health systems in a
similar way.
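A minimal sketch of that idea follows. It assumes two made-up sites that record the same variable with different field names and codes; once each site maps its rows into one common shape, a single shared query can run unchanged at both. The field names, code lists, and functions are hypothetical and are not taken from the PCORnet common data model itself.

```python
# Hypothetical local extracts: each site names and codes the same variable differently.
site_a_rows = [{"pt": 1, "sex": "F"}, {"pt": 2, "sex": "M"}]
site_b_rows = [{"patient_id": 10, "gender_cd": 2}, {"patient_id": 11, "gender_cd": 2}]

# Assumed local code list for site B.
SITE_B_SEX_CODES = {1: "M", 2: "F"}


def site_a_to_common(row):
    """Map a site A row into the common shape."""
    return {"patient_id": row["pt"], "sex": row["sex"]}


def site_b_to_common(row):
    """Map a site B row into the common shape, translating its local code."""
    return {"patient_id": row["patient_id"], "sex": SITE_B_SEX_CODES[row["gender_cd"]]}


def count_by_sex(common_rows):
    """The shared query: written once against the common shape."""
    counts = {}
    for row in common_rows:
        counts[row["sex"]] = counts.get(row["sex"], 0) + 1
    return counts


result_a = count_by_sex([site_a_to_common(r) for r in site_a_rows])
result_b = count_by_sex([site_b_to_common(r) for r in site_b_rows])
print(result_a, result_b)
```

The per-site mapping functions are where the local work happens; the query itself never needs to know which site it is running at.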
So I'm going to talk about what I mean by that a little
bit more and show you examples of how we have to
be thinking in nursing about this as well as thinking
So when you look at PCORnet, they started with a common
data model one, then they went to version two, and now
this is version three that's being worked on at this time.
So you can see in the top left hand corner we have
conditions which might be patient reported conditions
as well as healthcare provider conditions, but you
can also see down in the left hand corner that
there are also diagnoses.
So diagnoses have ICD9 coding that goes with them.
ICD10 is that unfolding.
Notice when you think about your science, where is the
data that you want for your science, and is it
represented in this common data model?
I would suggest that there are many types of data in the
common data model that are important to all of us as we
think about where we're going, whether it's
demographics or medications or, you know, what are the
kinds of diseases that people have?
And there's also something missing as we move forward.
So, before I get to what's missing one of the things
that I want to point out that's critical is that in
order for PCORI or NCATS or any of these other
organizations to be able to do queries across multiple
institutions they have to have data standards.
And so when we look at demographics for instance,
OMB is the standard that we use for demographics.
When we look at medications, it's RxNorm for medications.
Laboratory is coded with LOINC.
Procedures are coded with CPT, HCPCS, or ICD9/ICD10 codes.
We also have diagnoses that have ICD9/ICD10 but in
addition, SNOMED CT codes are another type of standard.
And when we look at vital status we're looking at the
CDC standard for vital status and with vital signs
they're using LOINC.
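The pairing of data domains with standards just listed can be written down as a simple lookup, which then lets you check whether incoming records use the expected coding system. This is only a sketch: the domain keys, the record layout, and the function name are assumptions, and only the domain-to-standard pairs come from the talk.

```python
# Domain-to-standard pairs as listed above; the dictionary keys are illustrative names.
STANDARDS_BY_DOMAIN = {
    "demographics": {"OMB"},
    "medications": {"RxNorm"},
    "laboratory": {"LOINC"},
    "procedures": {"CPT", "HCPCS", "ICD9", "ICD10"},
    "diagnoses": {"ICD9", "ICD10", "SNOMED CT"},
    "vital_status": {"CDC"},
    "vital_signs": {"LOINC"},
}


def uses_expected_standard(domain, coding_system):
    """True when the coding system is one of those expected for the domain."""
    return coding_system in STANDARDS_BY_DOMAIN.get(domain, set())


# Hypothetical incoming records from a local system.
records = [
    {"domain": "medications", "coding_system": "RxNorm"},
    {"domain": "laboratory", "coding_system": "local lab codes"},
    {"domain": "vital_signs", "coding_system": "LOINC"},
]

nonconforming = [
    r for r in records
    if not uses_expected_standard(r["domain"], r["coding_system"])
]
print(nonconforming)
```

Records that fail the check are the ones that would need mapping before a cross-institution query could use them.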
So LOINC started as a laboratory data standard.
It's expanded to include types of documents.
It also has expanded now to include a lot of
clinical assessments.
So you're going to find the MDS used in nursing homes,
OASIS that's used in homecare, you'll see things
like the Braden or the Morse Fall Scales, and we're
expanding more types of assessments that are
important to nurses in the LOINC coding.
It also, by the way, includes the nursing
management minimum dataset, which the announcement just
came out this week that we've just finished updating
variables and they've been coded in LOINC, so if you
wanted to look at the work of Linda Aiken, for
instance, you'd find standard codes that can be
used across multiple settings.
So, our vision of what we want to see in terms of
clinical data repositories that are critical for nurses
is when we look at clinical data, we need to expand that
to include the nursing management minimum data --
the nursing management dataset.
What that means is we need to look at nursing
diagnosis, nursing interventions, nursing
outcomes, acuity, and we also have to take a look at
a national identifier for nurses.
Which, by the way, every registered nurse can apply
for an NPI which is the National Provider Identifier
so that we could track nurses across settings, just
like we do any other -- you know, the physicians or the
advanced nurse practitioners, but it's
available for any RN to be able to apply.
So, when we extend what data's available, if we
added in what are the interventions that nurses do?
What are the additional kinds of assessments that
nurses do?
That data is really critical for us to be able to do big
data science.
What you can also see is that there's management data
-- often times we think of that as claims data -- but
when you think about management data it needs to
go beyond that when we start talking about
standardized units.
Like if I see a patient in an ICU does it matter and
how do we even name ICUs?
Or psychiatric units?
At Mayo we used to call it 3 Mary Brigh.
Well, how generalizable is that?
So there are ways to be able to generalize the naming of
units and that actually builds off of the
NDNQI database.
And then when we look at the workforce in nursing, Linda
Aiken's work I think is just stellar in terms of really
trying to understand, what are the things that we
understand about nurses because they affect
patients' outcomes, and they also affect our nursing
workforce outcomes as well.
So our clinical data repositories need to expand
to include additional data that's sensitive to nurses
and nursing practice, and it also needs to go across the
continuum of care.
Now, at the University of Minnesota, we have a CTSA
award, and our partner is Fairview Health Systems.
And so you can see here that as we built our clinical
data repository we have a variety of different kinds
of data about patients and about encounters that we
have available to reuse for purposes of research.
You can bet that the students that I have in the
doctoral program are all being trained to be big
data researchers.
It's like, "Stick with me kid, because this is the way
we're going."
So they use this but they also use, like, some of the
tumor registries or transplant registries as
other data sources as well.
And this data's available then for looking at cohort
discovery, recruitment, observational studies, and
predictive analytics.
Now, when you look at what's actually in there and we
characterize that data, we basically have over 2
million patients just in this one data repository,
and we have about 4 billion rows of unique data, so we
don't lack data.
What's important to take a look at is, what is the
biggest piece of the pie here?
It's flow sheet data.
And what is flow sheet data?
>> Female Speaker: [inaudible]
Bonnie Westra: Yeah, it's primarily nursing data, but
it's also interprofessional, so PT, OT, speech and language,
dietician, social workers, there are specialized data
collection for, like, radiation oncology and that
kind of stuff.
But a lot of it is nurse sensitive data.
So one of the things that we've been doing as part of
our CTSI or CTSA award, is we're looking at this what
we call extended clinical data, and developing a
process to standardize how we move from the raw data
and mapping the flow sheet data to clinical data models.
And these clinical data models then will become
generalizable across institutions, while the actual
mapping to the flow sheet I.D.s will be unique to
each institution.
One of the reasons this is important is I was just
working on our pain clinical data model this last weekend
trying to get ready to move it into a tool we call i2b2,
and we had something like 364 unique I.D.s for the way
we collect pain data, and those 364 unique I.D.s
actually represented something like 54 concepts.
Or, actually, I think they represented 36 concepts, and when
you do pain rating on a scale of 0 to 10, we had 54
different flow sheet I.D.s that are pain rating of 0 to 10.
Why don't we have one?
So, what that means is that we have a concept in our
clinical data model called pain rating, specifically
0 to 10.
We also have the FLACC and the Wong-Baker and you know,
every other pain rating scale possible in the system.
But it means that we have to identify a topic like pain.
We have to identify what are the concepts that are
associated with that.
Then we have to look at how we map our flow sheets to
those concepts.
We then present it to our group in an interactive
process for validation before we can actually move
that into making it useful for purposes of research
-- researchers.
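Here is a minimal sketch of the mapping step just described, where many institution-specific flow sheet I.D.s collapse onto a small number of shared concepts. The I.D.s and the mapping table are hypothetical; only the idea of many I.D.s per concept comes from the talk.

```python
from collections import defaultdict

# Hypothetical flow sheet row IDs; real IDs are specific to each institution.
FLOWSHEET_TO_CONCEPT = {
    "FS-1001": "pain rating 0-10",
    "FS-2040": "pain rating 0-10",
    "FS-3377": "pain rating 0-10",
    "FS-4100": "FLACC score",
    "FS-4200": "Wong-Baker FACES score",
}


def ids_per_concept(mapping):
    """Group flow sheet IDs under the concept each one represents."""
    grouped = defaultdict(list)
    for fs_id, concept in mapping.items():
        grouped[concept].append(fs_id)
    return dict(grouped)


def to_concept(observation):
    """Re-express one raw observation in terms of the shared concept."""
    return {
        "concept": FLOWSHEET_TO_CONCEPT[observation["flowsheet_id"]],
        "value": observation["value"],
    }


grouped = ids_per_concept(FLOWSHEET_TO_CONCEPT)
standardized = to_concept({"flowsheet_id": "FS-2040", "value": 7})
print(len(grouped), grouped["pain rating 0-10"], standardized)
```

The concept names are the part that could be shared across institutions; the mapping table on the left-hand side is what each institution would have to build for itself.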
So we now have a standardized process that
we've been able to develop, and now we're moving it into
trying to develop open source software, so that if
you wanted to come play with us and you wanted to say, "I
like the model you're using and I want to use it, and
let's see if we can do some comparative effectiveness
research," that it's something that can be shared
with others.
And that's part of the nature of the CTSA awards is
that we develop things that can be used across so
everybody doesn't have to do it independently.
So here's examples of some of the clinical data models
that we've been developing.
So behavioral health, we have somebody who's a
specialist in that area who's working on a couple
of models.
Most of them are physiological at this point,
and we started that way because of another project
we're working with.
But one of the things that we started with internally is
we said, "What are the quality metrics that we're
having to report out that are sensitive to nursing?"
So when you're looking at prevention of falls,
prevention of pain, CAUTI, VTE, and one other I can't
think of right now, but we really tried to take a look
at what are those things that are really sensitive to
nursing practice and then how do we build our data
models that can be used for quality improvement, but
also can be used then for purposes of research?
If we do certain things at a certain point in time, does
it really matter?
And then we've extended it to some other areas that
are, you know, based on what are the most frequent kinds
of measures that might be important to nurse
researchers to be able to work with.
Now, one of the things that the CTSAs do is many of them
use a tool called i2b2, and i2b2 can do many things, but
one of the first things it does is it provides you with
de-identified counts, of how many patients do you have
that meet certain criteria; so if you're going to submit
a grant, that you would be able to know whether you had
enough patients to actually potentially recruit.
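The kind of feasibility count described here can be sketched in a few lines. This is not i2b2's own interface; it is a plain illustration with made-up patient records and a hypothetical function name, showing that only a number leaves the function and never the patient rows.

```python
# Made-up patient records; identifiers stay inside the repository.
patients = [
    {"id": "p1", "age": 72, "diagnoses": {"pressure ulcer", "diabetes"}},
    {"id": "p2", "age": 45, "diagnoses": {"pressure ulcer"}},
    {"id": "p3", "age": 80, "diagnoses": {"heart failure"}},
    {"id": "p4", "age": 68, "diagnoses": {"pressure ulcer"}},
]


def cohort_count(records, min_age, diagnosis):
    """Return only how many patients meet the criteria, not who they are."""
    return sum(
        1 for p in records
        if p["age"] >= min_age and diagnosis in p["diagnoses"]
    )


eligible = cohort_count(patients, min_age=65, diagnosis="pressure ulcer")
print(eligible)
```

A count like this is enough to judge whether a proposed study could plausibly recruit, before any identifiable data is requested.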
One of the things that is missing out of it is almost
everything that's in flow sheets.
So, Judy Warren and colleagues proposed an
example of what would it look like in i2b2 if we
added in some of the kinds of measures that we're looking
at that are like review of systems of some of the
clinical quality measures.
So we're in the process of really looking at a whole
methodology of how to move that flow sheet data from
the data models into i2b2 so that anybody could say,
"Oh, I'd like to study, you know, prevention of
pressure ulcers.
How many stage four pressure ulcers do we actually have
and, you know, what kind of treatments are they getting
and does it matter?"
And so that's an example of how this tool will be used.
Now, in order to make data useful it also has to be coded.
So remember the slide I showed you that showed we're
using RxNorm and we're using LOINC and we're using OMB
and we're using CDC codes?
Well, when we look at what code set should be used for
standardizing the data that we use that's not part of
those kinds of data, you'll see that the American Nurses
Association actually has recognized 12 terminologies
or datasets and they're done recognizing new ones.
Now it's just continuing to keep them up to date.
And so, the ANA just came out with a new position
statement, "Inclusion of recognized terminology
supporting the nursing practice within electronic
health records and other information solutions."
What they say in that new paper that just
came out is that all healthcare settings should
use some type of a standardized terminology
within their electronic health records to represent
nursing data.
It makes it reusable then for purposes of quality
improvement and comparative effectiveness research.
However, when it is stored within clinical data
repositories or when we're looking at interoperability
across systems, then SNOMED CT is the standard that
would be used for nursing diagnosis.
So you might use the Omaha System or NANDA or CCC or
any of these, but it has to be mapped then to SNOMED CT
so that if I'm using the Omaha system and you're
using ICNP, that they actually can talk to each
other where they have comparable terms.
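A small sketch of why the shared reference terminology matters: two systems using different local terminologies become comparable once each local term is mapped to the same reference concept. The term labels and the reference identifiers below are placeholders for illustration only; they are not real SNOMED CT concept IDs or actual entries from any terminology.

```python
# Placeholder reference identifiers; these are NOT real SNOMED CT concept IDs.
# The local term labels are illustrative, not copied from either terminology.
SYSTEM_A_TO_REFERENCE = {"Pain": "REF-0001", "Skin": "REF-0002"}
SYSTEM_B_TO_REFERENCE = {"Pain": "REF-0001", "Impaired Mobility": "REF-0003"}


def comparable(term_a, map_a, term_b, map_b):
    """Two local terms are comparable when both map to the same reference concept."""
    ref_a = map_a.get(term_a)
    ref_b = map_b.get(term_b)
    return ref_a is not None and ref_a == ref_b


same_concept = comparable("Pain", SYSTEM_A_TO_REFERENCE, "Pain", SYSTEM_B_TO_REFERENCE)
different = comparable("Skin", SYSTEM_A_TO_REFERENCE, "Impaired Mobility", SYSTEM_B_TO_REFERENCE)
print(same_concept, different)
```

Terms with no mapping at all come back as not comparable, which is the case the speaker flags with "where they have comparable terms."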
What the ANA has also recommended is that nursing
interventions, while there's many standardized
terminologies, actually use SNOMED CT for being able to
do information exchange and for building your data
warehouses if you're using different systems that you
want to do research with.
And that nursing outcomes would be used with SNOMED
CT, sometimes maybe LOINC, and that assessments be used
with LOINC, and I won't go into all the details
underneath that because it's more complicated than that.
Because sometimes the answers are LOINC and
sometimes they're SNOMED CT, depending.
So there's a lot that goes on behind the scenes, but
this is really important because if -- and this
actually comes off of the ONC recommendations for
interoperability for clinical quality measures --
that's how these standards actually came about so that
it's consistent with the federal policy when we're
doing this.
So, ANA, it's on their website.
The URL was so long that we had permission just to put
it on our website and give you a short URL.
So if you want to learn more about it the URL is listed
down here.
So, another effort that is going on is that in addition
to some of the foundational work that we're doing
through the CTSA, is that there is a whole group
that's headed by Susan Matney that is about how do
we build out an assessment framework in very specific
coding for the kinds of questions that we ask for
physiological measures?
So when we look at the LOINC assessment framework we
start with first physiological measures, and
then there's other things shown in orange called the
future domains that also have to look at what are the
assessment criteria that are documented in electronic
health records that need standardized code sets?
So there's a group that Susan Matney is heading up
that includes software vendors, different
healthcare systems, people with EHRs that aren't the
same EHRs, and they're pulling together a minimum
set of assessment questions and getting standardized
codes for those minimum set of assessment questions and
they were just submitted to LOINC I think the end of
June for final coding and distribution in the next
release of LOINC.
And this group is continuing on to build out additional
criteria for assessment, so that we have comparable
standards across different systems.
Now, I mentioned the nursing management minimum
dataset -- this was actually developed back in about 1997,
recognized by the American Nurses Association, and has
just been updated for two out of the three areas.
So in the environment you can see the types of data
elements that are included -- and this is very
high-level data elements -- there's a lot of detail
underneath these.
And you can see nursing resources.
Now, when this was updated we harmonized it with every
standard we could possibly find.
A lot of it has been NDNQI, so the National Database of
Nursing Quality Indicators, but it's also been harmonized with
every other standard we could find so that there
weren't different standards consistently for these types
of variables.
It also -- if you've followed the Future of
Nursing -- Future of Nursing work from the IOM report and
the Robert Wood Johnson Foundation, it matches the
workforce data that they're trying to collect through
the national board -- state boards of nursing.
So again, if you're collecting data for one
reason, in fact you can actually use it for multiple
reasons when you're using a standard across the country.
So, there is a reference here.
You can go to LOINC.org, and if you look under news
you'll see the release that came out this last week
about this, and then you'll also see that if you go to
the University of Minnesota website that the
implementation guide is available that gives you all
of the details that you never wanted to know but
need if you're actually going to standardize your data.
So, the point of all this is that when you think about
using big data and you want to do nursing research, it's
really critical that we think about all of our
multiple data sources whether it's electronic
health record or if you're thinking about with
management minimum dataset for instance.
You're thinking about scheduling, you're thinking
about HR data, and that doesn't even begin to get
into all the device data and the personal data
contributed by patients.
So that's additional data, and think about what it's
going to take to standardize that in addition.
It won't be on my plate, but many of you might want to
actually do that because it's a really good way to
begin to move forward.
So the message that I wanted to leave you with on that is
there's lots of data.
When we think about nursing research, we are at the
very beginning of starting to say, "What data?"
And how do we standardize that data?
And how do we store and retrieve that data in ways
that we can do comparative effectiveness research with
that data or some of the big data science?
Just one example, which I'm not going to cover today but
I'll talk a little bit about tomorrow, is we're pulling
data out of electronic health records to try to say, how
do we really understand patients that are likely to
have sepsis, and then there's the sepsis bundle,
that if you do -- you know, if you do certain types of
evidence-based practice quickly and on time, you can
actually prevent complications.
Well, we're pulling out electronic health record
data, and guess what?
This is really interesting.
We got an NSF grant to do this and so we said, "Well,
we're going to look at evidence-based practice
guidelines, nurses and physicians." Well, guess what?
The evidence-based practice guidelines for nurses aren't
really being used.
And so we're having to figure out how would you
find the data.
Not because nurses aren't doing a good job; it's just that the
guideline types of software weren't used in the way
we thought.
So then we said, "Well, we'll look at, you know,
we'll look at certain data elements, and then we're
also going to look at physician guidelines and are
they being used?"
So, in order to know if you did something in a timely
manner, you have to know, when did somebody suspect
that sepsis began?
Do you know where that's located?
Maybe in a physician's note.
And so the best way to find out if patients are likely
to develop sepsis is nurses' vital signs and the flow
sheet data.
And so consistent documentation in those flow
sheet data becomes really critical.
And then if they're being followed and adjusted, you
have to understand things like fluid balance,
cognitive status, your laboratory data as well as
the vital sign data that's going on with that, and lots
of other stuff.
So this EHR data is critical in terms of being able to
really look at how do we prevent complications.
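The timeliness question raised above depends on having two timestamps: when sepsis was first suspected and when an intervention happened. A minimal sketch of that check follows. The timestamps are made up, and the three-hour window is purely an illustrative parameter, not a clinical recommendation or the actual rule used in the study.

```python
from datetime import datetime, timedelta


def delivered_on_time(suspected_onset, intervention_time, window):
    """True when the intervention falls within the window after suspected onset."""
    elapsed = intervention_time - suspected_onset
    return timedelta(0) <= elapsed <= window


# Made-up timestamps, as they might be pulled from a note and from flow sheet data.
onset = datetime(2015, 7, 1, 8, 0)
intervention = datetime(2015, 7, 1, 10, 30)

# Illustrative threshold only; not a clinical recommendation.
window = timedelta(hours=3)

on_time = delivered_on_time(onset, intervention, window)
print(on_time)
```

If the suspected onset time is buried in free text rather than recorded as a timestamp, this check cannot be computed at all, which is the documentation problem the speaker is pointing to.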
So I'm going to talk a little bit now moving into
more of the analytics.
So when we think about analytics there is a book,
it's free online.
This is not an advertisement for them, but it was one
that changed my life.
And so it's called, "The Fourth Paradigm of Science."
And it really talks about, how do we move into data
intensive scientific discovery?
And one of the things that I think is really interesting
is, how many of you have ever read a book -- a
fiction book -- it's called "The Timekeeper?"
It is really a fun book.
The thing that's fun about it is it talks about before
people knew time existed, they hadn't picked up the
observational pattern thousands of years ago that
basically said that, "Oh, there is this repetitious
thing called time."
It then goes on to talk about the consequences for
us of how we want more of it, you know?
And so it's not always a good thing to discover
things, but, you know, our first science was really
about observations and really trying to understand
what do we notice?
You know, what's the empirical data?
We then moved into thinking about a theoretical branch.
So what are our models?
How do we increase the generalizability of our science?
From there we've moved, in the last few decades, into a
computational branch, which is really how do we simulate
complex phenomena?
And now, we're moving into data exploration or
something that's called e-Science.
So we can hear the term big data, or big data science.
E-Science is another term that's used for that.
So when you look at that, what you can see is that we
have data that's being captured by all kinds
of instruments.
We have data that's processed by software and we
have information and knowledge that's stored
in computers.
And so, what we really have to do is how do we look at
analyzing data from these files and these databases in
coming up with new knowledge?
And it requires new ways of thinking, and it requires
new tools and new methods as we move forward.
So foundational to big data science is algorithms and
artificial intelligence.
So how do we take a look at if this then that, if this
then that?
So it requires structured data, you know, so that we
can develop these algorithms to be able to come
to conclusions.
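The "if this then that" idea can be sketched as a list of rules applied to one structured observation. The field names, the thresholds, and the conclusions below are all made up for illustration and carry no clinical meaning; the point is only that rules like these need the data to arrive in a predictable structure.

```python
# Each rule pairs a condition on a structured observation with a conclusion.
# Thresholds and conclusions are illustrative only, not clinical guidance.
RULES = [
    (lambda obs: obs["heart_rate"] > 100 and obs["temperature_c"] > 38.0,
     "flag for review"),
    (lambda obs: obs["pain_rating"] >= 7,
     "reassess pain"),
]


def apply_rules(observation):
    """Return the conclusion of every rule whose condition holds."""
    return [conclusion for condition, conclusion in RULES if condition(observation)]


observation = {"heart_rate": 112, "temperature_c": 38.6, "pain_rating": 3}
conclusions = apply_rules(observation)
print(conclusions)
```

If a field such as `temperature_c` were recorded as free text in some rows, the rule could not be evaluated, which is why structured data comes first.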
Now machines are much faster at processing these
algorithms than the human mind is, and they can
process much more complex ones.
So our big data science is really about the use of
algorithms that are able to process data in really
rapid ways.
Semi -- what we call -- semi-artificial.
Not totally like you just throw it in there and it
does it and it gives you the answer.
There's a lot more to it than that.
So there's some principles about big data science that
are important, and one of those principles is let the
data speak.
So, what that means is we often times will say, as I
take a look at trying to understand CAUTI, which is one of
the subjects that one of my students is working on.
She's really trying to understand, we have these
guidelines for catheter-associated
urinary tract infection, so how do we
prevent that?
So if we follow the guidelines, why aren't we
doing any better?
And what's missing is we probably don't have the
right data that we're looking at.
So she's actually combining some of the management data
along with the clinical data to try to say are there
certain units?
Are there certain types of staffing?
Is there -- you know, how about staff satisfaction?
You know, how does that all play into all of this?
What's the experience?
What's the education?
You know, what's the certification, the background?
And so, she is throwing in more types of data and then
trying to let the data speak in terms of, you know, does
this provide us any new insights that we can
think about?
Another thing is to repurpose existing data.
So once you have data, 80 percent of big data science
is the data preparation.
I think it's closer to 90, but it takes forever to kind
of get the data set up because it's not like you're
collecting new data with a standardized instrument
that, you know, has all these validity and
reliability, so there's a lot of data preparation and
transformation that needs to go on.
So once you've got that done and you understand the data
and the metadata, that is the context, the meaning,
the background of why do we collect this?
What does it actually mean?
You know, give me the context of this.
Then we can understand, how is it collected?
Why was it collected?
What are the strengths of it?
What are the limitations?
When I first started in this, I worked in
homecare software.
There wasn't anything I didn't know about OASIS.
Because I learned a ton by making every mistake,
working with everybody I could, and understanding
it thoroughly.
When I went to working with big health system data, I'm
like a novice all over again.
So once I get a good dataset set up believe me, I'm going
to be working with that forever.
And so you'll see some examples of that tomorrow on
a different talk.
So in big data science another thing that we have
to think about is that N equals all versus sampling.
So it's not necessarily about random sampling, it's
really about once you've got all the data, you know, how
does that affect your assumptions about what
you're doing in science?
And there's another principle called
correlations versus causality.
So, you know, randomized clinical trials are trying
to understand the why.
Why did this happen?
And what we're trying to understand and when we've
got big data is, you know, what's the frequency with
which certain things occur?
What's the sensitivity?
What's the specificity?
How do we understand the probabilities that go with it?
And so we're oftentimes looking at correlations
versus trying to look at causation.
Big data's messy.
I've had a chance to work with our CTSI database where
they've done a lot of cleanup and standardization
and then I've worked with the raw data, same
software vendor.
I've certainly learned that once you have the data and
you clean it up, it really makes a difference.
And will it ever be perfect?
Absolutely not.
But we think our instruments are perfect, you know?
And they're actually not either.
So there is a certain probability that things
occur, and when you get a large enough dataset,
you know, it really makes a difference in how you work
with the data.
And then there's also a concept called data
storage location.
So, there are some people that think you should put
all the world's data into a central database and work
with it, and then there are others that do something
called federated data queries.
So federated data queries is where, like with our PCORI
grant, everybody has their own data.
It's modeled in the same way and so we can send our
queries to be able to do big data research without having
all the data in the same pot at the same time.
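To make the federated idea concrete, here is a minimal sketch in Python. It assumes every site stores its rows in the same shape and answers a query with a count only. The site names, record fields, and helper functions are made up for illustration; this is not the actual tooling behind the PCORI grant.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]

# Hypothetical sites: each keeps its own rows, all modeled the same way.
SITES: Dict[str, List[Record]] = {
    "site_a": [
        {"age": 71, "diagnosis": "heart_failure"},
        {"age": 64, "diagnosis": "diabetes"},
    ],
    "site_b": [
        {"age": 80, "diagnosis": "heart_failure"},
        {"age": 55, "diagnosis": "heart_failure"},
        {"age": 47, "diagnosis": "asthma"},
    ],
}


def run_local_count(rows: List[Record], predicate: Callable[[Record], bool]) -> int:
    """Runs at the site; only the count leaves, never the rows."""
    return sum(1 for row in rows if predicate(row))


def federated_count(predicate: Callable[[Record], bool]) -> Dict[str, int]:
    """Send the same query to every site and collect per-site counts."""
    return {name: run_local_count(rows, predicate) for name, rows in SITES.items()}


def is_older_heart_failure(row: Record) -> bool:
    """Example query: heart failure patients aged 65 or older."""
    return row["diagnosis"] == "heart_failure" and row["age"] >= 65


per_site = federated_count(is_older_heart_failure)
total = sum(per_site.values())
print(per_site, total)
```

The point of the design is that the query travels and the data stays put, which only works because every site models its data the same way.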
Another thing that's really critical is big data is a
team sport.
I can't say that enough.
If you ask me all the mathematical foundation for
the kind of research we're doing, I'm not the one that
can tell you that.
I work with these computer science guys that have very
strong mathematical background, and I get
educated every day I work with them.
And so we need to -- and I also know from example that
they really don't understand clinical.
And so, you know, when we had a variable gender, they
were going to take male and do male/not male,
female/not female.
And it's like, you only have two answers in the database,
so why do we need four answers [laughs], you know,
for this?
But that's just a simple thing but they don't
understand, like, you know, what's a CVP, for instance.
I have to actually look some of that up now too as I'm
getting further away from clinical but it's really
trying to understand you need a domain specialist.
You need a data scientist.
A data scientist is an expert in databases, machine
learning, statistics, and visualization.
And you need an informatician.
So how do you standardize and translate the data to
information and knowledge?
So, you know, understanding all that database stuff and
the terminology stuff is really important.
As I said, 80 percent is preprocessing of the data.
And then there's a whole thing called dimension
reduction and transformation of data.
So, one of my students said, "Well, I want to use ICD9
codes so I'll ask for those."
And I'm like, "What are you going to do with them?"
And so she finally got down to what I really need to
understand is there's certain diseases that
predispose people to having CAUTI.
And so, I only need to be able to aggregate them at a
very high level to see -- and so it means you have to
know all your ICD9 structure and be able to go up to
immunosuppressive drugs for instance or other diseases
that predispose you to getting infections or
previous history of infections.
So, you don't want 13,000 ICD9 codes.
You really want high-level categories.
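As a rough illustration of rolling codes up, the sketch below maps individual ICD-9-CM codes to a handful of chapter-level categories. The chapter list is abbreviated and should be checked against the official tabular list, the sample codes are arbitrary, and the function name is hypothetical. A real study like the CAUTI one would define its own clinically meaningful groupings rather than use chapters.

```python
from collections import Counter
from typing import List, Tuple

# Abbreviated ICD-9-CM chapter ranges (3-digit category, inclusive).
# Only a few chapters are listed; anything else falls into "other".
CHAPTERS: List[Tuple[int, int, str]] = [
    (1, 139, "infectious and parasitic diseases"),
    (140, 239, "neoplasms"),
    (240, 279, "endocrine, nutritional, metabolic, immunity"),
    (390, 459, "circulatory system"),
    (460, 519, "respiratory system"),
    (580, 629, "genitourinary system"),
]


def icd9_chapter(code: str) -> str:
    """Map one ICD-9-CM code such as '250.00' to a high-level category."""
    category = code.strip().split(".")[0]
    if not category.isdigit():
        # V and E codes are supplementary classifications, not numbered chapters.
        return "supplementary (V/E codes)"
    number = int(category)
    for low, high, label in CHAPTERS:
        if low <= number <= high:
            return label
    return "other"


# Arbitrary sample codes for one hypothetical patient.
patient_codes = ["250.00", "599.0", "428.0", "V58.69", "041.4"]
summary = Counter(icd9_chapter(code) for code in patient_codes)
print(summary)
```

Five codes collapse into five broad labels here, but the same function would collapse thousands of distinct codes into a few categories a model can actually use.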
So it's learning how to use the data, how to transform
the data.
A lot of times we have many questions that represent the
same thing, so do you create a scale?
If your assumption for your data model is that you need
binary data, how do you do your data cuts?
You know?
So with OASIS data we use no problem or little problem
and moderate to severe problem because we need a
binary variable.
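A data cut like that can be written down in a couple of lines. The 0 to 3 coding and the cut point below are assumptions made for illustration, not the actual OASIS item definitions.

```python
from typing import Iterable, List

# Assumed ordinal coding, for illustration only:
# 0 = no problem, 1 = little problem, 2 = moderate problem, 3 = severe problem
CUT_POINT = 2  # scores at or above this count as "moderate to severe"


def dichotomize(scores: Iterable[int], cut_point: int = CUT_POINT) -> List[int]:
    """Return 1 for moderate-to-severe, 0 for no or little problem."""
    return [1 if score >= cut_point else 0 for score in scores]


raw = [0, 1, 2, 3, 1, 2]
binary = dichotomize(raw)
print(binary)  # [0, 0, 1, 1, 0, 1]
```

Where the cut goes is a clinical decision, so it is worth keeping it as a named parameter rather than burying it in the code.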
And so it's that kind of stuff that you need to do.
And then there's all kinds of ways of saying, how do
you understand the strength of your answers?
You can quantify uncertainties so you're
looking at things like accuracy, precision, recall, trying to
understand sensitivity, specificity, using AUCs to
try and understand the strength of your models.
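For readers who want to see how those measures relate, here is a small sketch that computes them from scratch. The labels and scores are invented, the 0.5 threshold is an arbitrary choice, and the code assumes both classes are present and at least one case is predicted positive. In practice a library would normally be used instead.

```python
from typing import Dict, Sequence


def summary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall (sensitivity), and specificity from 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),       # recall and sensitivity are the same thing
        "specificity": tn / (tn + fp),
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Chance that a random positive case scores above a random negative case.

    Ties count as half a win. This equals the area under the ROC curve.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Invented example: 1 = had the outcome, scores = model's predicted risk.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = summary_metrics(y_true, y_pred)
model_auc = auc(y_true, scores)
print(metrics, model_auc)
```

Note that the first four numbers depend on where the threshold is set, while the AUC summarizes the model across all thresholds.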
So I'm going to quickly go through just a few examples
of how we're now moving into using some of these types of
analysis and some of the newer methods of being able
to analyze data.
So, one is natural language processing.
Another is visualization and a third is data mining.
What I'm not going to do is address genomics.
I wouldn't touch that one, it's not my forte.
So, natural language processing, another
name for it is text mining.
And as we take a look at this, five percent
of our data is really structured data and most of it
is not structured data.
So we really need to -- we really need to think about
how do we deal with that unstructured data because it
has a lot of value within it.
But NLP can actually help us create
structured data from unstructured data so that we
can then use that data more effectively.
So, it really uses computer-based linguistics and
artificial intelligence to be able to identify and
extract information, and so free-text data is really
the source.
So think of nurses' notes, for instance.
The goal is to create useful data across the various
sites and to be able to get structured data for
knowledge discovery.
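To give a feel for what "structured data from a note" means, here is a deliberately crude, rule-based sketch. The symptom list, the negation cues, and the sample note are all invented, and real clinical NLP systems are far more sophisticated than keyword matching; this only shows the shape of the output.

```python
import re
from typing import Dict, Optional

# Hypothetical symptom vocabulary; a real system would map to a standard terminology.
SYMPTOM_TERMS = ["shortness of breath", "ankle swelling", "chest pain", "fatigue"]
NEGATION = re.compile(r"\b(denies|no|without)\b")
TEMPERATURE = re.compile(r"\btemp(?:erature)?\s+(\d+(?:\.\d+)?)\s*c\b")


def extract_symptoms(note: str) -> Dict[str, Optional[bool]]:
    """Return True (present), False (negated), or None (not mentioned) per term."""
    findings: Dict[str, Optional[bool]] = {term: None for term in SYMPTOM_TERMS}
    # Split on whitespace that follows a period or semicolon, so "38.4" stays intact.
    for sentence in re.split(r"(?<=[.;])\s+", note.lower()):
        for term in SYMPTOM_TERMS:
            position = sentence.find(term)
            if position == -1:
                continue
            # Crude rule: a negation cue earlier in the same sentence negates the term.
            negated = NEGATION.search(sentence[:position]) is not None
            findings[term] = not negated
    return findings


def extract_temperature_c(note: str) -> Optional[float]:
    """Pull a Celsius temperature such as 'Temp 38.4 C' out of the note, if any."""
    match = TEMPERATURE.search(note.lower())
    return float(match.group(1)) if match else None


note = "Pt reports shortness of breath and ankle swelling. Denies chest pain. Temp 38.4 C."
row = extract_symptoms(note)
temp_c = extract_temperature_c(note)
print(row, temp_c)
```

The result is one row of coded values per note, which is the form the later analytic steps need.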
And there are very specific criteria
for trustworthiness.
When I did my doctoral program and we wanted to do
qualitative research, and that was many years ago, people
were like, well, that sounds like foo foo.
Well, now there are, you know, real trustworthiness
criteria, and there are trustworthiness criteria for
data mining as well.
So when you look at, how many of you have heard
of Watson?
Yeah, so when you think about Watson, Watson was
initially tested with Jeopardy, you know?
And finally it beat human beings.
So now IBM is actually moving into how can we use
that for purposes of healthcare?
And how do we begin to harness the algorithmic
potential of Watson?
So, Watson is really an opportunity to begin to
think about big data science and do you know how they're
training it?
They're asking -- they're doing almost kind of like a
think out loud with physicians.
Like how do you make decisions?
You know, they're reviewing the literature to see what's
in the literature.
We need some nurses feeding data into Watson so that we
can get other kinds of data in addition.
But Watson uses natural language processing to then
create structured data to do the algorithms.
So when you think about another example, how many
heard of Google Flu Trends?
Yeah, so with Google Flu Trends, one of the things is
how do you mine data on the Internet?
What kinds of things are people actually searching
for that are things that are about flu?
What are the symptoms of flu?
What are the medications you take for managing the
symptoms of flu?
And what they found is that actually Google Flu Trends
could predict a flu epidemic before the CDC could.
Because it was based on patients trying to look up their
symptoms, and then based on that, they could see that
there was this trend emerging.
Now when they actually looked at who had flu, the
reported flu and the Google trends, CDC outdid Google,
but it pointed to an emerging trend that
was occurring.
And actually what we're seeing now is we're doing
some of that kind of mining of data with pharmaceutical
reports looking for adverse events.
And so the FDA has an adverse event
reporting system, and what they're finding is that as
they're looking at the combination of different
drugs that people are taking they're beginning to see
where adverse events are occurring through
combinations of different drugs that previously
weren't known.
So when you think about we do these clinical trials, we
get our drugs out on the market.
After the drug's out on the market it's like, how do
they actually work in the real population?
And I think Eric's presentation earlier with
that new graphic that just came out of Nature, that one
out of 10 or one out of 15 people actually benefits,
the question is how many people get harmed?
And how do we know what the combination of drugs is that
could actually cause harm?
So there's some really interesting stuff that's
going on with mining data and looking at combinations
to try to understand, are there things we just
don't know?
So another area is looking at novel associative diagnoses.
When I first read this I'm like, "I don't get it."
And what it is, is that we're really trying to
understand what kinds of meaningful diseases co-occur
together that we previously didn't know?
So an example is obesity and hypertension.
That's a real common one.
We know that those two go together frequently.
But how many combinations of diseases that we just don't
understand go together?
So there's a team of researchers that compared
literature mining with clinical data mining and
what they did is with this massive dataset they looked
at all the ICD9 codes in a massive dataset.
So this person has these three or five or 14
diagnoses that all co-occur together and they said,
"What do we see in the literature of what diagnoses
co-occur together?"
Because they thought that they could validate commonly
known ones which they could and they could discover new
ones that needed further investigation.
Well, what they found when they looked at that is that
there's very little overlap between diagnoses in
the clinical dataset and in the literature.
So the question is, is it that the methodology needs
to be improved?
Is it that we only know the tip of the iceberg of what
kind of things co-occur together?
Can we gain new insights about new combinations that
frequently co-occur together that can help us predict
problems that people have and try to get ahead of it?
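One simple way to look for diagnoses that co-occur more than chance would suggest is to compare how often a pair is observed with how often it would be expected if the two were independent (often called lift). The sketch below uses a handful of invented patients and a hypothetical `pair_lift` helper. It is not the method the research team used, just an illustration of the idea.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Invented patients, each with a set of diagnosis labels.
patients: List[Set[str]] = [
    {"obesity", "hypertension", "diabetes"},
    {"obesity", "hypertension"},
    {"hypertension"},
    {"asthma"},
    {"obesity", "diabetes"},
    {"asthma", "eczema"},
]


def pair_lift(records: List[Set[str]], min_count: int = 2) -> Dict[Tuple[str, str], float]:
    """Observed / expected co-occurrence for every pair seen at least min_count times."""
    n = len(records)
    single = Counter(dx for record in records for dx in record)
    pairs = Counter(
        pair for record in records for pair in combinations(sorted(record), 2)
    )
    lifts: Dict[Tuple[str, str], float] = {}
    for (a, b), observed in pairs.items():
        if observed < min_count:
            continue  # pairs seen only once give unstable, inflated ratios
        expected = single[a] * single[b] / n  # expected count if a and b were independent
        lifts[(a, b)] = observed / expected
    return lifts


lifts = pair_lift(patients)
for pair, value in sorted(lifts.items(), key=lambda item: -item[1]):
    print(pair, round(value, 2))
```

A lift above 1 means the pair shows up together more often than independence would predict; whether that is clinically meaningful still has to be judged by someone who knows the domain.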
Another example is early detection of heart failure.
So there was a study that was done, and I won't
pronounce the name on this, by this person and the team,
and what they were really trying to do is
determine whether automated analysis of encounter
notes in the electronic health record might enable
the differentiation of subjects who would
ultimately be diagnosed with heart failure.
So if you look at signs and symptoms that people are
getting, can you begin to start seeing early on that
this person's going to be moving into heart failure or
that their heart failure might actually be worsening?
So that you can anticipate and try to prevent problems
and try to make sure that
the right treatment is being done?
So they wanted to use -- they used novel tools for
text mining notes for early symptoms and then they
compared it with patients who did and did not get
heart failure.
The good news is they found that they could detect
heart failure early.
The bad news is people who didn't get heart failure
also had some of those symptoms.
So again, we're at the beginning of this kind of
science and it really needs to be refined so that we can
begin to get better specificity and sensitivity
as we do these algorithms that we're developing
for predicting.
Now visualization is another type of tool and, so as you
think about how do we understand massive amounts
of information?
So there's a lot of different tools for helping
us to be able to quickly be able to see what is going
on, and so these are just examples of visualization;
you don't need to read the details on this.
But what you can see is there was a study done by
Lee [phonetic sp] and colleagues where they were
trying to understand older adults and their patterns of
wellness from point A to eight weeks later in terms
of their wellness patterns.
But what they were really trying to do in this study
is to say, what kind of way can you visualize
holistic health?
And do you visualize holistic health and the
change in holistic health over these eight weeks by
using a stacked bar graph, you know, or one of the
other types of devices?
And then they had focus groups and they tried to
say, "What do you think about this?"
You know, "How well does that help you to process
the information?"
And so it helped them to be able to think about it --
it's really a cognitive science kind of background
of how people process information, what kind of
colors, how much contrast, what shapes and design help
people be able to process information?
So this is kind of an emerging area where we're
really trying to understand patterns related to
different phenomena.
Karen Munson for instance, one of my colleagues, has
been looking at this with public health data, and
she's looking at what are the patterns of care for
maternal child health patients?
Moms who have a lot of support needs from public
health nurses, and are there individual signatures of
nurses and how they provide care and are certain
patterns more effective, and with what subgroup of
patients are those patterns more effective?
So she's using visualization more like this stream
graphic over on the top left side here to look at
signatures of nursing practice over time.
So one of the things I find is that as we're doing data
mining, the genetic algorithms are increasing in
their accuracy and their abilities.
So if you think about the financial market, I don't
know about you, but I came back from a trip to Taiwan
one time, went to purchase something at RadioShack and
my credit card was declined.
And I'm like, "What do you mean my credit
card's declined?"
And they said, "It's declined."
And so I'd used it in Taiwan.
What I didn't know is that was an unusual pattern for
me and they happened to pick it up and they said, "Were
you in Taiwan?"
And I'm like, "Yeah, I was in Taiwan."
They said, "Okay, fine.
We'll enable your card again."
Well, it used to be that they would do a 25 percent
sample of all the transactions and be able to
pick up these abnormal patterns to try to look
for fraud.
Now they actually can process 100 percent of
transactions with fairly good accuracy.
So if they can do that with bank transactions, why can't
we do that with EHR data?
And part of it is they have nice, structured data
[laughs], you know?
Compared to what we're using.
So data mining is really about, how do you look at a
data repository, select out the type of data you want,
look at preprocessing that data, which is 80 percent of
the work, do transformation -- so creating scales or
looking at levels of granularity.
But then it uses some different kind of algorithms
and different analytic methods.
So up until I got to data mining on this graphic we're
really talking about traditional research in
many ways.
But when we get to data mining we're then looking at
all kinds of different algorithms that get run that
are semi-automated that can do a lot of the processing that we
have to do manually in traditional
statistical analysis.
And, in order to come up with results, the next step
is critical.
We can come up with lots of really weird results.
I can't remember the one that Eric showed earlier, or
maybe Patricia Grady did when she said, you know,
"Diapers and candy bars."
Or something like that.
But whatever it was, it doesn't make sense, and so
we really have to make sure that we're using our domain
knowledge in order to see, is this actually clinically
interpretable as we move forward?
So, data mining is also known as knowledge discovery
in databases.
It's automated or semi-automated processing of
data using very strong mathematical formulas,
and there are absolutely ways of being
able to look at the trustworthiness of the data.
So we use -- a lot of it is sensitivity, specificity,
recall, accuracy, precision.
There's also something called false discovery rates,
which is another way of checking out the validity of what
you're finding.
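The talk does not say which false discovery rate method is meant. One widely used option is the Benjamini-Hochberg procedure, sketched below under that assumption, with invented p-values standing in for the many patterns a data mining run might test.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Flag which findings survive when the false discovery rate is held at q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    # Find the largest rank k (1-based) whose p-value is at most (k / m) * q.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff is kept as a discovery.
    keep = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            keep[index] = True
    return keep


# Invented p-values for eight candidate patterns.
p_values = [0.041, 0.001, 0.205, 0.008, 0.039, 0.06, 0.074, 0.042]
flags = benjamini_hochberg(p_values, q=0.05)
print(flags)
```

In this example five of the eight p-values fall below 0.05 on their own, but only two survive the correction, which is the kind of check that keeps a long list of mined patterns honest.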
And there are lots of different methods, so some
of those methods are association rule learning,
there's clustering analysis, there's classification like
decision trees, and many new methods that are
emerging constantly.
So it's not like you can say data mining is just data mining.
It's like saying quantitative analysis, you know?
So it's lots of different methods of being able
to do this.
I think an example of data mining is the fusion of big
data and little babies.
So there was actually a study that was done looking
at all the sensor data in a NICU and trying to
understand who's likely to develop infections, and
by capturing continuous data from
multiple machines they were able to pick up,
24 hours earlier than the usual methods, who was
going to run into trouble and to head it off with the
NICU babies.
So, it has very practical applications.
Another example is looking at type 2 diabetes risk
assessment and really trying to understand with not just
association rules, but now we're moving into newer
methods of trying to look at time series along with
association rules and trying to see patterns over time
and how those patterns over time and the rules you can
create from the data will predict who's likely to run
into problems.
And so, some of the work that George Simon [phonetic sp]
has done with his group is really looked at survival
association rules and they substantially outperform the
Framingham score in terms of being able to look at the
development of complications.
So, in conclusion, big data are readily available.
We don't lack data.
The information infrastructure is critical
for big data analytics.
One of my colleagues I've done research with, she
said, "I just keep hoping one of these days you can just
throw it all in the pot and something will happen."
And it's like, that is not what big data analysis is about.
There are rules just like there are for qualitative
research or quantitative research.
And that the analytic methods are now
becoming mainstream.
So 10 years ago it would be really hard to get data
mining studies funded unless you went to the NSF.
Now that's getting to be more and more mainstream.
As a matter of fact, if you look in nursing journals and
you look for nurses who are doing data mining, you won't
find a lot out there yet.
So it's still just really at the beginning, but at least
we're starting to get some funding available now for
doing it.
So, one of the implications though out of this that we
really need to be thinking about is how are we training
our students, the emerging scientists.
How are we training ourselves here today?
But how are we training the emerging scientist to really
be prepared to do this kind of science of big data
analysis, and the newer methods that need to be done?
How do we think about integrating nurses into
existing interprofessional research teams?
So, I don't know about you, but how many nurses do you
know that are on CTSAs that are doing the data mining
with nursing data as part of the data warehouse?
Or on PCORI grants where they're building out, you
know, some of the signs and symptoms that nurses are
interested in, or the interventions, in addition to
the interprofessional data?
And so, it's really important that we take a
look at making sure that we're including nurse
sensitive data as part of interprofessional data and
that means that we really need to be paying attention
to the data standards, you know?
So that we are collecting consistent data in
consistent ways with consistent coding so we can
do the consistent queries to be able to really play in
the big data science arena.
So with that, I'll stop and see if you have
any questions.
I think we have one minute [laughs].
We have a question over here.
Okay, so the question is how do you find the colleagues
like in computer science who can really help you?
Well, I tell you, I was really ignorant when I started.
I actually worked with somebody from the University
of Pennsylvania the first time I did it because I
didn't know any data miners at the University of Minnesota.
And, then I got talking with colleagues who said, "Oh, do
you know so and so who knows so and so?"
And then I started actually paying attention to what's
being published at the University of Minnesota.
It turns out that Vipin Kumar, who's head of the
computer science department, is actually one of the best
internationally known computer scientists.
Actually, he and Michael Steinbach, one of my
research partners, have their own book published on
data mining for the class that my students take
along with the computer science students.
So, one, start with looking at -- if you look at some of
the publications coming out of your university, it's the
first place to start to figure out if you have
anybody around who can do data mining.
And I just didn't even know to think about that when I
first started.
So, it's a good way to start.
Part of it is paying attention to, there's a
number of -- if you go to AMIA, for instance, there's a
whole strong track of data miners that have their own
working group at AMIA.
Also, there's a lot of data mining conferences going on
and so if you just start searching for -- I mean,
personally, I would search for data mining and University
of Minnesota in Google, and that's a really fast way of
finding out who's doing that as another strategy to try
to find partners.
And they were thrilled to death, believe me, to get
hooked up with people in healthcare because they knew
that was an emerging area, big data.
They just knew that they didn't know it, and I didn't
know what they knew so together it made a
good partnership.
Okay, thank you.
