Meta Data
Meta Data is a series of interviews with the fascinating people behind open data. As long-form transcripts, they are best enjoyed in a quiet hour with your favorite hot beverage. If you consider this project worthwhile, why not share it with a friend ;).
You can find an overview of all interviews at #meta-data.
About David Gasquez
David is a data engineer from Spain. He maintains two data portals in the Web3 ecosystem: The Filecoin Data Portal and the Gitcoin Grants Data Portal. These data portals are Open Data Infomediaries that extract data from various sources, improve it, and make it available to the community in easy-to-consume formats.
In addition to working with open data, David is interested in funding models for public goods like data infrastructure, such as retroactive public goods funding.
Interview
In the following text, my (Philip’s) questions are highlighted in italics.
David’s Background
Philip: At the start, can you give me a rough overview, how you would self describe what you’re doing in the open data ecosystem?
David: Yeah, right now, what I’m doing in the open data ecosystem is focusing on the Web3 community and trying to make open pipelines with open data inside that ecosystem.
Okay. And you work right now self-employed, right? You work on your own projects and not at a company.
I am a contractor under Protocol Labs and I work on multiple other smaller projects around the ecosystem.
How did you learn the skills you need to do that work, or how was your rough path to get where you are right now?
I think it all came up pretty organically. I studied computer science, then I started working as a back end engineer doing Python tests. At the same time, I started playing with Kaggle competitions and I got very into competing for data, like literally building machine learning models for a competition. And over time, I applied, got a job as a data scientist.
And once I got there, I realized, I have no data, I can’t do data science without data. I started doing more and more data engineering. And once I had the data engineering skills, I realized, oh, these Web3 ecosystem folks are not doing data the same way the rest of the world is doing. And it can have a large impact, that’s how it all came up very organically. And the data skills, mostly by failing and trying a lot of things.
Basically learning by doing?
Yes, mostly learning by doing.
Then you have like a computer science / software engineering background and you kind of moved into the data side of things from programming?
Yes. Initially data science and then more and more data engineering and then more and more data engineering related to Web3.
Open Data in the Web3 Ecosystem
You already said kind of like the Web3 folks do data different from the rest of the world. Could you give a very rough idea how the open data works in the Web3 ecosystem? What kind of data is there?
Sure. So the data layer on the Web3 ecosystem… Web3 is a very weird word, because I think a lot of people will be afraid of that, or not like that word. But basically, blockchains are open by design. And that means anyone can see virtually all the things that happen inside a chain, and at the same time, they are permissionless and immutable.
These are three very sweet properties for open data to be there, because you can get the data you want, you don’t have to ask for permissions, and you know that data won’t change because it’s literally a blockchain. So when I was exploring the ecosystem, the data layer seems very odd, and most of it was centralized.
So what I did was starting to work on trying to use modern tools and modern approaches to data with this kind of data, which is basically chain data coming from people running the nodes. That you can ask: Give me the transaction on this block, or the following block, or traces or things like that.
I think there are multiple things to unpack there a tiny bit. First, I have to admit I have probably the same preconceptions you encounter that makes you say, Web3 is not the best word… Because it is linked with a lot of hype and scams, I would say the whole crypto ecosystem. But you approach it more from the data direction, like you work with data, and there’s also funding models building on it, right?
Yes. I would say the finance part is the part that I’m not interested in. I’m more interested in the protocol levels, the things that it enables more experimentation, open by design, things like that than on the financial side of things.
And then on the second thing - which I’ve never really thought about, but it’s obviously true - that basically everything that is on a blockchain is, by definition, open data, in the sense that you can use it. You are free to share it, I guess? Can you link it to how people might understand an open data license? Would that apply to all the data that is on a chain and is that explicit? Maybe even, do these coins say: our chain data is under this and this license?
To be honest, I’m not sure, but I think they are all open. And I don’t think services like Dune, which is basically an indexer, ask for licenses or anything.
Basically, because you can run the software and generate that data set on your laptop or on your computer without having to ask for any kind of permission. So the data you’re generating is in everyone’s computer.
Okay, true. But it wouldn’t necessarily mean you could use it, for example, commercially, right? But…
But I think you can. I’m not sure.
Then there is an understanding I would say, in the community, that it’s okay?
Yes, I will have to search for that, but I’m pretty sure you can.
I think your data you’re working with, if I looked this up correctly, it’s under the CC0 license, right? Is there a reason why you chose to publish with that license?
It was the most straightforward one. I looked at a couple of licenses, and that was the easiest one and the most permissive one.
Then let’s talk a bit about the data you are actually working with. Can you give an overview what kind of data structures you’re working with? I think the blockchain, for example, as a data source might not be that familiar. Like how can I access it? What kind of file types come out of it? What kind of programming languages or something you use to get the data, and then what is in the data?
Okay, so that’s a great question. Both in Filecoin data portal and the Gitcoin data portal, that are the open data portals I maintain that are close to Web3. Both rely on lots of different sources. They are basically like a classic data pipeline that you will find, or data platform, that you can find inside a company.
But instead of doing the business logic for a company, you’re doing it for a project or community. In this case, for Filecoin data portal, I’m getting data from the chain itself. And you can find remote people that are exposing their nodes online or services that you can rent to get access to a node and then ask the node for all the data you need.
How would you do this? Like a REST API, or how do you communicate with a node?
Yes, it can be a REST API. It can be a library that you use in Python or R or whatever. It could be folks exposing an SQL endpoint for you to run queries in there.
It depends on the level of trust you want. If you don’t trust anyone, you will run the software locally and use its JSON-RPC endpoint to get the data. If you trust the remote node, you will do an API call to the remote node. If you trust the people that have built the pipeline, you will do the query on the SQL level. So it depends on how far you want to go.
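As a rough illustration of the "run the software locally and ask the node" option, here is a minimal sketch of a JSON-RPC 2.0 exchange. It is not taken from David's portals. The URL and port are hypothetical, the method `eth_getBlockByNumber` is the Ethereum-style call (Filecoin nodes expose their own method names), and the reply is a made-up sample so the sketch runs without a node.

```python
import json

# Hypothetical endpoint: a node running on your own machine (assumption).
LOCAL_NODE_URL = "http://localhost:8545"


def build_block_request(block_number: int, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 request body asking for one block by number."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_getBlockByNumber",
        # Block number as a hex string; True asks for full transaction objects.
        "params": [hex(block_number), True],
    }
    return json.dumps(payload)


def parse_block_response(body: str) -> dict:
    """Extract the block from a JSON-RPC response, raising on an error reply."""
    reply = json.loads(body)
    if "error" in reply:
        raise RuntimeError(reply["error"].get("message", "unknown JSON-RPC error"))
    return reply["result"]


# Made-up sample reply, shaped like a node's answer, so the sketch runs offline.
SAMPLE_REPLY = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"number": "0x10", "hash": "0xabc", "transactions": []},
})

request_body = build_block_request(16)
block = parse_block_response(SAMPLE_REPLY)
print(request_body)
print(int(block["number"], 16), len(block["transactions"]))
```

In practice the request body would be sent as an HTTP POST to the node, local or remote. The only difference between the first two trust levels is whose machine answers.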
And then what kind of data do you get out of such a call?
Mostly JSON.
So you get basically JSON about the transactions that are on the chain?
Yes, transactions, blocks, any of the primitives that are common to blockchains.
I think that is kind of where it goes into domain specific data types. Because transactions and blocks and so on… I think are very familiar to people in the crypto space, but maybe not outside it. Can you give an overview of what that is?
Okay, I can try. Basically, a blockchain is a series of blocks. So every x amount of time a new block will be produced. And in that block you have a bunch of transactions. And those transactions change the state of the entire blockchain.
So you can get the information about the block, things like the hash or the number of transactions it had, or also get the transaction information. What did they do? Someone sent money to someone else. Someone else invoked a contract, things like that.
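To make the block and transaction idea concrete, here is a toy model, assuming the simplest possible state (account balances) and made-up names. Real chains carry much richer state and reject invalid transactions at validation time; this sketch simply skips them.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    sender: str
    recipient: str
    amount: int


@dataclass
class Block:
    number: int
    transactions: list[Transaction] = field(default_factory=list)


def apply_block(balances: dict[str, int], block: Block) -> dict[str, int]:
    """Return the new state after applying every transaction in the block."""
    new_state = dict(balances)
    for tx in block.transactions:
        if new_state.get(tx.sender, 0) < tx.amount:
            continue  # toy rule: skip transfers the sender cannot afford
        new_state[tx.sender] -= tx.amount
        new_state[tx.recipient] = new_state.get(tx.recipient, 0) + tx.amount
    return new_state


# The chain is an ordered series of blocks; replaying them rebuilds the state.
state = {"alice": 10, "bob": 0}
chain = [
    Block(1, [Transaction("alice", "bob", 4)]),
    Block(2, [Transaction("bob", "carol", 1), Transaction("alice", "bob", 100)]),
]
for block in chain:
    state = apply_block(state, block)
print(state)
```

Because anyone can replay the same blocks in the same order, anyone can arrive at the same state without asking for permission.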
And in this sense it’s not only something like money? I think that is the most obvious thing people might be thinking of for crypto. But you can have transactions that change other states, like the contracts, for example, being executed.
Yes, contracts or in the Filecoin use case, it could be people making deals for storage. You are sending a deal to a storage provider saying, I want you to store this, and I will pay this amount.
And then you rent out storage using contracts?
Yes. Or in Gitcoin it will be, I donated x amount to this project and that’s published via transaction that anyone can create.
This is for funding of open source projects?
Yes.
Okay, we probably should have covered like the use cases of the portals at the start (laughs), but…
Exactly, it depends. Even though the schemas might be similar between projects, you get so many applications and so many different ways of looking at the data. That’s the thing that I’m trying to do with data portals. It’s like serving a community, and it’s looking at some kind of data.
In the Filecoin data portal, for example, I’m getting chain data, but the most important things I’m getting from other open APIs that I’ve been able to find around.
So one big data source is on chain blockchain data, but you also link other data from other sources with it, right? What kind of data do you use there?
Yes. I will link things like geolocation, to know where service providers are. Reputation models: there are some companies that have built reputation scores for service providers by looking at other metrics.
So I will ingest that and try to create like a unified view for clients, providers and overall all users in the Filecoin ecosystem. I will also add things from other APIs that people are running tests on clients or storage providers. If I find something that I can link and connect to my entities, I will just ingest that.
I saw you also publish in multiple different formats, right? Can you give a rough overview what you publish into?
That depends on the community. For the Filecoin data portal, what I’m doing is publishing all the datasets every day as Parquet files on R2. That way, anyone who wants to check the data or relies on that data just has to download that Parquet file and start using it. And for the Gitcoin one, I’m also publishing to IPFS.
And those are the main places where I’m publishing. R2 on Filecoin, IPFS on Gitcoin. But then the Filecoin data portal is also being published to Google Sheets, some of the datasets to Dune… basically it depends on the community. I will customize the delivery, so the community can access that data in a smooth way.
The crypto / Web3 space would strike me as a space where the people are probably more technical, right?
Yes, generally more technical. And the people that are looking at the data are probably the same developers or very close to the development. So they know how to ingest these sorts of things.
So you basically your customers, so to speak, are the people that then can actually programmatically work with the data, they can access it and so on?
Yeah, most of it, yes. And for those who can’t or don’t have the time, I’m also publishing some of the data sets to Google Sheets, so they can use “IMPORTRANGE” from their own Google Sheet and start playing with that. At least the important thing is getting the data, whatever that means for them.
Software Stack of the Open Data Portals
Can you talk a bit about the technical setup of the portals? How does it technically work?
Initially, I started working on a classic data pipeline or data platform using a data warehouse like BigQuery, things that are paid and closed source. But then I started moving more and more things… to try to continue the ethos of Web3, or some of the ethos I think Web3 represents, of keeping things open and modular, permissionless and all of that.
So right now the stack is simple, in the sense that you can run it on your laptop, and it runs on GitHub Actions. It consists of a data warehouse on your laptop, DuckDB, then pipelines with Python doing the transformations, and then the results are published to R2 or a similar object store. Those are the main tools.
So the source is basically a DuckDB on your laptop or in GitHub Actions or something. And then you work with the data in there using Python?
Yeah. The pipeline starts with a few Python scripts, orchestrated by Dagster, that read all the different API endpoints or data sources and save that to a DuckDB database as a bunch of tables. Then you have a bunch of SQL and Python scripts that transform that data. And then once you have the final tables, they get published somewhere.
And right now the actual production portals, they run on GitHub using GitHub Actions.
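The ingest, transform, publish shape described here can be sketched in a few lines. This is only an illustration under loud assumptions: it swaps in Python's built-in `sqlite3` for DuckDB and CSV for Parquet so it runs with the standard library alone, it leaves out Dagster and calls the steps in order, and the table names and rows are made up.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Made-up rows standing in for what an API or a node would return.
RAW_DEALS = [
    ("provider_a", "2024-01-01", 120),
    ("provider_a", "2024-01-02", 80),
    ("provider_b", "2024-01-01", 50),
]


def ingest(con: sqlite3.Connection) -> None:
    """Load raw source rows into a table (one table per source)."""
    con.execute("CREATE TABLE raw_deals (provider TEXT, day TEXT, size_gib INTEGER)")
    con.executemany("INSERT INTO raw_deals VALUES (?, ?, ?)", RAW_DEALS)


def transform(con: sqlite3.Connection) -> None:
    """Derive a final table from the raw one with plain SQL."""
    con.execute(
        """
        CREATE TABLE provider_totals AS
        SELECT provider, COUNT(*) AS deals, SUM(size_gib) AS total_gib
        FROM raw_deals
        GROUP BY provider
        """
    )


def publish(con: sqlite3.Connection, out_dir: Path) -> Path:
    """Export the final table to a file that could be uploaded to an object store."""
    out_path = out_dir / "provider_totals.csv"
    cursor = con.execute(
        "SELECT provider, deals, total_gib FROM provider_totals ORDER BY provider"
    )
    with out_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column[0] for column in cursor.description])
        writer.writerows(cursor)
    return out_path


con = sqlite3.connect(":memory:")
ingest(con)
transform(con)
published = publish(con, Path(tempfile.mkdtemp()))
with published.open(newline="") as handle:
    rows = list(csv.reader(handle))
print(rows)
```

Since everything lives in one embedded database file and a handful of scripts, the same code can run on a laptop or inside a CI job, which is the property David highlights.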
Yes. And that’s because it’s a way for me to show this is being run on GitHub, you can see the logs, you can see all the things. So you know the code you see is what has been running, and its output is what is on R2.
Okay, for you, it’s a way to prove that the data wasn’t changed or how it was changed in between and stuff like this?
Yes.
Money for Public Goods, Retroactive Public Good Funding
Okay, then we got to a point where you have very interesting information about funding models, for example, for this kind of thing. Because we spoke before, and we both kind of agree that a large part of any long-running project has to be some form of incentive or some form of funding for the project, not only working for free and at some point maybe stopping. Can you talk more about how you fund your projects?
Yeah, definitely. The biggest thing with these sorts of data pipelines is that they change continuously. And maintaining a data pipeline is not something fun for most folks, and it also requires a lot of time. Even small changes might affect like ten different downstream assets or pipelines, which you then have to spend like one week updating.
So finding the resources and money to pay folks to maintain that is important. And in the classic academia or Web2 world of open data, I don’t feel that’s something very well rewarded.
Nonetheless, in Web3, what I found is, thanks to the novel ways of funding and the quick experimentation that the ecosystem has, there is something called retroactive public good funding which has been working well for me, and I can get into that if you want.
Yeah, I would love to get into that.
So basically, retroactive public good funding (retro PGF) is like a grant you get after you’ve done the thing. I will use Gitcoin as an example: every few months they run rounds of retro PGF, and they will gather some money themselves, get money from other companies, and then open up a round where you can submit your application, and they will list all the applications.
Then people, other users, will vote in terms of just sending some coins to those projects. And the common pool of money, it will then be split among all the applicants depending on the number of votes you get.
The interesting part there is: You’re paying someone for the work they already did. And it’s much easier to evaluate the impact of something that’s done and being used than something that will be done in the future. It’s probably much riskier because you have at the beginning… when I started the Gitcoin data portal, I didn’t have in mind getting paid for that at all. But once people started to use it and more folks even started to collaborate, I applied, and I got very good results, because it was having an impact and people were relying on it. So it’s a very nice way of getting rewarded.
That goes into the direction that I would have immediately thought of: There seems to be a bootstrapping problem with that, right? Because if you always reward work that was already done, you will, for the first iteration, need to work for free, basically without a promise of having something in the future?
Yes. These kinds of things work to incentivize public good funding. Kickstarting them might be another problem that I don’t know how to solve. For me, in the Filecoin data portal use case, I got paid to start that. So it’s like: I start that, now it’s a public good. I know I can also get rewarded by the retro PGFs there.
So you got paid to start it from somewhere else, not from the retroactive funding?
Yes.
Okay. So to explore a bit more of this: You said, companies give money and the community… into the shared pool. Are these companies also in the ecosystem of this coin or outside companies?
That depends on the retro PGF style, because it can be run in many, many ways.
You can run things in like very closed way, where there are no votes from the people and your company runs a round, you select the people that are going to evaluate the projects and then give money back.
Or you can run things, for example, you run a blockchain, you get some rewards from the fees on the messages and then use that shared pool to reward smart contracts by their usage.
You can like customize how you run things. The ones running Gitcoin are, like I mentioned before, there is a shared pool of money that comes from Gitcoin and other contributors and then that gets distributed among all the applicants depending on the number of votes and a specific formula they’re using.
And how are the votes weighted? Is it always one vote?
No, they do like quadratic funding. So it’s a different formula that incentivizes getting more votes. If you’re voting for my project with $1,000, I might not get as much money as if my project gets 500 votes for $1. Because 500 votes mean there’s more people that are using my project than one big person.
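The $1,000 versus 500 × $1 comparison can be checked with a simplified version of the textbook quadratic funding rule: score each project as the square of the sum of the square roots of its contributions, then split the matching pool in proportion to those scores. This is a sketch only. The pool size and project names are made up, and the formula Gitcoin actually runs includes further adjustments not shown here.

```python
from math import sqrt


def quadratic_scores(contributions: dict[str, list[float]]) -> dict[str, float]:
    """Score each project as (sum of square roots of its contributions) squared."""
    return {
        project: sum(sqrt(amount) for amount in amounts) ** 2
        for project, amounts in contributions.items()
    }


def split_matching_pool(contributions: dict[str, list[float]], pool: float) -> dict[str, float]:
    """Divide the matching pool in proportion to each project's quadratic score."""
    scores = quadratic_scores(contributions)
    total = sum(scores.values())
    if total == 0:
        return {project: 0.0 for project in scores}
    return {project: pool * score / total for project, score in scores.items()}


# Hypothetical round: one large donor versus many small donors.
round_contributions = {
    "one_big_donor": [1000.0],        # a single $1,000 contribution
    "many_small_donors": [1.0] * 500,  # 500 contributions of $1 each
}
scores = quadratic_scores(round_contributions)
allocations = split_matching_pool(round_contributions, pool=10_000.0)
for project, amount in allocations.items():
    print(f"{project}: {amount:.2f}")
```

Under this rule the project with 500 one-dollar donors scores 250,000 against roughly 1,000 for the single large donor, so it takes almost the entire matching pool even though it raised less money directly.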
Okay. But there comes in a bit of the financial nature of the crypto ecosystem, right? That your votes are basically weighted by some dollar amount you’re willing to invest. But there’s an offset, if you have a lot of votes, it’s also considered very important.
Yes. And this is just this implementation of retro PGF. You could run things by voting in a Google spreadsheet and then distributing the money from there.
So the larger idea in general is you build up this fund for existing projects, you have some form of community voting, and then you retroactively distribute the fund over the projects that applied, right?
Yes. I was going to mention there are new projects that are evolving, things like having a chain of funding. So for example, in Filecoin data portal, for any dollar that project gets, I’m giving, I think it’s 80% to downstream projects.
So if at any point projects like DuckDB or Dagster create an Ethereum account and link it to this new service or tool, they will get 80% of what I’m getting. So you can distribute upstream the money, you try to push upstream, all the things.
There is this famous image of the one maintained project in open source keeping up the whole world. (David: Yes, the xkcd (laughs) 1.) And then that project might get rewarded if it was in this ecosystem by all the projects that rely on it, which seems very fair, right?
That’s the idea, yes. And that also just having public goods inside an ecosystem makes that ecosystem so much better. So figuring out a way to reward those public goods is the most important thing.
And that’s why I’m super interested in these kinds of things, because there are blockchains experimenting with this right now. Probably all of them are doing retro PGF in one way or another.
And how does it practically work? I’m coming more from the academic side, so I am aware of grants, but I think one of the big differences is the amount of time you need to spend actively applying for grants versus the time you have to work. Can you guide me through how would you write a grant, how much of your time goes into that and so on?
Yes, for sure, it is probably faster. So the main thing, you go to Gitcoin or to whoever is running a grant, and it will have a form. And that form is probably done in five minutes.
What I did was to copy the README of my project, add some impact I already had, showcasing I’ve done this or this other thing that helped this group do this other thing. A few examples, and then you link your wallet and that’s it.
Since it’s for Gitcoin, for example, it’s permissionless. You can apply with whatever you want. You apply, you might need to get approved. You probably will if you’ve done something related with that, and you’re in. So in a matter of one day you can start.
And you already have stuff to show anyway, right? Because it’s retroactive.
Yes.
Is this also the idea that people should evaluate your project on what you did and not necessarily on what you plan to do? Like you could apply and say, I plan to stop, but for the last six months I did this, so I would like to have a last round of funding. Or do you need to tell people what you plan to do in the future?
At the end of the day, you’re trying to convince users to vote for you. If you’re saying I’m going to do this, perhaps you’re going to get more votes. But yeah, it depends.
From the community spirit, is this something that rewards people that work on good projects but might not be that good at advertising or people that are very good at advertising but not necessarily about doing work? Because, if I compare it with something like Kickstarter for example, a large part of this would be how good you are at marketing yourself. How would you say the community spirit is there?
Hard for me to say with only one or two projects done, but I will say I’m not good at marketing, and I’ve done very well in their rounds without any kind of advertising. I haven’t been pushing this far or making pretty charts or pretty graphics. So in this case, I think since the community is smaller and more technical, you are not as prone to fighting with marketing gimmicks or things like that.
I’m not entirely sure, but in my case, I didn’t spend a lot of time in marketing. Just put the projects, share what I’ve done and since people were already using it, it was natural for them to vote more than if I had spent more time in the marketing. I don’t know how could I do that even.
Okay. And this you probably don’t have a clear answer to, but just speculating: Do you think this is a funding model that would also work outside the Web3 ecosystem? Let’s say for an open data portal for a country, for example. Would you think you could for example fund that using something similar?
I hope so. I’m not sure if there is any physical, real world company or organization trying this out, but in theory, I don’t think there’s anything that might prevent that working.
I am not aware of any experiment… I’m not sure. I think the Linux foundation might be trying or will try something like this. I’m not sure if that’s true.
Okay. Sometimes you see around here in Paris a participatory budget, I think. And they say: This park is maintained by your participatory budget, it sounds like you can vote for it. Maybe it’s a way to have the public funding of public goods like parks and so on as well, using this 2.
Yeah, I wouldn’t be surprised if the first time this idea was tried or developed wasn’t in Web3. What’s interesting here is the rapid amount of experimentation, every time Gitcoin runs a new round, they apply the learnings from the previous one. And so if it’s round 20, you have learned from 20 past historical times, like what works, what doesn’t work, how to surface projects, how to measure impact, which is the tricky part here.
Do you have a rough idea because you have participated in a few rounds in what works? What are the top three tips for running a good round?
Again, from my experience, and my experience is biased because the data side of things, there are not a lot of folks doing open data in Web3. So providing a community with data both in Gitcoin and in the Filecoin ecosystem was very straightforward and something like… it felt like product market fit from day zero. So you start publishing some data sets and people start using them, asking for more data. It was very organic.
What are most projects doing if the data providing part is relatively small?
Probably it’s more things like wallets or smart contracts or applications, DeFi applications, things like that.
So like using the coin for something or providing tooling for the coin itself?
Or interfaces, block explorers, things like that.
Earning $10K / Year With Open Data Portals
And lastly, and I am totally happy to cut this out, so feel free to not say anything about it. Are you open to sharing a rough idea about what this means in the income sense? Like is it hobby money for ramen? Is it self-employment, engineering salary money? Are you getting rich off it? Like what is the dimension of…
Sure, yes, I can share because it’s not like I can hide that information anyway.
Ah yeah I was going to say… It’s actually probably open, right?
Yes. So I don’t think this is accurate. I might have received a bit more, but this is the checker 3.
So this is roughly $1,900 from 1090 people?
Yes. And then the pools, I got like around $3,500, although I think I got twice that. Like this is not counting the first two rounds. So I would say it’s like twice that in one year or something like that.
Okay, so this is basically your funding page you linked. It’s around $1,900 from people directly. And then it roughly gets doubled from matching funds, matching pools, it says here. So in general it’s like $5,500 roughly for the year, right?
Yes, probably. It is probably twice that. This is not covering all the applications.
Okay. So we say very roughly $10,000 for the year.
Yeah, probably around $8,000. And this project gets less than 1 hour of maintenance per month from me.
Yeah, I wanted to say for open source work, that is quite a large amount of money. I would assume there’s a lot of open source projects that get much less funding because they rely largely on donations, which I think is much harder to incentivize.
I guess that’s why Gitcoin does the rounds in very discrete periods. It’s hard to donate organically, but if you’re doing a push, like there is this round, new round, and then you see everyone sharing their… I donated to this person. I’ve donated this. So you create that environment where folks are encouraged to donate, and I will go and just also spend some time. Okay, I like this project…
It’s also a nice way to highlight who has worked on things in the last time since the last round, right?
But I will say it will be like around $8,000 for the Gitcoin and I think $5,000 for the Filecoin one.
Where To Learn More and Recommendations
At the end, I would move on to just some final questions. One of them is where can people find out more about you or your projects?
Yes, I’m active on Twitter / X or whatever https://x.com/davidgasquez 4. And then I have a website that’s https://davidgasquez.com/ 5.
Do you think I have missed any questions about this I should have asked? Do you think there’s a topic that is important to know that I didn’t cover with this?
No, I think we covered most of the interesting things, like why I think it’s interesting to work on the data side of Web3, the ethos and values of the pipelines, the projects I’ve been running, and retro PGFs. So those are the three things where I can share and perhaps help people the most.
To finish, some short recommendations. Start with something like resources for learning what you’re doing. Do you have any, you know, books or podcasts or something you can share that you consider interesting in this space? Top three.
Yes. For data, there is a great book, Designing Data-Intensive Applications. It gives you a very good overview of working with data and all the issues, things like that. That’s for data, mostly data engineering 6.
Then in data science, my recommendation and what worked for me that might not work for everyone is just do things on Kaggle. You can join a competition, start tinkering with machine learning models, and you might not get to first, because that’s super competitive. But if you enjoy that, you will learn lots of things and people will share their solutions. It’s a very organic way of learning.
And then around Web3, I don’t have any podcasts or recommendations there. It also depends on what your interests are, whether you’re more interested in finance or in protocols. At the protocol level, I maintain a handbook with some resources that might help 7.
Do you have a recommendation for a few people to maybe interview in this context as well? So people that are working in the open data ecosystem that are interesting to talk to.
So there is Carl Cervone 8. They are working on Open Source Observer, which is similar to the Filecoin data portal or the Gitcoin data portal. They’re working on a data portal, and it’s very awesome, both technically and in how they are working.
A data portal on impact around… right now, probably around Web3, but they are very, very open to other things. So they will have Ethereum, Optimism, a bunch of chains, a lot of meta information about those chains. A lot of models, like SQL models that derive useful stuff from those things, and they’re doing everything in the open.
And then there is also folks from Our World In Data that are working on this pipeline 9.
You will have the Our World In Data folks, who are grabbing data from lots of tricky places like PDFs and creating very high quality country level data sets, and then the Open Source Observer folks, who are working on a massive data platform to work out how to measure the impact of open source projects.
Footnotes
2. I was wrong on this: the Paris Budget Participatif does not retroactively reward projects but distributes money for future projects. See Paris Budget Participatif.
9. The compute graph for Our World in Data’s data processing.