United Vote NYC Day 5
Goal?
Build united.vote/nyc.
What’s united.vote?
See united.vote/sf.
Oh united.vote/sf is broken? They’re rebuilding the new version of liquid.vote?
And it’s Day 5.
Liquid Vote is now United Vote.
Everything always takes a little longer than you’d expect. Starting is easy, making progress is fun, finishing is hard.
In the spirit of finishing, here’s a pearl of wisdom from @ergray’s cleanup of his PR:
The goal of cleaning up a pull request is for the pull request to clearly solve one issue, while introducing as few new issues as possible. In essence, change the existing codebase as little as possible.
Your PR should be experienced as something like “in as few lines, of clearly readable and understandable code, that doesn’t affect any other functionality of the app, I offer a solution to problem X. What do you think?”
And Eric’s PR is just about there.
Changes
A lot of good things:
- it’s https://united.vote now
- the PureScript API has been updated, and responds with more bills per request
- the schema for the API backing united.vote/nyc has been updated with uid
- the SYNC_BILLS reducer has been fixed to correctly update the /sf bills state
- pathing between /sf and /nyc has been narrowed down to NextAgendaScreen
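For a sense of what that reducer fix involves, here’s a minimal Redux-style sketch; the state shape, action fields, and keying by city are my assumptions, not the actual united.vote reducer:

// Minimal Redux-style sketch of a SYNC_BILLS case. The state shape and
// action fields are assumptions, not the actual united.vote reducer.
function billsReducer(state, action) {
  state = state || { sf: [], nyc: [] }
  switch (action.type) {
    case 'SYNC_BILLS': {
      // key the synced bills by city so an /sf sync can't clobber /nyc
      const next = Object.assign({}, state)
      next[action.city] = action.bills
      return next
    }
    default:
      return state
  }
}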
Next Steps
Let’s make ourselves some more problems. How do we gracefully handle more local legislature APIs? What if we add /oak, or /santa-cruz, or /seattle, or /la?
I suppose the first step towards that end is to find faster ways to build and maintain them. Then address them on the client side.
Which means refactoring my scraper. Which I’m really excited about.
What’s the current state, and what are the refactor goals? There’s a lot to abstract.
Current state:
- hard-coded URLs in a single variable for each agenda
- separate runs per agenda
- scraper doesn’t check additional bills against past bills; it just adds them
- no easy way to cleanly wipe the seed.json file if needed
- deployment takes 4-5 commands
- no automation of new agendas as they are released
- query selectors have to be changed manually
- target seed has to be changed manually
Refactor goals:
- allow scraper to take an object containing urls, query selectors, target seed.json
- check additional bills against past bills before adding them
- create a wipe script to reset the production seed.json
- automate deploy process if possible
- test refactor on oakland city council agenda
Alright, sounds good. Let’s start by refactoring our NYC scraper to accommodate these changes. Here’s our first shot at an options hash for the current scraper.
const nycOptions = {
  agendaUrls: [
    "http://legistar.council.nyc.gov/MeetingDetail.aspx?ID=565756&GUID=2DE3E475-8E5B-4B27-9A8C-58C830156A1B",
    "http://legistar.council.nyc.gov/MeetingDetail.aspx?ID=563540&GUID=26A3BD82-86F3-47FE-A13E-AAC05219B54E"
  ],
  agendaDataSelectors: {
    date: "#ctl00_ContentPlaceHolder1_lblDate"
  },
  billDataSelectors: {
    bill_id: "#ctl00_ContentPlaceHolder1_lblFile2",
    item_number: "",
    title: "#ctl00_ContentPlaceHolder1_lblName2",
    text: "#ctl00_ContentPlaceHolder1_lblTitle2",
    sponsors: "#ctl00_ContentPlaceHolder1_lblSponsors2",
    fiscal_impact: "",
    status_log: "",
    question: "",
    date: "#ctl00_ContentPlaceHolder1_lblDate",
    source_doc: "",
    uid: ""
  },
  billDataPlaceholders: {
    bill_id: "",
    item_number: "id",
    title: "",
    text: "",
    sponsors: "",
    fiscal_impact: "None",
    status_log: [{}],
    question: "A motion was made that this Introduction be Approved by Council approved by Roll Call",
    date: "",
    source_doc: null,
    uid: "billId"
  },
  outputPaths: {
    seed: "../nyc-api/config/seed.json",
    local: "./agenda.json"
  }
}
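To make the selector/placeholder split concrete, here’s a hypothetical sketch of extracting one bill’s fields; extractBill and the cheerio-style $ are my own illustration, not the actual scraper code:

function extractBill($, options) {
  // $ is a loaded cheerio document for one bill page (assumption)
  const bill = {}
  Object.keys(options.billDataSelectors).forEach(function (field) {
    const selector = options.billDataSelectors[field]
    // scrape the field when a selector is configured,
    // otherwise fall back to its placeholder value
    bill[field] = selector
      ? $(selector).text().trim()
      : options.billDataPlaceholders[field]
  })
  return bill
}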
Now I want to test this, but doing that requires either the wipe script or checking bills from the current scrape against past bills. The better option is definitely to check the current scrape against the existing seed bills. So let’s test that out a little bit. I want to take the current seed array and compare every element in it to every element in the newly scraped array.
The current seed is about 200 elements, and a new scrape might be 100 elements, so we’re talking 200 x 100, or 20,000 operations. Each agenda will probably stay around 100 elements, and the total body of agendas might grow to 1,000 to 10,000 elements (10 to 100 agendas). So at the high end, every scrape might require a check of 10,000 x 100 elements, or 1,000,000 checks. What’s that? Quadratic time complexity. Works for now, but not forever. But with some help, deduping isn’t actually that tricky; we can use _.uniqBy, which keeps the first occurrence of each element for a given property like uid, presumably in linear time.
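A minimal sketch of that check, assuming every bill object carries a uid (mergeBills is my name for it, not the scraper’s):

const _ = require('lodash')

// Merge freshly scraped bills into the existing seed. _.uniqBy keeps
// the first occurrence of each uid, so existing seed bills win over
// re-scraped duplicates.
function mergeBills(seedBills, scrapedBills) {
  return _.uniqBy(seedBills.concat(scrapedBills), 'uid')
}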
Turns out there are a bunch of duplicates, because there are multiple actions on a single bill in a single agenda, and there are actions on a prior bill on consecutive agendas. This is all fine for now, because what we want in our bill data is bills, not actions. The action history can go into the status_log JSON field. We can do that later.
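When we do get to it, the fold might look something like this hypothetical sketch; foldActions and its field choices are my guesses, not the scraper’s behavior:

const _ = require('lodash')

// Hypothetical: collapse duplicate scrape rows into one bill per uid,
// folding the extra rows' actions into status_log. Not the current
// scraper behavior, which simply dedupes.
function foldActions(scrapedRows) {
  return _.map(_.groupBy(scrapedRows, 'uid'), function (rows) {
    return Object.assign({}, rows[0], {
      status_log: rows.map(function (row) {
        return { question: row.question, date: row.date }
      })
    })
  })
}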
What else? Created a little wipe script. Real handy for when I had to update the production seed data. Done.
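Something like this minimal sketch; wipe.js is a hypothetical name, and the empty-array reset is my guess at what wiping means here:

// wipe.js (hypothetical name): reset the production seed file to an
// empty array. Assumes "wipe" means emptying the file; the path comes
// from outputPaths.seed above.
const fs = require('fs')

const SEED_PATH = '../nyc-api/config/seed.json'

fs.writeFileSync(SEED_PATH, JSON.stringify([]))
console.log('wiped ' + SEED_PATH)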
I also took a first shot at specifying the options hash for the scraper; it currently takes the shape shown above.
And deployment automation amounts to:
git add .
git commit
git push heroku master
heroku run nodal db:compose
Followed by visual inspection.
So that’s just:
"scripts": {
"test": "mocha ./test/runner.js",
"start": "node cluster.js",
"worker": "node worker.js",
"deploy": "git add .; git commit; git push heroku master; heroku run nodal db:compose"
},
Works like a charm.
And last but not least…
test refactor on oakland city council agenda
What stands out to me from looking at the Oakland city council agendas is that they take a slightly different form from NYC and SF:
https://oakland.legistar.com/MeetingDetail.aspx?ID=562400&GUID=DF762A8E-4D55-4E79-814D-131985D28C12&Options=info&Search=
https://oakland.legistar.com/MeetingDetail.aspx?ID=568477&GUID=28B72F66-522E-4C39-B005-599E0D41DDB3&Options=info&Search=
https://oakland.legistar.com/MeetingDetail.aspx?ID=564804&GUID=5B718CAF-E7C8-4B65-AAEB-D11850B8EBB9&Options=info&Search=
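To see how far the options hash carries, here’s a hypothetical oaklandOptions stub; the selectors are left blank and the oak-api seed path is a placeholder, since I haven’t inspected Oakland’s markup yet:

// Hypothetical stub reusing the same options shape; selectors and the
// seed path are placeholders until Oakland's markup is inspected.
const oaklandOptions = {
  agendaUrls: [
    "https://oakland.legistar.com/MeetingDetail.aspx?ID=562400&GUID=DF762A8E-4D55-4E79-814D-131985D28C12&Options=info&Search="
  ],
  agendaDataSelectors: {
    date: ""
  },
  billDataSelectors: {
    bill_id: "",
    title: "",
    text: "",
    sponsors: "",
    date: ""
  },
  billDataPlaceholders: {
    fiscal_impact: "None",
    status_log: [{}]
  },
  outputPaths: {
    seed: "../oak-api/config/seed.json", // placeholder path
    local: "./agenda.json"
  }
}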
So while the options hash might be consistent, I imagine there’s going to be some fine-tuning of the actual scraping to do. I’ll give that a shot tomorrow. I think this is what steady good work looks like.