However, nothing is forever, especially on the Internet. Over time, the board started losing components, which prompted more in-depth maintenance and code revision. So I am writing a series of update posts to list and describe that maintenance.
One of the components that bit the dust lately was the departure board for the buses at our nearby bus stop (see my earlier post on how I did it, I will be referencing it often.). Contrary to my fear that my brute-force string parsing of a JSON in PHP would give me debugging headaches, it actually proved remarkably robust. Something totally different happened: the query for the JSON was simply giving me Error 404. Apparently, the company just scrapped the service and seems to have migrated to a wholly different platform. Well, as I said, nothing on the Internet is forever, so I needed to re-implement the web scraping for the next bus departure times.
I will forgo the description of a 3-hour long troubleshooting spent in Chrome's web inspector feature (mostly because it happened too long ago for me to recover the details of it); suffice it to say that I ended up finding a new request in this format:
https://www.triplinx.ca/en/NextDeparture/NearByNextDeparture?stopId=(6-digit stop ID)
resulting in the following:
Sweet. A simple Inspect shows a clear document structure:
I wasn't inclined to modify my "bake the page directly in PHP" method, but I could now afford to parse the document somewhat more intelligently, like so:
function getElementsByClass(&$parentNode, $tagName, $className) {
    $nodes=array();
    $childNodeList = $parentNode->getElementsByTagName($tagName);
    for ($i = 0; $i < $childNodeList->length; $i++) {
        $temp = $childNodeList->item($i);
        if (stripos($temp->getAttribute('class'), $className) !== false) {
            $nodes[]=$temp;
        }
    }
    return $nodes;
}
$url = "https://www.triplinx.ca/en/NextDeparture/NearByNextDeparture?stopId=$stop";
$code = file_get_contents($url,false,$context); 
$routes=array(); $times=array();$rid=0;
$dom = new DOMDocument();
@$dom->loadHTML($code);
$list_node=$dom->getElementsByTagName("ul")->item(0);
$items=$list_node->getElementsByTagName("li");
for($i=0;$i<$items->length;$i++)
{
  $thisraw = $items->item($i)->nodeValue;
  // set route - we are in the outer LI
  $token1 = "Bus"; $pos1 = strpos($thisraw,$token1)+strlen($token1)+2;
  $token2 = "Show"; $pos2 = strpos($thisraw,$token2)-2;
  $rawroute = substr($thisraw,$pos1,$pos2-$pos1);
  if (strlen($rawroute) <= 5) 
  {$thisroute=$rawroute; }
  else // we are in the inner LI, so get time for the already set route
  {
 $thisattr=0; $thisattr += (strpos($thisraw,"real time")!==FALSE) ? 1 : 0; // real time attr
  $attr[] = $thisattr;    
  $diritem = getElementsByClass($items->item($i),"span","next-departure-label")[0];
  $direction = $diritem->nodeValue;
        $direction = str_replace("towards ","", $direction);
  $routes[] = $thisroute . substr($direction,1,1) . " ";
  $timitem = getElementsByClass($items->item($i),"span","next-departure-duration")[1];
        $thistime = $timitem->nodeValue;
        $thistime = str_replace("<", "",$thistime); // filter < 1 minute:: gone anyways 
        $thistime = str_replace("<", "",$thistime);
        $entity = htmlentities($thistime, null, 'utf-8');
        $thistime = str_replace(" ", " ", $entity); 
        $thistime = html_entity_decode($thistime);
 $times[] = $thistime;
  $rid++;
  }
}
where I borrowed the function getElementsByClass() from StackOverflow; the snippet aimed at getting rid of all  's was likewise borrowed from there.
As a result I was now getting an array of bus routes and bus times in string arrays $routes[] and $times[] respectively. However, I still needed to sort my buses by departure time rather than by route as in the screenshot above; the trick is that some of the times are given as "5 min" while some others are like "2:35 pm".
I got around the problem by determining the "number of minutes till all departures" and storing it in $nummin[], like so
date_default_timezone_set('US/Eastern');
$rawtime = new DateTime();
$rid = (($rid>6)?6:$rid);
for ($j=0; $j<$rid; $j++)
{
  if (strpos($times[$j],"min")!==false)
  {
   $nummin[$j] = intval(str_replace("min","",$times[$j]));
  }
  else
  {
   $temptime = new DateTime($times[$j]);
   $fulltime = getdate($temptime->getTimestamp());
   $temptime->setDate($fulltime['year'],$fulltime['mon'],$fulltime['mday']);
   $nummin[$j] = intval(($temptime->getTimestamp() - $rawtime->getTimestamp())/60);
   if($nummin[$j]<0)$nummin[$j]+=(24*60); // add a day if needed
  } 
}
and then using it to bubble sort all three arrays (6 elements aren't worth doing anything more sophisticated):
for ($i=0; $i<$rid; $i++)
 for ($j=0; $j<$i; $j++)
  if($nummin[$i]<$nummin[$j])
  {
   $temproute = $routes[$i]; $routes[$i] = $routes[$j]; $routes[$j] = $temproute;
   $temptime = $times[$i]; $times[$i] = $times[$j]; $times[$j] = $temptime;
   $tempmin = $nummin[$i]; $nummin[$i] = $nummin[$j]; $nummin[$j] = $tempmin;
  }
For output, we would need the reverse operation, i.e. converting "in 5 minutes" to a valid departure time. The reason is that we can't afford to pull the timetable every single minute, so labels like "in 5 min" would very soon mean "in 5 minutes, as of 3 minutes ago" which isn't very convenient to use. To get around this confusion, I have employed the following trick:
for ($j=0; $j<$rid; $j++)
 {
  if (strpos($times[$j],"min")!==false)
  {
   $live = $times[$j];
   $mins = intval(substr($live,0,strpos($live," min")));
   $newtime = (clone $rawtime);
   $newtime->modify("+{$mins} minutes");
   $bustime = date("H:i",$newtime->getTimeStamp());
   $times[$j] = $bustime . " (" . $live . ")";
  } 
 }
if (strlen($times[0])<4) $times[0]= $timestamp." (now)";
Note the line with $newtime = clone $rawtime. It is very important that $newtime is cloned, otherwise $newtime->modify(...) will modify $rawtime and our code will work, but will produce a wrong timetable! Also note that the last line is a patch to ensure that "<1 min" is captured as "now" regardless of possible parsing errors upstream (spoiler: there are errors upstream).
As an afterthought, as I have both the number of minutes and time for all departures, I decided to output them both, breaking up the elements and giving them distinct IDs for future access. Here's how:
$css= ' style="font-weight: bold; color:rgb(255,128,128);"';
for ($j=0; $j<$rid; $j++)
{
 if (strpos($times[$j],"(")!==false)
 {$times[$j] = str_replace('(','</span><span id="busreal'.$j.'" '.$css.'>(',$times[$j]); }
 else { $times[$j] =    $times[$j].'</span> <span id="busreal'.$j.'" '.$css.'>('.$nummin[$j].' min)';}
 if ($j==0) $css=str_replace(';"','; font-size: 14px;"',$css);
}
. . .
echo '<span id="bustoken" style="background: ',$color,'; color:white;">', " ".$routes[0], '</span>',  '<span id="bustime0" style="font-weight: bold; color:',$color,'">', " ", $times[0], '</span>', '  <span id="realtime" style="color: ',$color,'; ">[...]</span>';
if ($rid > 1)
{echo "<br/>";
 echo '<span style="color:rgb(255,235,235);">', $timestamp, ' </span>';
 for ($j=1; $j<$rid; $j++)
  {echo '<span id="busnum'.$j.'" style="background: rgb(255,128,128); color:white; font-size: 14px;">', " ".$routes[$j], '</span>',  '<span 
   id="bustime',$j,'"  style="font-weight: bold; color: rgb(255,128,128);font-size: 14px;">', " ", $times[$j], '</span>  ';
  };
} 
So, here is the final result:
And this is how it looks like in an embedded form:
What's with all the different colours, you ask, what does "leave in..." mean, and why is formatting so different? I bet you guessed it: the 16:31 bus is already too soon to catch, and the 16:33 can be caught just barely, if you leave now and make haste. In my next post, I am going to describe how I achieved this on the client side.




 
No comments:
Post a Comment